Ways to accelerate 256-bit arithmetic with 64-bit ops?

NotATether

legendary

Activity: 1568

Merit: 6660

bitcoincleanup.com / bitmixlist.org

Quote from: bigvito19 on April 06, 2021, 08:01:24 AM

How is this coming along?

I didn't get a chance to try this yet.

bigvito19

full member

Activity: 706

Merit: 111

Quote from: NotATether on April 01, 2021, 01:29:12 AM

Looks like it works for subtraction too.

I suppose that to chain several multiplications I could use 8 blocks with only the lower 31 bits of each used. The remaining bits needed to reach 256 go in the highest block, since I don't have to worry about overflow there, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.

How is this coming along?

NotATether

Looks like it works for subtraction too.

I suppose that to chain several multiplications I could use 8 blocks with only the lower 31 bits of each used. The remaining bits needed to reach 256 go in the highest block, since I don't have to worry about overflow there, and that'll let me do 1 multiplication without a carry.

For 2 multiplications I'd use the lower 30 bits, for 4 I'd use 29 and so on. This'll allow me to mix many additions/subtractions and multiplications at once and go without carrying for a while.
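A quick way to sanity-check limb widths like these is to count the bits a multiplication column can need. The sketch below is only an illustration and rests on an assumption that is not stated in the post: a schoolbook multiplication where each output column sums up to k products of two b-bit limbs inside one accumulator. The helper name column_bits is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the bits needed to hold the sum of k products of two
 * limb_bits-wide limbs: each product is below 2^(2*limb_bits), so the sum
 * is below 2^(2*limb_bits + ceil(log2(k))).
 * Hypothetical helper, not taken from any library. */
static unsigned column_bits(unsigned limb_bits, unsigned k) {
    assert(k >= 1);
    unsigned extra = 0;
    while ((1u << extra) < k) {
        extra++;                 /* extra = ceil(log2(k)) */
    }
    return 2 * limb_bits + extra;
}
```

Under that assumption, column_bits(31, 8) comes out at 65, one bit more than a 64-bit accumulator holds, while column_bits(30, 8) is 63. If only a single limb product is formed at a time, 31-bit limbs leave 2 spare bits in a 64-bit word.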

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: NotATether on March 31, 2021, 04:55:07 AM

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32 bits.

256/5 ≈ 52 (51.2, rounded up). The idea is that you add an extra word, and now you have extra space: each 52-bit digit sits in a 64-bit word with 12 spare bits. You can now make 12 additions of these 52-bit digits in a row without processing carries.

The carry processing would still be mostly sequential, but you delay doing it. So you can add totally in parallel a bunch of times and then after that propagate the carries all at once.

E.g. look at the FE add function in libsecp256k1:

https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52_impl.h#L405

Code:

SECP256K1_INLINE static void secp256k1_fe_add(secp256k1_fe *r, const secp256k1_fe *a) {
    r->n[0] += a->n[0];
    r->n[1] += a->n[1];
    r->n[2] += a->n[2];
    r->n[3] += a->n[3];
    r->n[4] += a->n[4];
}

(with the verification code removed)

-- it's embarrassingly parallel.
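For the other half of the idea, propagating the carries all at once after a run of additions, here is a minimal sketch. It is not the libsecp256k1 code: it treats the number as a plain 260-bit integer with no modular reduction, and the names limbs5x52, limbs_add and limbs_propagate are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_BITS 52
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Hypothetical stand-in for a 5x52 number; n[0] is the least significant limb. */
typedef struct { uint64_t n[5]; } limbs5x52;

/* Limb-wise add with no carry handling; every limb is independent. */
static void limbs_add(limbs5x52 *r, const limbs5x52 *a) {
    for (int i = 0; i < 5; i++) {
        r->n[i] += a->n[i];
    }
}

/* One sequential pass that pushes everything above bit 52 of each limb
 * into the next limb. Returns whatever spills past the top limb. */
static uint64_t limbs_propagate(limbs5x52 *r) {
    uint64_t carry = 0;
    for (int i = 0; i < 5; i++) {
        /* Holds as long as the deferred additions stayed within the headroom. */
        assert(r->n[i] <= UINT64_MAX - carry);
        uint64_t t = r->n[i] + carry;
        r->n[i] = t & LIMB_MASK;   /* keep the low 52 bits */
        carry = t >> LIMB_BITS;    /* excess moves up one limb */
    }
    return carry;
}
```

Usage would be something like calling limbs_add up to 12 times on the same accumulator and then calling limbs_propagate once, instead of handling a carry after every addition.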

NotATether

Quote from: gmaxwell on March 30, 2021, 08:06:33 PM

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

I'm curious though, why 52 bits in particular? For addition/subtraction the safe zone where digits don't clobber carry bits and you can distinguish overflow from a real result is the lower 63 bits, and for multiplication it's the lower 32 bits.

I'm still not sure how to process the carry bits simultaneously though. Presumably I could load the first two [for example] of these "words" at once using something like _mm_set_epi64x(bits52[1], bits52[0]) and add another pair built the same way with _mm_add_epi64. The problem is how to move the carry bits over to the next words without a bunch of performance-killing _mm_extract_* instructions: I would have to extract the carry bits for each number one by one instead of in parallel, which is possibly slower than what I'm using now:

Code:

// c = a + b (needs <immintrin.h>; the first call takes a carry-in of 0)
unsigned char carry = _addcarry_u64(0, a.bits64[0], b.bits64[0], c.bits64 + 0);
carry = _addcarry_u64(carry, a.bits64[1], b.bits64[1], c.bits64 + 1);
carry = _addcarry_u64(carry, a.bits64[2], b.bits64[2], c.bits64 + 2);
carry = _addcarry_u64(carry, a.bits64[3], b.bits64[3], c.bits64 + 3);

Not counting loads & stores, this uses 4 ADD instructions and ideally I would like a way to "add" 2 or 4 of these words in one instruction, a second instruction to shift the carry bits one 64-bit (or 32-bit) word to the left and then a third instruction that adds the carry bits to the subsequent words.
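The add / shift / add pattern described here can be sketched in portable C on 52-bit limbs, written as three loops whose iterations do not depend on each other so a compiler is free to vectorize them. This is only an illustration under assumptions: 5 limbs of 52 bits, no modular reduction, and made-up names add_limbs and carry_pass. With intrinsics, the shift and mask would presumably map to something like _mm_srli_epi64 and _mm_and_si128, but moving the carries over by one lane still needs a shuffle or permute, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_BITS 52
#define LIMB_MASK ((UINT64_C(1) << LIMB_BITS) - 1)

/* Step 1: add all limbs at once, ignoring carries. */
static void add_limbs(uint64_t r[5], const uint64_t a[5], const uint64_t b[5]) {
    for (int i = 0; i < 5; i++) {
        r[i] = a[i] + b[i];
    }
}

/* Steps 2 and 3: pull the excess bits out of limbs 0..3 in parallel, then
 * add each one to the limb above. The top limb keeps its own excess.
 * A single pass restores headroom but does not fully normalize: a limb
 * that receives a carry can end up slightly above 52 bits again. */
static void carry_pass(uint64_t n[5]) {
    uint64_t carry[4];
    for (int i = 0; i < 4; i++) {
        carry[i] = n[i] >> LIMB_BITS;   /* extract carries, lanes independent */
    }
    for (int i = 0; i < 4; i++) {
        n[i] &= LIMB_MASK;              /* clear the bits just extracted */
    }
    for (int i = 0; i < 4; i++) {
        assert(n[i + 1] <= UINT64_MAX - carry[i]);
        n[i + 1] += carry[i];           /* add carries shifted one limb up */
    }
}
```

The trade-off is that a carry rippling through several limbs needs several passes, so this only pays off if the limbs are allowed to stay loosely normalized between operations.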

gmaxwell

Quote from: NotATether on March 30, 2021, 07:07:59 PM

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.

The much easier way to parallelize it is to arrange for things to run 4 independent 256 bit operations in parallel.

(unfortunately, SSE2 etc actually only give you 32 bit multipliers which is a huge performance punch in the gut)

If you can't get multiple independent 256 bit operations in parallel the next alternative is to use deferred carries. Currently your 256 bit number is represented as 4 64-bit 'digits'. If you instead represent it as 5 52-bit digits then you can perform 12 successive additions without overflowing and then process the carries afterwards.

The 64-bit field code in libsecp256k1 works this way.
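To make the first suggestion concrete, here is a minimal sketch of running 4 independent 256-bit additions side by side. It assumes a layout where limb i of all four numbers is stored contiguously, so the carry chain still runs limb by limb but every step handles 4 lanes that do not depend on each other. The type u256x4 and the function u256x4_add are made up for illustration, and the carry out of the top limb is simply dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Four 256-bit numbers stored limb-major: limb[i][lane] is 64-bit limb i
 * (least significant first) of number `lane`. Hypothetical layout. */
typedef struct { uint64_t limb[4][4]; } u256x4;

/* r = a + b for all four lanes, modulo 2^256. */
static void u256x4_add(u256x4 *r, const u256x4 *a, const u256x4 *b) {
    assert(r && a && b);
    uint64_t carry[4] = {0, 0, 0, 0};
    for (int i = 0; i < 4; i++) {
        /* The inner loop has no dependency between lanes. */
        for (int lane = 0; lane < 4; lane++) {
            uint64_t x = a->limb[i][lane];
            uint64_t s = x + b->limb[i][lane];
            uint64_t c1 = s < x;            /* carry out of x + y */
            uint64_t t = s + carry[lane];
            uint64_t c2 = t < s;            /* carry out of adding carry-in */
            r->limb[i][lane] = t;
            carry[lane] = c1 | c2;          /* at most one of them is set */
        }
    }
}
```

This keeps the usual 4 x 64-bit representation, so it only helps when the surrounding algorithm really has four independent operations available at the same time.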

NotATether

I am working on some code (Jean Luc PONS kangaroo program to be specific) that currently stores 256-bit ints in blocks of 64-bit integers and sequentially adds/subtracts/multiplies each block using carry bits.

I wanted to vectorize these operations but the carry bits prevent me from doing so. Hardware acceleration (SSE/AVX in particular) will only operate on blocks of 64-bit ints.

Does anyone know any algorithms floating around for basic arithmetic like addition et al that don't need the carry bit, and can be scaled up to pairs of 4 blocks of 64-bit ints?

256-bit numbers are used to represent public and private keys.
