secp256k1 library and Intel cpu | Bitcointalksearch.org

mrx23dot

newbie

Activity: 2

Merit: 0

Hi All,

What's the current fastest version of secp256k1_fe_mul for i5/i7?
Currently that's my bottleneck on secp256k1_unsafe lib. Which has the same asm as the official.

One of the ASM in this thread caused segfault for me, the other was slower than the original.

% cumulative self self total
time seconds seconds calls ms/call ms/call name
46.24 4.67 4.67 235910453 0.00 0.00 secp256k1_fe_mul_inner
13.71 6.06 1.39 88835758 0.00 0.00 secp256k1_fe_sqr_inner
10.05 7.07 1.02 2 507.51 507.51 sha256_transf

Jude Austin

legendary

Activity: 1140

Merit: 1000

The Real Jude Austin

Quote from: piotr_n on January 16, 2017, 05:24:34 PM

You are not a wallet thief, man - you've done nothing wrong in that matters.
I am also not a wallet thief, even though the man has also somehow accused me of such.

We are all dicks though.

If I was to be upset by all the names that different kind of dicks call me, I'd have been upset all me life.
It's a public forum, people have different characters which are going to clash and you have to know how to deal with it - especially if you are older than them.
But that is not the point.

The point is that I have been here for years and if someone asked me today how I have contributed into an actual "technical development" of bitcoin by my activity on this "Development & Technical Discussion" forum, I'd have said: I haven't at all.
Although I have certainly made my dick bigger here.
And the sad part is that everybody seems to be here for that reason.

We are all definitely dicks, lol.

HostFat

staff

Activity: 4270

Merit: 1209

I support freedom of choice

I'm not sure if you will find them better, but on bitco.in/forum has some places for technical discussions. Maybe, or maybe not, you can find other opinions/help about this and other objects.

piotr_n

legendary

Activity: 2058

Merit: 1421

aka tonikt

You are not a wallet thief, man - you've done nothing wrong in that matters.
I am also not a wallet thief, even though the man has also somehow accused me of such.

We are all dicks though.

If I was to be upset by all the names that different kind of dicks call me, I'd have been upset all me life.
It's a public forum, people have different characters which are going to clash and you have to know how to deal with it - especially if you are older than them.
But that is not the point.

The point is that I have been here for years and if someone asked me today how I have contributed into an actual "technical development" of bitcoin by my activity on this "Development & Technical Discussion" forum, I'd have said: I haven't at all.
Although I have certainly made my dick bigger here.
And the sad part is that everybody seems to be here for that reason.

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: piotr_n on January 16, 2017, 04:26:15 PM

I really wish this forum would sometimes happen to be about an actual technical discussion and not always about who has a bigger dick.

How can that be if it's moderated by a bearded dwarf (with the according dick size) acting out his moderator power fantasies.
I mean look what happened:

I contributed to secp256k1, but then I made the fatal mistake to criticize it and the even more fatal mistake to criticize his holiness Maxwell the 1st.

Immediately I am - according to treebeard extraordinaire - "A wallet thief", "incompetent" and what not.
Plus of course the usual negative trust rating he gives in these cases. Yay - my 1st negative trust rating here ever.

How can you have a technical discussion, if the gnome here is swinging his sceptre. Not benevolently I might add if you happen to not wank his ego.

Then technical discussion happens via PM (thanks AlexGR et al.) or elsewhere and result in non-public solutions.

Yep guys - there you have it: ~~Your glorious core developer deity barring people from bringing their ideas/work in.~~ Sorry, not exact. *Taking their contribution and fuck them over afterwards.* Yep more like it.

Rico

piotr_n

legendary

Activity: 2058

Merit: 1421

aka tonikt

I really wish this forum would sometimes happen to be about an actual technical discussion and not always about who has a bigger dick.

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: gmaxwell on January 16, 2017, 10:27:16 AM

Quote from: rico666 on January 16, 2017, 01:11:33 AM

Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.

"Violence is the last refuge of the incompetent."

Interesting. So real men are incompetent in your eyes. I thought real men (including intelligence) have a beer.

My bad. Assuming intelligence. Won't happen again. Sorry.

Rico

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: rico666 on January 16, 2017, 01:11:33 AM

Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.

"Violence is the last refuge of the incompetent."

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: gmaxwell on January 15, 2017, 01:05:56 PM

"I started making keys, starting with ones with fewest cuts and systematically working through all possibilities. To learn if these keys matched any that had been used in the past, I tried each one in every door in the neighborhood. After a bit I found a few valuables. What was I supposed to do, leave them there?"

Yeah. I had lot's of these discussions. Your comparison doesn't apply - even remotely.

"I started taking walks in the park - systematically taking paths to cover the whole area. From time to time I find some coins. What am I supposed to do, leave them there?"

The doors in the neighborhood have names on them. And yes, even "for finds in the park" rules apply. We adhere to them.

You are lucky, this night the pool found something again. The funds are still on the address. What would be your take on this now?

It's a rhetoric question, I do not really need your input. As promised I slept over our - for me yesterdays - "conversation". I guess I'll leave the lawyers in their box this time. Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.

Rico

gmaxwell

staff

Activity: 4326

Merit: 8951

"I started making keys, starting with ones with fewest cuts and systematically working through all possibilities. To learn if these keys matched any that had been used in the past, I tried each one in every door in the neighborhood. After a bit I found a few valuables. What was I supposed to do, leave them there?"

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: gmaxwell on January 15, 2017, 11:20:21 AM

A normal move is constant time but not conditional, so "constant time move" wouldn't make much sense.

It was a hint to the mnemonics used. "cctmov ctcmov ..."

But I understand the importance of constant time now better (especially why https://github.com/llamasoft/secp256k1_fast_unsafe proclaims itself as being unsafe).

Quote

If so, only by a highly inefficient means. hash160 is a hash function, there is no EC involved at all. If you were merely attempting to find a hash160 collision you'd simply need to run hash160. You also wouldn't have any reason to check your results against assets in the bitcoin chain... You might also use an efficient collision finding algorithm instead of brute force.

I know the hash160 is only the RIPEMD160(SHA256(x)) part of the generation process. So maybe I'm just missing the nomenclature here for "full generation chain collision". Unfortunately such a thing is not mentioned here https://bitcointalksearch.org/topic/reward-offered-for-hash-collisions-for-sha1-sha256-ripemd160-and-other-293382

The reason why it's checking against the funds is explained in the text I referenced.

Quote

If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence

You stole funds in this transaction: https://blockchain.info/tx/e094692e7d198500480ff5de3d6816e5054708bdea77f3c7db2fc3263f776b75

Ah. And I did so in "full disclosure", haven't touched the funds on the custody address and if you point me to the rightful owner you are aware he gets not only his funds back, but also quite a bounty from RyanC and I believe a round-up from myself?

I mean do you even think before you write some things? Right now I am

doing the LBC project as a hobby
never took any money/donation for it (although offered) and don't intend to
except the users testing, never wanted anyone to "slavishly" contribute any work
on the contrary, I contributed off-spins from this to other projects (secp256k1, Linux Kernel - which will have a faster RIPEMD160 because of me)

Actually, it looks like I have the fastest RIPEMD160 CPU implementation at the moment.

I can accept, that you call the LBC project "inefficient", "badly designed" or whatever as response to my criticism of the libsecp256k1, but pulling the "you're a thief" card is really low.

If you consider transferring the 0.00778379 BTC funds to custody "stealing", you should really work on your judgement. What should I have had done instead? Leave them alone, just saying "Pool has found something, but we won't tell you what and where" Huh

With the "pool search front" being quite visible even this announcement would lead to jeopardizing the funds. So what then? Not doing it at all? Maybe you need more poeple to tell you to stop working in the Crypto world and breed sheep instead.

If you have suggestions for an escrow for the finds, come forth.

Quote

And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.

Bring it on.

I see. Let me sleep a night over it and then decide if educating you is worth the effort.

Rico

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: rico666 on January 15, 2017, 10:56:23 AM

Ok. Hasn't explained why cmov implementation looks like it does, because right now - after this explanation - it seems to me cmov is short for "constant time move"

It is a conditional move, and does exactly what the comment says it does: "If flag is true, set *r equal to *a; otherwise leave it. Constant-time.". The construction used for it is one that results in constant time behavior on our target systems, isn't undermined by the compiler, and which is relatively fast. A normal move is constant time but not conditional, so "constant time move" wouldn't make much sense.

Quote

I am trying to find a hash160 collision.

If so, only by a highly inefficient means. hash160 is a hash function, there is no EC involved at all. If you were merely attempting to find a hash160 collision you'd simply need to run hash160. You also wouldn't have any reason to check your results against assets in the bitcoin chain... You might also use an efficient collision finding algorithm instead of brute force.

Quote

If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence

You stole funds in this transaction: https://blockchain.info/tx/e094692e7d198500480ff5de3d6816e5054708bdea77f3c7db2fc3263f776b75

Quote

And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.

Bring it on.

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: gmaxwell on January 15, 2017, 10:25:27 AM

Yes, we've already established that you're primarily interested in stealing from people and that you're unable to understand that many others are not exactly supportive of your goals.

You are aware, that this is already beyond the threshold of legal action?

I am trying to find a hash160 collision. If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence or keep it near when the lawyer knocks on your door.

Until then, you may educate yourself: https://lbc.cryptoguru.org/man/theory

And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.

Quote

You seem to think that people who don't slavishly help you steal from others are bad or incompetent people, or at least that you can get them to help you by insulting them as though they were. For someone who claims to be so old, you sure don't seem very wise.

Final warning. I can benevolently assume, that your reactions are based on the false assumption I am someone who does or wants to steal from people. Under such an assumption I would probably also interact "harshly" with my counterpart. So I give you that. But unless you have the slightest proof for your allegation/assumption, the time to stop it has come.

Quote

Congrats, by ignoring the _constant time_ in the function documentation you just leaked the users secret to a timing side-channel-- and _still_ managed to end up with code considerably slower than simply changing to non-constant time competently.

Ok. Hasn't explained why cmov implementation looks like it does, because right now - after this explanation - it seems to me cmov is short for "constant time move" - which would lead us to my original *thumbs up*...

Rico

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: rico666 on January 15, 2017, 09:00:26 AM

Well - it isn't for me, right now I am interested in generation performance.

Yes, we've already established that you're primarily interested in stealing from people and that you're unable to understand that many others are not exactly supportive of your goals.

You seem to think that people who don't slavishly help you steal from others are bad or incompetent people, or at least that you can get them to help you by insulting them as though they were. For someone who claims to be so old, you sure don't seem very wise.

Quote

my advanced intuition tells me (1) is the magnitude of the result,

Yes, the comments trace the magnitude of the results; and tie the code back to the algebraic verification of the functions (in sage/). There is no point in a comment that merely repeats the code.

Quote

Code:

if (degenerate) {
n = m;
}
else {
secp256k1_fe_sqr(&n, &n);
}

Hm. Works. Is faster. Tests says "no problems found". And now the code makes sense.
Of course it is only a trivial thing. Probably not worth the effort. Right?

So conditional move you say?

Code:

/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

yes? Why do I see then this epic kludge in the code?

Congrats, by ignoring the _constant time_ in the function documentation you just leaked the users secret to a timing side-channel-- and _still_ managed to end up with code considerably slower than simply changing to non-constant time competently.

rico666

legendary

Activity: 1120

Merit: 1037

฿ → ∞

Quote from: gmaxwell on January 14, 2017, 10:14:27 AM

It's a trivial optimization that already existed in three other places in the code. Thanks for noticing that it hadn't been performed there, and providing code for one of the two places it needed to be improved, but come on-- you're just making yourself look foolish here with the rude attitude while you're clearly fairly ignorant about what you're talking about overall. Case in point:

...verification ... verification ... verification

Gee - you haven't seen my rude attitude yet. And you won't, because I am honestly too old for these cockfights.
It is evident you are completely project blind. Verification is king. Well - it isn't for me, right now I am interested in generation performance. Now the big revelation dear Maxwell:

This is what libraries are about.

A library(sic!) is not about serving just the purpose of sci-fi fans. And a librarian who is a sci-fi fan and is carried away by his fandom in a way he bars or neglects other uses of the library, is a BAD librarian. Don't be a bad librarian.

Your reaction was expected. Now my "trivial" optimization is hardly worth it. Hey - it makes the relevant functions faster by 6-7 orders of magnitude, but as it brings only 0.5% for DA HOLY VERIFICATION ... nah. Elegantly this also allows you to sweep under the carpet the question why the rubbish code remained for so long in the lib. "Not worth it" I suppose... where did I read that? Roll Eyes

edit:

Quote from: gmaxwell on January 14, 2017, 10:14:27 AM

Odd to hear that, since the libsecp256k1 code has received many positive comments about its clarity and origination. One researcher described it as the cleanest production ECC code that he'd read.

Style preferences differ, of course, -- lets see what your own code looks like

ftp://ftp.cryptoguru.org/LBC/client/LBC

So you take my intentionally deformatted code as counterexample of coding style ... *gasp* I don't think that futile attempt does even qualify as pathetic. Let's see a snapshot from my editor.

https://i.imgur.com/JWSJlee.png

etc. Better luck next time.
/edit

Quote

But what is strange is that it is _extensively_ documented.

Ok. Let's take this piece of code and dissect it a little bit - shall we? I am fair and take what is supposedly one of the best commented pieces "secp256k1_gej_add_ge"

Code:

    secp256k1_fe_cmov(&rr_alt, &rr, !degenerate);
    secp256k1_fe_cmov(&m_alt, &m, !degenerate);
    /* Now Ralt / Malt = lambda and is guaranteed not to be 0/0.
     * From here on out Ralt and Malt represent the numerator
     * and denominator of lambda; R and M represent the explicit
     * expressions x1^2 + x2^2 + x1x2 and y1 + y2. */
    secp256k1_fe_sqr(&n, &m_alt);                       /* n = Malt^2 (1) */
    secp256k1_fe_mul(&q, &n, &t);                       /* q = Q = T*Malt^2 (1) */
    /* These two lines use the observation that either M == Malt or M == 0,
     * so M^3 * Malt is either Malt^4 (which is computed by squaring), or
     * zero (which is "computed" by cmov). So the cost is one squaring
     * versus two multiplications. */
    secp256k1_fe_sqr(&n, &n);
    secp256k1_fe_cmov(&n, &m, degenerate);              /* n = M^3 * Malt (2) */

So Malt is m_alt and Ralt is rr_alt - yeah sure, why not.

What's more interesting is the correspondence between the code and the comments.

q = Q = T*Malt^2 (1)

WTF is Q and what is (1)? Oh - my advanced intuition tells me (1) is the magnitude of the result, but I could be wrong - some time to verify that needed. So the comment does not correspond to the code. The comment corresponds to the sum (not meant in an arithmetic sense) of the previous codes - it seems. Interesting... But this is all only prologue.

Let's look at the last two lines of the code snippet:

So the lib computes n = n², but if degenerate is true this computation is discarded as n is set to m. It claims in the aforementioned "documentation" this is a very witty thing to do.

Methinks. I could be wrong. If I am right, the code is pretty bad, if I am wrong, I've been misled after all I know about the code (say 20 hours diddling with it in total). Personally, I'd compute the sqr only in case !degenerate. So the cost would actually be one squaring - sometimes - versus nothing. And this is just one tiny little example of a myriad of weirdnesses all around the code.

So given what the perfect documentation tells me cmov being conditional move, the code would have been better written as:

Code:

if (degenerate) {
n = m;
}
else {
secp256k1_fe_sqr(&n, &n);
}

Hm. Works. Is faster. Tests says "no problems found". And now the code makes sense.
Of course it is only a trivial thing. Probably not worth the effort. Right?

So conditional move you say?

Code:

/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

Code:

if (flag) {
*r = *a;
}

yes? Why do I see then this epic kludge in the code?

Code:

static SECP256K1_INLINE void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag) {
    uint64_t mask0, mask1;
    mask0 = flag + ~((uint64_t)0);
    mask1 = ~mask0;
    r->n[0] = (r->n[0] & mask0) | (a->n[0] & mask1);
    r->n[1] = (r->n[1] & mask0) | (a->n[1] & mask1);
    r->n[2] = (r->n[2] & mask0) | (a->n[2] & mask1);
    r->n[3] = (r->n[3] & mask0) | (a->n[3] & mask1);
    r->n[4] = (r->n[4] & mask0) | (a->n[4] & mask1);
#ifdef VERIFY
    if (a->magnitude > r->magnitude) {
        r->magnitude = a->magnitude;
    }
    r->normalized &= a->normalized;
#endif
}

Let's continue with the "extensive documentation":

Code:

    secp256k1_fe_sqr(&t, &rr_alt);                      /* t = Ralt^2 (1) */
    secp256k1_fe_mul(&r->z, &a->z, &m_alt);             /* r->z = Malt*Z (1) */
    infinity = secp256k1_fe_normalizes_to_zero(&r->z) * (1 - a->infinity);
    secp256k1_fe_mul_int(&r->z, 2);                     /* r->z = Z3 = 2*Malt*Z (2) */
    secp256k1_fe_negate(&q, &q, 1);                     /* q = -Q (2) */
    secp256k1_fe_add(&t, &q);                           /* t = Ralt^2-Q (3) */

Hm. Ah. There is our Q. We assume it is an alias for q. Why - who knows? Makes things more interesting.
Again, we have our "the comment is the sum of previous code" and suddenly we use the alias only. Yeah why not.
Best of all is to be able to have a subtraction in the comment, when the code says "add". This is way more thrill.

And allow me for a final remark:

Quote

... even though this is purely internal code and is not accessible to an end user of the library.

WTF? The "end user of the library" is of exactly 0 (in words: zero) interest to our discussion.

Hint: All this was only a remark. My energy is probably better spent in setting up my own FE arithmetics and a complete reimplementation of the GE API which as is absolutely ignores the natural flow of the data and need for MULADD, resp SETSUM 3-ARG operations. Don't worry, I won't disturb the intellectual incest any more. It could be considered rude.

Good luck.

Rico

AlexGR

legendary

Activity: 1708

Merit: 1049

Quote from: gmaxwell on January 14, 2017, 11:49:28 AM

Neat. You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance and spend a lot of their time on random things. Compile with --enable-benchmarks and use the benchmarks.

A quick check on i7-4600U doesn't give a really clear result:

Before: field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us field_mul: min 0.116us / avg 0.116us / max 0.117us field_inverse: min 25.2us / avg 25.7us / max 28.5us field_inverse_var: min 13.8us / avg 13.9us / max 14.0us field_sqrt: min 24.9us / avg 25.0us / max 25.2us ecdsa_verify: min 238us / avg 238us / max 239us After (v1): field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us field_mul: min 0.117us / avg 0.117us / max 0.117us field_inverse: min 25.4us / avg 25.5us / max 25.9us field_inverse_var: min 13.7us / avg 13.7us / max 14.0us field_sqrt: min 25.1us / avg 25.3us / max 26.1us ecdsa_verify: min 237us / avg 237us / max 237us After (v2): field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us field_mul: min 0.118us / avg 0.118us / max 0.119us field_inverse: min 25.9us / avg 26.0us / max 26.4us field_inverse_var: min 13.6us / avg 13.7us / max 13.8us field_sqrt: min 25.6us / avg 25.9us / max 27.8us ecdsa_verify: min 243us / avg 244us / max 246us

Hmm... interesting how different architectures are affected. Unless you are underclocked, I think for that particular cpu the times are pretty slow - is there any debugging or performance-logging framework running on top of this that creates overhead, distorting the performance? (Although I do expect newer chips to have better schedulers). Realistically, you should be quite faster than me. (my lib is with ./configure -enable-benchmark and gcc default flags, no endomorphism).

For comparison (q8200 @ 1.86)

Before:
field_sqr: min 0.0680us / avg 0.0681us / max 0.0683us
field_mul: min 0.0833us / avg 0.0835us / max 0.0841us
field_inverse: min 18.5us / avg 18.6us / max 18.8us
field_inverse_var: min 6.32us / avg 6.32us / max 6.33us
field_sqrt: min 18.4us / avg 18.6us / max 18.9us
ecdsa_verify: min 243us / avg 243us / max 245us

(v1)
field_sqr: min 0.0654us / avg 0.0660us / max 0.0667us
field_mul: min 0.0819us / avg 0.0822us / max 0.0825us
field_inverse: min 18.4us / avg 18.4us / max 18.5us
field_inverse_var: min 6.35us / avg 6.36us / max 6.37us
field_sqrt: min 18.4us / avg 18.4us / max 18.5us
ecdsa_verify: min 235us / avg 236us / max 237us

(v2)
field_sqr: min 0.0660us / avg 0.0675us / max 0.0679us
field_mul: min 0.0858us / avg 0.0861us / max 0.0862us
field_inverse: min 18.8us / avg 18.8us / max 18.8us
field_inverse_var: min 6.31us / avg 6.31us / max 6.31us
field_sqrt: min 18.5us / avg 18.6us / max 18.7us
ecdsa_verify: min 243us / avg 243us / max 244us

I've always used the benchmarks provided, but I think they may lack real world correlation. My wakeup call was a few months ago I disassembled the bench_internal to see what clang and gcc were doing differently in terms of asm... one was inlining/merging the benchmark and the function to be benchmarked, thus saving the overhead of calling it and distorting the result. I think it was clang which was merging it - and that particular benchmark was faster for it. So I couldn't tell due to this type of distortion which implementation was actually faster. I think it would be a nice addition if we had something like the validation of, say, a given amount of bitcoin blocks (let's say 10-20mb of data loaded in ram) as a more RL-like benchmark.

Btw, I remember having seen a video where you gave a lecture about the library to a university (?) and commenting on the tests of the library, saying something to the effect that perhaps in the future a bounty can be issued about bugs that exist but can't be detected by the tests.

Asm tampering (especially if you try to repurpose rdi/rsi registers) is definitely one of the fields were you can have the test run fine and then have bench_internal or bench_verify abort due to error. Or the opposite (benchmark run ok, test crashes). Or have it be entirely ok in one compiler (test/benchmarks) and then crash in another. This is due to the compiler using the registers differently prior or after the functions in conjuction with other functions, and the same code is some times OK in certain use cases (executables) and crashes in other use cases (different executables), so it's much trickier than C because I have no idea how a test could catch these. After all it can only test it's own execution.

My "manual" testing routine to see if everything is ok, is by going ./tests, ./bench_internal, ./bench_verify. If everything passes, it's probably good. This is not for the 5x52 (which doesn't have unstable code in it) but for my custom made 4x64 impl.h with different secp256k1_scalar_mul and secp256k1_scalar_sqr).

I wanted to put the whole file in so that no cutting and splicing are needed for 2 functions, but the forum notification is bugged (saying I have a post >64kbytes when I don't) so I had to cut down on the text. Anyway...

Code:

static void secp256k1_scalar_mul(secp256k1_scalar *r, const secp256k1_scalar *a, const secp256k1_scalar *b) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    const uint64_t *pb = b->d;
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r15\n"
    "movq 8(%%rdi), %%rbx\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 0(%%rdx), %%r11\n"
    "movq 8(%%rdx), %%r9\n"
    "movq 16(%%rdx), %%r10\n"
    "movq 24(%%rdx), %%r8\n"
    /* (rax,rdx) = a0 * b0 */
    "movq %%r15, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, 0(%%rsi)\n"
    /* (r14,r12,r13) = (rdx) */
    "movq %%rdx, %%r14\n"
    "xorq %%r12, %%r12\n"
    "xorq %%r13, %%r13\n"
    /* (r14,r12,r13) += a0 * b1 */
    "movq %%r15, %%rax\n"
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    /* Extract l1 */
    "movq %%r14, 8(%%rsi)\n"
    "movq $0, %%r14\n"
    /* (r12,r13,r14) += a0 * b2 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a1 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a2 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    /* Extract l2 */
    "movq %%r12, 16(%%rsi)\n"
    "movq $0, %%r12\n"
    /* (r13,r14,r12) += a0 * b3 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Preload a3 */
    "movq 24(%%rdi), %%r15\n"
    /* (r13,r14,r12) += a1 * b2 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r10\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a2 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a3 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l3 */
    "movq %%r13, 24(%%rsi)\n"
    "movq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b3 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a2 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a3 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* Extract l4 */
   /* "movq %%r14, 32(%%rsi)\n"*/
    /* (r12,r13,r14) += a2 * b3 */
    "mulq %%r8\n"
    "movq %%r14, %%r11\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq %%rdx, %%r13\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a3 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    /* Extract l5 */
    /*"movq %%r12, 40(%%rsi)\n"*/
    /* (r13,r14) += a3 * b3 */
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l6 */
    /*"movq %%r13, 48(%%rsi)\n"*/
    /* Extract l7 */
    /*"movq %%r14, 56(%%rsi)\n"*/
    : "+d"(pb)
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
    

      __asm__ __volatile__(
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r12\n" */
   /*"movq 48(%%rsi), %%r13\n" */
  /*   "movq 56(%%rsi), %%r14\n" */
    "movq 0(%%rsi), %%rbx\n"  
    "movq %3, %%rax\n"
    "movq %%rax, %%r10\n"
    "xor %%ecx, %%ecx\n"  
    "xorq %%r15, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r8, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
    "addq 8(%%rsi), %%rcx\n" 
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r15\n"
    "mulq %%r12\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n" 
    "adcq %%r9, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r9, %%r8\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r9, %%r8\n"
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r13\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "movq %4, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "movq %%r10, %%rax\n"
    "movq $0, %%r10\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r8\n"
    "adcq $0, %%r9\n"
    "addq 24(%%rsi), %%r8\n"
    "adcq $0, %%r9\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %4, %%rax\n"  
    "movq %%rax, %%rsi\n"  
    "adcq $0, %%r10\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%r8, %%r12\n" /* q3 into r12*/
    "adcq $0, %%r9\n"
    "movq $0, %%r8\n"
    "movq %%rsi, %%rax\n" 
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq %%r8, %%r8\n"
    "addq %%r9, %%r13\n" /*q4 into r13*/
    "adcq $0, %%r10\n" 
    "adcq $0, %%r8\n" 
    "addq %%r14, %%r10\n" /* q5 into r10 */ 
    "movq %3, %%rax\n"
    "movq %%rax, %%r9\n"
    "adcq $0, %%r8\n" /*q6 into r8*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / R12 as the connecting link
    %q9  input for second operation is %q4 output from first / R13 as the connecting link*
    %q10 input for second operation is %q5 output from first / R10 as the connecting link*
    %q11 input for second operation is %q6 output from first  / R8 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r13\n"
    "xorq %%r14, %%r14\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r14\n"
    "addq %%rcx, %%r14\n" 
    "mov $0, %%ecx\n"  
    "movq %%r9, %%rax\n"
    "adcq %%r11, %%r11\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r13\n"
    "addq %%rax, %%r14\n" /* q1 output */
    "movq %%r9, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r9, %%r9\n"
    "addq %%r15, %%r11\n" 
    "adcq %%r9, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r10\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r13, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r12, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r9\n"
    "addq %%r10, %%rcx\n"    /* q3 output */
    "adcq $0, %%r9\n"
    "movq %%r15, %%rax\n"
    "addq %%r8, %%r9\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / R14 as the connecting link 
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / R9 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r9\n"   
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r8\n"  /* 0(q2) output */
    "movq %%rdx, %%r12\n" 
    "xorq %%r13, %%r13\n"
    "addq %%r14, %%r12\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r13, %%r13\n"
    "mulq %%r9\n"
    "addq %%rax, %%r12\n" /* 8(q2) output */
    "adcq %%rdx, %%r13\n" 
    "xor %%ebx, %%ebx\n"
    "addq %%r9, %%r13\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r14\n"
    "addq %%r11, %%r13\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r8 carries ex 0(%%rdi), 
       r12 carries ex 8(%%rdi),
       r13 carries ex 16(%%rdi), 
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r9\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r10\n"
    "cmp   %%r14 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r13\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r12\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r9,%%r12\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r9b\n"
    "cmp  %%r10,%%r8\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r10b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r10d,%%r9d\n"
    "movzbl %%r9b,%%r9d\n"
    "and   %%r9d,%%ecx\n"
    "xor    %%r9d,%%r9d\n"
    "cmp   %%r14,%%r13\n"
    "sete  %%r9b\n"
    "xor   %%r10d,%%r10d\n"
    "and   %%r9d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r8\n"
    "adcq  %%rdx,%%r10\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r8,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r12,%%rsi\n"
    "adcq %%rdx,%%rdx\n" 
    "addq %%rsi,%%r10\n"
    "movq %%r10,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r13\n"
    "adcq %%rax,%%rax\n" 
    "addq %%r15,%%r13\n"
    "movq %%r13,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory"); 
     

#else
    uint64_t l[8];
    secp256k1_scalar_mul_512(l, a, b);
    secp256k1_scalar_reduce_512(r, l);
#endif   
}

Code:

static void secp256k1_scalar_sqr(secp256k1_scalar *r, const secp256k1_scalar *a) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r11\n"
    "movq 8(%%rdi), %%r12\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 24(%%rdi), %%r14\n"
    /* (rax,rdx) = a0 * a0 */
    "movq %%r11, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, %%rbx\n" /*0(%%rsi)\n"*/
    /* (r8,r9,r10) = (rdx,0) */
    "movq %%rdx, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r10, %%r10\n"
    "xorq %%r8, %%r8\n"
    /* (r8,r9,r10) += 2 * a0 * a1 */
    "movq %%r11, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r15\n" /*8 rsi in r15*/
    "adcq %%rdx, %%r9\n"
    "movq %%r11, %%rax\n"
    "adcq $0, %%r10\n"
    /* Extract l1 */
   /* 8(rsi) in r15*/
    /* (r9,r10,r8) += 2 * a0 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a1 * a1 */
    "mulq %%r12\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    /* Extract l2 */
    "movq %%r9, 16(%%rsi)\n"
    "movq %%r11, %%rax\n"
    "movq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a0 * a3 */
    "adcq $0, %%r8\n"
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a1 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r10, %%r13\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* Extract l3 */
    /*"movq %%r10, 24(%%rsi)\n"*/

    /* (r8,r9,r10) += 2 * a1 * a3 */
    "mulq %%r14\n"
    "xorq %%r10, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a2 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    /* Extract l4 */
    /*"movq %%r8, 32(%%rsi)\n"*/
    "movq %%r8, %%r11\n"
    "movq %%rcx, %%rax\n"
    "movq $0, %%r8\n"
    /* (r9,r10,r8) += 2 * a2 * a3 */
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r14, %%rax\n"
    "adcq $0, %%r8\n"
    /* Extract l5 */
    /*"movq %%r9, 40(%%rsi)\n"*/
 /*   "movq %%r9, %%r12\n"*/
    /* (r10,r8) += a3 * a3 */
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    /* Extract l6 */
    /*"movq %%r10, 48(%%rsi)\n"*/
    /*"movq %%r10, %%rcx\n"*/
    /* Extract l7 */
    /*"movq %%r8, 56(%%rsi)\n"*/
    /*"movq %%r8, %%r14\n"*/
    :
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
        
      __asm__ __volatile__(
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r9\n" */
  /*   "movq 48(%%rsi), %%r10\n" */
  /*   "movq 56(%%rsi), %%r8\n" */
  /*  "movq 0(%%rsi), %%rbx\n"  */
 /*   "movq %%rcx, %%r13\n"*/
    "movq %3, %%rax\n"
    "adcq %%rdx, %%r8\n"
    "mulq %%r11\n"
    "xor %%ecx, %%ecx\n" 
    "xorq %%r12, %%r12\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
 /*   "addq 8(%%rsi), %%rcx\n" */
    "addq %%r15, %%rcx\n" 
    "mov $0, %%r15d\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r15\n"
    "mulq %%r9\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n" 
    "adcq %%r12, %%r14\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r12, %%r14\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r12, %%r14\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r10\n"
    "movq %4, %%rsi\n"  
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r9\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "adcq $0, %%r12\n"
    "movq %3, %%rax\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r14\n"
    "adcq $0, %%r12\n"
    "addq %%r13, %%r14\n"
    "movq $0, %%r13\n"
    "adcq $0, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "movq %%rsi, %%rax\n"  
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "addq %%r14, %%r9\n" /* q3 into r9*/
    "adcq $0, %%r12\n"
    "movq %%rsi, %%rax\n" 
    "movq $0, %%r14\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "adcq %%r14, %%r14\n"
    "addq %%r12, %%r10\n" /*q4 into r10*/
    "adcq $0, %%r13\n" 
    "adcq $0, %%r14\n" 
    "addq %%r8, %%r13\n" /* q5 into r13 */ 
    "movq %3, %%rax\n"
    "movq %%rax, %%r12\n"
    "adcq $0, %%r14\n" /*q6 into r14*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / r9 as the connecting link
    %q9  input for second operation is %q4 output from first / r10 as the connecting link*
    %q10 input for second operation is %q5 output from first / r13 as the connecting link*
    %q11 input for second operation is %q6 output from first  / r14 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r10\n"
    "xorq %%r8, %%r8\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r8\n"
    "addq %%rcx, %%r8\n" 
    "movq %%r12, %%rax\n"
    "mov $0, %%ecx\n"  
    "adcq %%r11, %%r11\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r10\n"
    "addq %%rax, %%r8\n" /* q1 output */
    "movq %%r12, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r12, %%r12\n"
    "addq %%r15, %%r11\n" 
    "adcq %%r12, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r13\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r10, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r9, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r12\n"
    "addq %%r13, %%rcx\n"    /* q3 output */
    "adcq $0, %%r12\n"
    "movq %%r15, %%rax\n"
    "addq %%r14, %%r12\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / r8 as the connecting link 
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / r12 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r12\n"   
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r14\n"  /* 0(q2) output */
    "movq %%rdx, %%r9\n" 
    "xorq %%r10, %%r10\n"
    "addq %%r8, %%r9\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r12\n"
    "addq %%rax, %%r9\n" /* 8(q2) output */
    "adcq %%rdx, %%r10\n" 
    "xor %%ebx, %%ebx\n"
    "addq %%r12, %%r10\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r8\n"
    "addq %%r11, %%r10\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r14 carries ex 0(%%rdi), 
       r9 carries ex 8(%%rdi),
       r10 carries ex 16(%%rdi), 
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r12\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r13\n"
    "cmp   %%r8 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r10\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r9\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r12,%%r9\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r12b\n"
    "cmp  %%r13,%%r14\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r13b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r13d,%%r12d\n"
    "movzbl %%r12b,%%r12d\n"
    "and   %%r12d,%%ecx\n"
    "xor    %%r12d,%%r12d\n"
    "cmp   %%r8,%%r10\n"
    "sete  %%r12b\n"
    "xor   %%r13d,%%r13d\n"
    "and   %%r12d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r14\n"
    "adcq  %%rdx,%%r13\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r14,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r9,%%rsi\n"
    "adcq %%rdx,%%rdx\n" 
    "addq %%rsi,%%r13\n"
    "movq %%r13,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r10\n"
    "adcq %%rax,%%rax\n" 
    "addq %%r15,%%r10\n"
    "movq %%r10,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");      
     
#else
    uint64_t l[8];
    secp256k1_scalar_sqr_512(l, a);
    secp256k1_scalar_reduce_512(r, l);
#endif    
}

This, measured right now, gives

(original)
scalar_sqr: min 0.134us / avg 0.135us / max 0.136us
scalar_mul: min 0.141us / avg 0.143us / max 0.144us
scalar_inverse: min 40.5us / avg 40.6us / max 40.9us

(my hacked version - only gcc)
scalar_sqr: min 0.122us / avg 0.122us / max 0.122us
scalar_mul: min 0.126us / avg 0.127us / max 0.127us
scalar_inverse: min 36.7us / avg 36.9us / max 37.1us

The way the original code is (very readable for maintenance though - unlike my crap), if one dissassembles it, shows something like that:

1) mul512 or sqr512 starts and then writes its output to variables

2) Then we have pops and pushes for the next function which is reduce512

3) The reduce512 function imports the data from the outputs of #1

4) Reduce512 goes in 3 stages with each stage writing its own distinct output to variables and then the next stage imports it as its input. (The three stages can be streamlined by merging them - always using registers. The necessity for distinct output points and input points is then redundant / less moves and no need for variables).

5) As reduce512 ends, it puts its own output to variables

6) Final reduction imports the output of (5) and processes it.

My rationale was that if mul512 OR sqr 512+reduce512+final reduction go together, in one asm, one saves a lot of inputs/outputs and pops/pushes. Plus code size goes down significantly (1300 bytes => 1000 bytes) which leaves some extra L1 cache for other stuff. Reduce512/mul512/sqr512 still exists as code (altered) but they aren't really called. What gets called is the unified secp256k1_scalar_mul and the secp256k1_scalar_sqr - which have everything inside them. This was proof of concept so to speak, because I was seeing the disassembled output and I was like "AARRRGGHHH why can't one stage or function simply forward its results with the registers and there is all this pushing and popping and ram and variables".

For example, this is the behavior between (5) and (6) in the disassembled output of the original reduce512:

406246:   4c 89 4f 10     mov %r9,0x10(%rdi)
  40624a:   4d 31 c9    xor %r9,%r9
  40624d:   49 01 f0    add %rsi,%r8
  406250:   49 83 d1 00     adc $0x0,%r9
  406254:   4c 89 47 18     mov %r8,0x18(%rdi)
  406258:   4c 89 cb    mov %r9,%rbx
  40625b:   4c 8b 5f 18     mov 0x18(%rdi),%r11
  40625f:   48 8b 77 10     mov 0x10(%rdi),%rsi

My thoughts were like "ok, these ram moves are redundant and HAVE TO GO". Why should r8 write to ram and then get reimported from ram to r11? Why should r9 go to ram and get re-imported instead of going straight to rsi? Waste of time". I had the same reaction every time I spotted data going out and then getting moved back in as input - instead of being used as is).

Still, the source is very readable the way it is right now and the performance tradeoff is not that large compared to understanding what each thing does.

piotr_n

legendary

Activity: 2058

Merit: 1421

aka tonikt

Quote from: piotr_n on January 14, 2017, 01:37:18 PM

Why don't you want to discuss the protocol changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.
Please explain me: if it isn't about your ego, then what is it about?

Oh... No comments?
Well, let me guess, then...

You can't talk about it, because it happens that this specific feature is being researched by a company that you work and signed an NDA for?

Maybe it's indeed not about your ego.
Maybe it's just about money.

piotr_n

legendary

Activity: 2058

Merit: 1421

aka tonikt

Mind that later I spent some time with sipa asking him how to realize the bip 37 solution for fetching the missing transactions. He came back to me the next day saying that it wasn't actually possible.
So yes, I was listening.
But at that moment it was pretty clear that in order to improve block downloading times some network protocol changes were needed. Whilst the message I got from you was that you weren't willing to make any changes because you didn't seem to see block downloading times as a problem worth solving.

Just like at this moment you don't seem to see the bootstrapping time to be a problem worth solving.
Although not because it's not really worth solving, but because you want to solve it yourself.
Because you only care about a progress in Bitcoin development when it goes along the way with indulging your ego.
If it doesn't, then it has to wait for when you have time.

gmaxwell

staff

Activity: 4326

Merit: 8951

Quote from: piotr_n on January 14, 2017, 01:49:35 PM

No - I came to you saying that I would like to work on improving the block download times.

I really suggest you read the log. That might be what you were intending on doing; but what you did was some and insist on a specific protocol change to allow downloading blocks in tiny chunks.

We explained why this was unlikely to improve performance, was likely to create vulnerabilities, and suggested some things that we thought more likely to be successful. We also suggested you try making the changes in your own protocol so you could measure the result, and not just believe us or rely on speculation.

Quote

Why don't you want to discuss the protocol changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.

Huh? Because it's far far offtopic and you're just switching subjects-- but I've talked about this many times here before. If you're referring to just trusting miners to give you a correct UTXO snapshot it is a _massive_ change in the security model-- and if you're happy with that change you can just use SPV. If you're talking about static cached UTXO data, then there isn't any consensus change needed... and software implementing that just ... implements it.

Quote

And you told me off, saying basically that it wasn't necessary. Plus also some other bullshit about BIP37 that didn't turn out to be true.

Code:

10:03 < tonikt> Like being able to download a block in fragments
10:03 < sipa> tonikt: BIP37 allows that, in a way
[...]
10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.

What you were specifically suggesting wasn't necessary or helpful. And what we told you about BIP37 was true (afaict), you could have used it to download chunks of blocks. (and though it wasn't mentioned there Pieter had already tried it out, though mostly to eliminate already transfered txns... but the overheads made it not a useful improvement).

Quote

But then few years later, suddenly improving the block download times became so much desirable and you've been so proud of you delivering it to the public.
Pathetic

Don't confuse some specific proposal being unhelpful for the goal which you hoped for it which it could not accomplish with that goal being an unhelpful.

You were very aggressively pushing a particular change based on a flawed belief that it would help, all I was doing was pointing out why I didn't expect that particular change to be helpful (that transferring data in 32KB chunks would be slower due to latency/overhead and more vulnerable), as well as suggesting what you could to prove out the effectiveness of the change or do it on your own without worrying about what we thought. It seems you only understand working on things exactly the way you think they should be done, and believe that anyone who doesn't is saying they don't care about improvement. I think you should consider some other possibilities.

piotr_n

legendary

Activity: 2058

Merit: 1421

aka tonikt

Where did I say that compact blocks were 'my proposal'?

No - I came to you saying that I would like to work on improving the block download times.
And you told me off, saying basically that it wasn't necessary. Plus also some other bullshit about BIP37 that didn't turn out to be true.

Code:

10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.

But then few years later, suddenly improving the block download times became so much desirable and you've been so proud of you delivering it to the public.
Pathetic

Topic: secp256k1 library and Intel cpu (Read 4059 times)