Author

Topic: secp256k1 library and Intel cpu (Read 4048 times)

newbie
Activity: 2
Merit: 0
April 14, 2022, 01:51:36 PM
#45
Hi All,

  What's the current fastest version of secp256k1_fe_mul for i5/i7?
  It is currently my bottleneck in the secp256k1_unsafe lib, which has the same asm as the official one.

One of the asm versions in this thread caused a segfault for me; the other was slower than the original.

  %   cumulative   self                self     total
 time   seconds   seconds      calls  ms/call  ms/call  name
46.24      4.67     4.67  235910453     0.00     0.00  secp256k1_fe_mul_inner
13.71      6.06     1.39   88835758     0.00     0.00  secp256k1_fe_sqr_inner
10.05      7.07     1.02          2   507.51   507.51  sha256_transf
legendary
Activity: 1140
Merit: 1000
The Real Jude Austin
March 19, 2017, 04:00:49 AM
#44
You are not a wallet thief, man - you've done nothing wrong in that matters.
I am also not a wallet thief, even though the man has also somehow accused me of such.

We are all dicks though. :)

If I was to be upset by all the names that different kind of dicks call me, I'd have been upset all me life.
It's a public forum, people have different characters which are going to clash and you have to know how to deal with it - especially if you are older than them.
But that is not the point.

The point is that I have been here for years and if someone asked me today how I have contributed into an actual "technical development" of bitcoin by my activity on this "Development & Technical Discussion" forum, I'd have said: I haven't at all.
Although I have certainly made my dick bigger here.
And the sad part is that everybody seems to be here for that reason.

We are all definitely dicks, lol.

 
staff
Activity: 4270
Merit: 1209
I support freedom of choice
January 16, 2017, 06:39:05 PM
#43
I'm not sure if you will find them better, but on bitco.in/forum has some places for technical discussions. Maybe, or maybe not, you can find other opinions/help about this and other objects.
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 16, 2017, 05:24:34 PM
#42
You are not a wallet thief, man - you've done nothing wrong in that matters.
I am also not a wallet thief, even though the man has also somehow accused me of such.

We are all dicks though. :)

If I was to be upset by all the names that different kind of dicks call me, I'd have been upset all me life.
It's a public forum, people have different characters which are going to clash and you have to know how to deal with it - especially if you are older than them.
But that is not the point.

The point is that I have been here for years and if someone asked me today how I have contributed into an actual "technical development" of bitcoin by my activity on this "Development & Technical Discussion" forum, I'd have said: I haven't at all.
Although I have certainly made my dick bigger here.
And the sad part is that everybody seems to be here for that reason.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 16, 2017, 05:03:24 PM
#41
I really wish this forum would sometimes happen to be about an actual technical discussion and not always about who has a bigger dick.

How can that be if it's moderated by a bearded dwarf (with the according dick size) acting out his moderator power fantasies.
I mean look what happened:

I contributed to secp256k1, but then I made the fatal mistake to criticize it and the even more fatal mistake to criticize his holiness Maxwell the 1st.

Immediately I am - according to treebeard extraordinaire - "A wallet thief", "incompetent" and what not.
Plus of course the usual negative trust rating he gives in these cases. Yay - my 1st negative trust rating here ever.

How can you have a technical discussion, if the gnome here is swinging his sceptre. Not benevolently I might add if you happen to not wank his ego.

Then technical discussion happens via PM (thanks AlexGR et al.) or elsewhere and result in non-public solutions.

Yep guys - there you have it: Your glorious core developer deity barring people from bringing their ideas/work in. Sorry, not exact. *Taking their contribution and fuck them over afterwards.* Yep more like it.


Rico
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 16, 2017, 04:26:15 PM
#40
I really wish this forum would sometimes happen to be about an actual technical discussion and not always about who has a bigger dick.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 16, 2017, 03:38:50 PM
#39
Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.
"Violence is the last refuge of the incompetent."

Interesting. So real men are incompetent in your eyes. I thought real men (including intelligence) have a beer.

My bad. Assuming intelligence. Won't happen again. Sorry.


Rico
staff
Activity: 4326
Merit: 8951
January 16, 2017, 10:27:16 AM
#38
Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.
"Violence is the last refuge of the incompetent."
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 16, 2017, 01:11:33 AM
#37
"I started making keys, starting with ones with fewest cuts and systematically working through all possibilities. To learn if these keys matched any that had been used in the past, I tried each one in every door in the neighborhood.  After a bit I found a few valuables. What was I supposed to do, leave them there?"

Yeah. I had lots of these discussions. Your comparison doesn't apply - even remotely.

"I started taking walks in the park - systematically taking paths to cover the whole area. From time to time I find some coins. What am I supposed to do, leave them there?"

The doors in the neighborhood have names on them. And yes, even "for finds in the park" rules apply. We adhere to them.

You are lucky, this night the pool found something again. The funds are still on the address. What would be your take on this now?

It's a rhetorical question, I do not really need your input. As promised I slept over our - for me yesterday's - "conversation". I guess I'll leave the lawyers in their box this time. Instead, when we meet at the next Bitcoin event we'll both be attending, I'll approach you and we'll handle our arguments like real men. Promise.


Rico
staff
Activity: 4326
Merit: 8951
January 15, 2017, 01:05:56 PM
#36
"I started making keys, starting with ones with fewest cuts and systematically working through all possibilities. To learn if these keys matched any that had been used in the past, I tried each one in every door in the neighborhood.  After a bit I found a few valuables. What was I supposed to do, leave them there?"
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 15, 2017, 12:02:29 PM
#35
A normal move is constant time but not conditional, so "constant time move" wouldn't make much sense.

It was a hint to the mnemonics used. "cctmov ctcmov ..."

But I understand the importance of constant time now better (especially why https://github.com/llamasoft/secp256k1_fast_unsafe proclaims itself as being unsafe).

Quote
If so, only by a highly inefficient means. hash160 is a hash function, there is no EC involved at all. If you were merely attempting to  find a hash160 collision you'd simply need to run hash160. You also wouldn't have any reason to check your results against assets in the bitcoin chain...  You might also use an efficient collision finding algorithm instead of brute force.

I know the hash160 is only the RIPEMD160(SHA256(x)) part of the generation process. So maybe I'm just missing the nomenclature here for "full generation chain collision". Unfortunately such a thing is not mentioned here https://bitcointalksearch.org/topic/reward-offered-for-hash-collisions-for-sha1-sha256-ripemd160-and-other-293382

The reason why it's checking against the funds is explained in the text I referenced.

Quote
Quote
If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence
You stole funds in this transaction: https://blockchain.info/tx/e094692e7d198500480ff5de3d6816e5054708bdea77f3c7db2fc3263f776b75

Ah. And I did so in "full disclosure", haven't touched the funds on the custody address and if you point me to the rightful owner you are aware he gets not only his funds back, but also quite a bounty from RyanC and I believe a round-up from myself?

I mean do you even think before you write some things? Right now I am

  • doing the LBC project as a hobby
  • never took any money/donation for it (although offered) and don't intend to
  • except the users testing, never wanted anyone to "slavishly" contribute any work
  • on the contrary, I contributed off-spins from this to other projects (secp256k1, Linux Kernel - which will have a faster RIPEMD160 because of me)

Actually, it looks like I have the fastest RIPEMD160 CPU implementation at the moment.

I can accept, that you call the LBC project "inefficient", "badly designed" or whatever as response to my criticism of the libsecp256k1, but  pulling the "you're a thief" card is really low.

If you consider transferring the 0.00778379 BTC funds to custody "stealing", you should really work on your judgement. What should I have done instead? Leave them alone, just saying "Pool has found something, but we won't tell you what and where"? With the "pool search front" being quite visible even this announcement would lead to jeopardizing the funds. So what then? Not doing it at all? Maybe you need more people to tell you to stop working in the Crypto world and breed sheep instead.

If you have suggestions for an escrow for the finds, come forth.

Quote
Quote
And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.
Bring it on.

I see. Let me sleep a night over it and then decide if educating you is worth the effort.


Rico
staff
Activity: 4326
Merit: 8951
January 15, 2017, 11:20:21 AM
#34
Ok. Hasn't explained why cmov implementation looks like it does, because right now - after this explanation - it seems to me cmov is short for "constant time move"
It is a conditional move, and does exactly what the comment says it does: "If flag is true, set *r equal to *a; otherwise leave it. Constant-time.". The construction used for it is one that results in constant time behavior on our target systems, isn't undermined by the compiler, and which is relatively fast.   A normal move is constant time but not conditional, so "constant time move" wouldn't make much sense.
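
To make the distinction above concrete, here is a minimal sketch that puts a branching conditional move next to a mask-based one. The `toy_fe` type and both `toy_fe_cmov_*` names are hypothetical stand-ins, not the library's API, and whether the masked form really stays branch-free depends on the compiler and target, which this sketch does not verify.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 5-limb element (hypothetical; only the limb count matches the thread). */
typedef struct { uint64_t n[5]; } toy_fe;

/* Branching version: whether the copy happens depends on flag, so the
 * execution time and branch-predictor state can depend on a secret flag. */
static void toy_fe_cmov_branchy(toy_fe *r, const toy_fe *a, int flag) {
    if (flag) {
        *r = *a;
    }
}

/* Masked version: the same loads, ANDs, ORs and stores run for either
 * value of flag. flag must be exactly 0 or 1. */
static void toy_fe_cmov_masked(toy_fe *r, const toy_fe *a, int flag) {
    uint64_t mask0, mask1;
    int i;
    assert(flag == 0 || flag == 1);
    mask0 = (uint64_t)flag + ~((uint64_t)0); /* flag=1 -> 0, flag=0 -> all ones */
    mask1 = ~mask0;                          /* flag=1 -> all ones, flag=0 -> 0 */
    for (i = 0; i < 5; i++) {
        r->n[i] = (r->n[i] & mask0) | (a->n[i] & mask1);
    }
}
```

Both functions produce the same result for a given flag; the difference is only in whether the sequence of executed instructions depends on it.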

Quote
I am trying to find a hash160 collision.
If so, only by a highly inefficient means. hash160 is a hash function, there is no EC involved at all. If you were merely attempting to  find a hash160 collision you'd simply need to run hash160. You also wouldn't have any reason to check your results against assets in the bitcoin chain...  You might also use an efficient collision finding algorithm instead of brute force.

Quote
If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence

You stole funds in this transaction: https://blockchain.info/tx/e094692e7d198500480ff5de3d6816e5054708bdea77f3c7db2fc3263f776b75

Quote
And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.
Bring it on.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 15, 2017, 10:56:23 AM
#33
Yes, we've already established that you're primarily interested in stealing from people and that you're unable to understand that many others are not exactly supportive of your goals.

You are aware, that this is already beyond the threshold of legal action?

I am trying to find a hash160 collision. If you have evidence that I have stolen something in the past or evidence I plan to steal something in the future, please either present that evidence or keep it near when the lawyer knocks on your door.

Until then, you may educate yourself: https://lbc.cryptoguru.org/man/theory

And if you continue with your allegation(s), I really have no problem to step out of (pseudo)-anonymity and let my lawyers drill your sorry ass until they hit crude oil. And this is not even a rude statement.

Quote
You seem to think that people who don't slavishly help you steal from others are bad or incompetent people, or at least that you can get them to help you by insulting them as though they were. For someone who claims to be so old, you sure don't seem very wise. :)

Final warning. I can benevolently assume, that your reactions are based on the false assumption I am someone who does or wants to steal from people. Under such an assumption I would probably also interact "harshly" with my counterpart. So I give you that. But unless you have the slightest proof for your allegation/assumption, the time to stop it has come.


Quote
Congrats, by ignoring the _constant time_ in the function documentation you just leaked the users secret to a timing side-channel-- and _still_ managed to end up with code considerably slower than simply changing to non-constant time competently.

Ok. Hasn't explained why cmov implementation looks like it does, because right now - after this explanation - it seems to me cmov is short for "constant time move" - which would lead us to my original *thumbs up*...


Rico
staff
Activity: 4326
Merit: 8951
January 15, 2017, 10:25:27 AM
#32
Well - it isn't for me, right now I am interested in generation performance.

Yes, we've already established that you're primarily interested in stealing from people and that you're unable to understand that many others are not exactly supportive of your goals.

You seem to think that people who don't slavishly help you steal from others are bad or incompetent people, or at least that you can get them to help you by insulting them as though they were. For someone who claims to be so old, you sure don't seem very wise. :)

Quote
my advanced intuition tells me (1) is the magnitude of the result,

Yes, the comments trace the magnitude of the results; and tie the code back to the algebraic verification of the functions (in sage/). There is no point in a comment that merely repeats the code.

Quote
Code:
if (degenerate) {
    n = m;
}
else {
    secp256k1_fe_sqr(&n, &n);
}

Hm. Works. Is faster. Tests say "no problems found". And now the code makes sense.
Of course it is only a trivial thing. Probably not worth the effort. Right?

So conditional move you say?

Code:
/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);
yes? Why do I see then this epic kludge in the code?

Congrats, by ignoring the _constant time_ in the function documentation you just leaked the users secret to a timing side-channel-- and _still_ managed to end up with code considerably slower than simply changing to non-constant time competently.

legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 15, 2017, 09:00:26 AM
#31
It's a trivial optimization that already existed in three other places in the code. Thanks for noticing that it hadn't been performed there, and providing code for one of the two places it needed to be improved, but come on-- you're just making yourself look foolish here with the rude attitude while you're clearly fairly ignorant about what you're talking about overall. Case in point:

...verification ... verification ... verification

Gee - you haven't seen my rude attitude yet. And you won't, because I am honestly too old for these cockfights.
It is evident you are completely project blind. Verification is king. Well - it isn't for me, right now I am interested in generation performance. Now the big revelation dear Maxwell:

This is what libraries are about.

A library(sic!) is not about serving just the purpose of sci-fi fans. And a librarian who is a sci-fi fan and is carried away by his fandom in a way he bars or neglects other uses of the library, is a BAD librarian. Don't be a bad librarian.

Your reaction was expected. Now my "trivial" optimization is hardly worth it. Hey - it makes the relevant functions faster by 6-7 orders of magnitude, but as it brings only 0.5% for DA HOLY VERIFICATION ... nah. Elegantly this also allows you to sweep under the carpet the question why the rubbish code remained for so long in the lib. "Not worth it" I suppose... where did I read that? *rolls eyes*

edit:
Odd to hear that, since the libsecp256k1 code has received many positive comments about its clarity and origination. One researcher described it as the cleanest production ECC code that he'd read.

Style preferences differ, of course -- let's see what your own code looks like

ftp://ftp.cryptoguru.org/LBC/client/LBC

 

So you take my intentionally deformatted code as counterexample of coding style ... *gasp* I don't think that futile attempt does even qualify as pathetic. Let's see a snapshot from my editor.

https://i.imgur.com/JWSJlee.png

etc. Better luck next time.
/edit

Quote
But what is strange is that it is _extensively_ documented.

Ok. Let's take this piece of code and dissect it a little bit - shall we? I am fair and take what is supposedly one of the best commented pieces "secp256k1_gej_add_ge"

Code:
    secp256k1_fe_cmov(&rr_alt, &rr, !degenerate);
    secp256k1_fe_cmov(&m_alt, &m, !degenerate);
    /* Now Ralt / Malt = lambda and is guaranteed not to be 0/0.
     * From here on out Ralt and Malt represent the numerator
     * and denominator of lambda; R and M represent the explicit
     * expressions x1^2 + x2^2 + x1x2 and y1 + y2. */
    secp256k1_fe_sqr(&n, &m_alt);                       /* n = Malt^2 (1) */
    secp256k1_fe_mul(&q, &n, &t);                       /* q = Q = T*Malt^2 (1) */
    /* These two lines use the observation that either M == Malt or M == 0,
     * so M^3 * Malt is either Malt^4 (which is computed by squaring), or
     * zero (which is "computed" by cmov). So the cost is one squaring
     * versus two multiplications. */
    secp256k1_fe_sqr(&n, &n);
    secp256k1_fe_cmov(&n, &m, degenerate);              /* n = M^3 * Malt (2) */

So Malt is m_alt and Ralt is rr_alt - yeah sure, why not.

What's more interesting is the correspondence between the code and the comments.

q = Q = T*Malt^2 (1)

WTF is Q and what is (1)? Oh - my advanced intuition tells me (1) is the magnitude of the result, but I could be wrong - some time to verify that needed. So the comment does not correspond to the code. The comment corresponds  to the sum (not meant in an arithmetic sense) of the previous codes - it seems. Interesting... But this is all only prologue.

Let's look at the last two lines of the code snippet:

So the lib computes n = n², but if degenerate is true this computation is discarded as n is set to m. It claims in the aforementioned "documentation" this is a very witty thing to do.

Methinks. I could be wrong. If I am right, the code is pretty bad, if I am wrong, I've been misled after all I know about the code (say 20 hours diddling with it in total). Personally, I'd compute the sqr only in case !degenerate. So the cost would actually be one squaring - sometimes - versus nothing. And this is just one tiny little example of a myriad of weirdnesses all around the code.

So given what the perfect documentation tells me cmov being conditional move, the code would have been better written as:

Code:
if (degenerate) {
    n = m;
}
else {
    secp256k1_fe_sqr(&n, &n);
}

Hm. Works. Is faster. Tests say "no problems found". And now the code makes sense.
Of course it is only a trivial thing. Probably not worth the effort. Right?

So conditional move you say?

Code:
/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

Code:
if (flag) {
    *r = *a;
}

yes? Why do I see then this epic kludge in the code?

Code:
static SECP256K1_INLINE void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag) {
    uint64_t mask0, mask1;
    mask0 = flag + ~((uint64_t)0);
    mask1 = ~mask0;
    r->n[0] = (r->n[0] & mask0) | (a->n[0] & mask1);
    r->n[1] = (r->n[1] & mask0) | (a->n[1] & mask1);
    r->n[2] = (r->n[2] & mask0) | (a->n[2] & mask1);
    r->n[3] = (r->n[3] & mask0) | (a->n[3] & mask1);
    r->n[4] = (r->n[4] & mask0) | (a->n[4] & mask1);
#ifdef VERIFY
    if (a->magnitude > r->magnitude) {
        r->magnitude = a->magnitude;
    }
    r->normalized &= a->normalized;
#endif
}


Let's continue with the "extensive documentation":

Code:
    secp256k1_fe_sqr(&t, &rr_alt);                      /* t = Ralt^2 (1) */
    secp256k1_fe_mul(&r->z, &a->z, &m_alt);             /* r->z = Malt*Z (1) */
    infinity = secp256k1_fe_normalizes_to_zero(&r->z) * (1 - a->infinity);
    secp256k1_fe_mul_int(&r->z, 2);                     /* r->z = Z3 = 2*Malt*Z (2) */
    secp256k1_fe_negate(&q, &q, 1);                     /* q = -Q (2) */
    secp256k1_fe_add(&t, &q);                           /* t = Ralt^2-Q (3) */

Hm.  Ah. There is our Q. We assume it is an alias for q. Why - who knows? Makes things more interesting.
Again, we have our "the comment is the sum of previous code" and suddenly we use the alias only. Yeah why not.
Best of all is to be able to have a subtraction in the comment, when the code says "add". This is way more thrill.
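
For readers puzzled by the same add-versus-subtract mismatch: negating first and then adding is one common way to express a modular subtraction without ever producing a negative intermediate. The sketch below shows the idea on a tiny prime. `TOY_P`, `toy_negate` and `toy_sub_via_add` are hypothetical names, and the bound `a <= 2*m*TOY_P` is only an assumed simplification of the magnitude bookkeeping hinted at by the `(1)`, `(2)`, `(3)` annotations in the quoted comments.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_P 97u /* small stand-in prime; the real field prime is close to 2^256 */

/* Return a value congruent to -a (mod TOY_P) that never goes below zero.
 * m is an assumed "magnitude" bound, taken here to mean a <= 2*m*TOY_P. */
static uint32_t toy_negate(uint32_t a, uint32_t m) {
    assert(a <= 2u * m * TOY_P);
    return 2u * (m + 1u) * TOY_P - a; /* a multiple of TOY_P, minus a */
}

/* t - q (mod TOY_P), written as "negate q, then add", with a single
 * reduction at the very end. q is assumed to have magnitude 1. */
static uint32_t toy_sub_via_add(uint32_t t, uint32_t q) {
    uint32_t neg_q = toy_negate(q, 1u);
    return (t + neg_q) % TOY_P;
}
```

With t = 10 and q = 25, the negation gives 363, the sum is 373, and 373 mod 97 is 82, the same as (10 - 25) mod 97.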

And allow me for a final remark:

Quote
... even though this is purely internal code and is not accessible to an end user of the library.

WTF? The "end user of the library" is of exactly 0 (in words: zero) interest to our discussion.

Hint: All this was only a remark. My energy is probably better spent in setting up my own FE arithmetics and a complete reimplementation of the GE API which as is absolutely ignores the natural flow of the data and need for MULADD, resp SETSUM 3-ARG operations. Don't worry, I won't disturb the intellectual incest any more. It could be considered rude.

Good luck.

Rico
legendary
Activity: 1708
Merit: 1049
January 14, 2017, 05:06:59 PM
#30
Neat.  You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance and spend a lot of their time on random things.  Compile with --enable-benchmarks and use the benchmarks. :)

A quick check on i7-4600U doesn't give a really clear result:


Before:
field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us
field_mul: min 0.116us / avg 0.116us / max 0.117us
field_inverse: min 25.2us / avg 25.7us / max 28.5us
field_inverse_var: min 13.8us / avg 13.9us / max 14.0us
field_sqrt: min 24.9us / avg 25.0us / max 25.2us
ecdsa_verify: min 238us / avg 238us / max 239us

After (v1):
field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us
field_mul: min 0.117us / avg 0.117us / max 0.117us
field_inverse: min 25.4us / avg 25.5us / max 25.9us
field_inverse_var: min 13.7us / avg 13.7us / max 14.0us
field_sqrt: min 25.1us / avg 25.3us / max 26.1us
ecdsa_verify: min 237us / avg 237us / max 237us

After (v2):
field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us
field_mul: min 0.118us / avg 0.118us / max 0.119us
field_inverse: min 25.9us / avg 26.0us / max 26.4us
field_inverse_var: min 13.6us / avg 13.7us / max 13.8us
field_sqrt: min 25.6us / avg 25.9us / max 27.8us
ecdsa_verify: min 243us / avg 244us / max 246us



Hmm... interesting how different architectures are affected. Unless you are underclocked, I think the times are pretty slow for that particular CPU - is there any debugging or performance-logging framework running on top of this that creates overhead, distorting the performance? (Although I do expect newer chips to have better schedulers). Realistically, you should be quite a bit faster than me. (My lib is built with ./configure --enable-benchmark and gcc default flags, no endomorphism).

For comparison (q8200 @ 1.86)

Before:
field_sqr: min 0.0680us / avg 0.0681us / max 0.0683us
field_mul: min 0.0833us / avg 0.0835us / max 0.0841us
field_inverse: min 18.5us / avg 18.6us / max 18.8us
field_inverse_var: min 6.32us / avg 6.32us / max 6.33us
field_sqrt: min 18.4us / avg 18.6us / max 18.9us
ecdsa_verify: min 243us / avg 243us / max 245us

(v1)
field_sqr: min 0.0654us / avg 0.0660us / max 0.0667us
field_mul: min 0.0819us / avg 0.0822us / max 0.0825us
field_inverse: min 18.4us / avg 18.4us / max 18.5us
field_inverse_var: min 6.35us / avg 6.36us / max 6.37us
field_sqrt: min 18.4us / avg 18.4us / max 18.5us
ecdsa_verify: min 235us / avg 236us / max 237us

(v2)
field_sqr: min 0.0660us / avg 0.0675us / max 0.0679us
field_mul: min 0.0858us / avg 0.0861us / max 0.0862us
field_inverse: min 18.8us / avg 18.8us / max 18.8us
field_inverse_var: min 6.31us / avg 6.31us / max 6.31us
field_sqrt: min 18.5us / avg 18.6us / max 18.7us
ecdsa_verify: min 243us / avg 243us / max 244us

I've always used the benchmarks provided, but I think they may lack real-world correlation. My wakeup call came a few months ago, when I disassembled bench_internal to see what clang and gcc were doing differently in terms of asm... one was inlining/merging the benchmark and the function to be benchmarked, thus saving the overhead of calling it and distorting the result. I think it was clang that was merging it - and that particular benchmark was faster for it. So, because of this type of distortion, I couldn't tell which implementation was actually faster. I think it would be a nice addition if we had something like the validation of, say, a given number of bitcoin blocks (let's say 10-20 MB of data loaded in RAM) as a more RL-like benchmark.
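
One way to reduce the inlining distortion described above is to forbid inlining of the function under test and to chain each result into the next call, so both compilers are more likely to measure an actual call. This is only a sketch: `toy_work` and `toy_bench` are hypothetical names, `__attribute__((noinline))` is a GCC/clang extension, and it does not rule out other interprocedural optimizations.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* noinline is a GCC/clang extension; on other compilers this expands to nothing. */
#if defined(__GNUC__)
#define TOY_NOINLINE __attribute__((noinline))
#else
#define TOY_NOINLINE
#endif

/* Hypothetical stand-in for the function being benchmarked. */
static TOY_NOINLINE uint64_t toy_work(uint64_t x) {
    return x * 6364136223846793005ULL + 1442695040888963407ULL;
}

/* Call toy_work iters times, feeding each result into the next call so
 * no call can be dropped. Stores the final value in *out and returns
 * the elapsed CPU time in seconds. */
static double toy_bench(uint64_t iters, uint64_t *out) {
    volatile uint64_t sink;
    uint64_t x = 1;
    uint64_t i;
    clock_t start;
    assert(out != 0);
    start = clock();
    for (i = 0; i < iters; i++) {
        x = toy_work(x);
    }
    sink = x; /* volatile store: the result is observably used */
    *out = sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing the disassembly of such a harness under gcc and clang, as the post did for bench_internal, is still the only way to confirm that both are timing the same thing.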

Btw, I remember having seen a video where you gave a lecture about the library to a university (?) and commenting on the tests of the library, saying something to the effect that perhaps in the future a bounty can be issued about bugs that exist but can't be detected by the tests.

Asm tampering (especially if you try to repurpose the rdi/rsi registers) is definitely one of the fields where you can have the tests run fine and then have bench_internal or bench_verify abort due to an error. Or the opposite (benchmarks run OK, tests crash). Or have it be entirely OK in one compiler (tests/benchmarks) and then crash in another. This is due to the compiler using the registers differently before or after the functions, in conjunction with other functions, and the same code is sometimes OK in certain use cases (executables) and crashes in other use cases (different executables), so it's much trickier than C because I have no idea how a test could catch these. After all, it can only test its own execution.
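
The failure mode described above usually comes down to the compiler not being told about every register an asm block reads or writes. As a small illustration, assuming GCC-style extended asm on x86-64, the sketch below declares every register it touches through constraints, so the compiler cannot keep a live value in one of them across the block. `toy_mul64` is a hypothetical helper, not part of the library, and a portable fallback is used on other targets.

```c
#include <assert.h>
#include <stdint.h>

/* 64x64 -> 128-bit multiply; *hi and *lo receive the two halves. */
static void toy_mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    assert(hi != 0 && lo != 0);
#if defined(__x86_64__) && defined(__GNUC__)
    {
        uint64_t h, l;
        __asm__("mulq %3"
                : "=a"(l), "=d"(h) /* outputs: rax = low half, rdx = high half */
                : "a"(a), "rm"(b)  /* inputs: a in rax, b in a register or memory */
                : "cc");           /* mulq modifies the flags */
        *hi = h;
        *lo = l;
    }
#else
    {
        /* Portable fallback: schoolbook multiply on 32-bit halves. */
        uint64_t a0 = (uint32_t)a, a1 = a >> 32;
        uint64_t b0 = (uint32_t)b, b1 = b >> 32;
        uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
        uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
        *lo = (mid << 32) | (uint32_t)p00;
        *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
    }
#endif
}
```

If an asm block writes to a register that is declared only as an input, or to one that is not mentioned at all, the result depends on what the compiler happened to keep there, which would explain code that passes in one executable and crashes in another.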

My "manual" testing routine to see if everything is OK is to run ./tests, ./bench_internal, ./bench_verify. If everything passes, it's probably good. This is not for the 5x52 (which doesn't have unstable code in it) but for my custom-made 4x64 impl.h (with different secp256k1_scalar_mul and secp256k1_scalar_sqr).

I wanted to put the whole file in so that no cutting and splicing are needed for 2 functions, but the forum notification is bugged (saying I have a post >64kbytes when I don't) so I had to cut down on the text. Anyway...

Code:
static void secp256k1_scalar_mul(secp256k1_scalar *r, const secp256k1_scalar *a, const secp256k1_scalar *b) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    const uint64_t *pb = b->d;
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r15\n"
    "movq 8(%%rdi), %%rbx\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 0(%%rdx), %%r11\n"
    "movq 8(%%rdx), %%r9\n"
    "movq 16(%%rdx), %%r10\n"
    "movq 24(%%rdx), %%r8\n"
    /* (rax,rdx) = a0 * b0 */
    "movq %%r15, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 */
    "movq %%rax, 0(%%rsi)\n"
    /* (r14,r12,r13) = (rdx) */
    "movq %%rdx, %%r14\n"
    "xorq %%r12, %%r12\n"
    "xorq %%r13, %%r13\n"
    /* (r14,r12,r13) += a0 * b1 */
    "movq %%r15, %%rax\n"
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    /* Extract l1 */
    "movq %%r14, 8(%%rsi)\n"
    "movq $0, %%r14\n"
    /* (r12,r13,r14) += a0 * b2 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a1 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a2 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    /* Extract l2 */
    "movq %%r12, 16(%%rsi)\n"
    "movq $0, %%r12\n"
    /* (r13,r14,r12) += a0 * b3 */
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Preload a3 */
    "movq 24(%%rdi), %%r15\n"
    /* (r13,r14,r12) += a1 * b2 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r10\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a2 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r12\n"
    /* (r13,r14,r12) += a3 * b0 */
    "mulq %%r11\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l3 */
    "movq %%r13, 24(%%rsi)\n"
    "movq $0, %%r13\n"
    /* (r14,r12,r13) += a1 * b3 */
    "movq %%rbx, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a2 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r13\n"
    /* (r14,r12,r13) += a3 * b1 */
    "mulq %%r9\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r13\n"
    /* Extract l4 */
   /* "movq %%r14, 32(%%rsi)\n"*/
    /* (r12,r13,r14) += a2 * b3 */
    "mulq %%r8\n"
    "movq %%r14, %%r11\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%r12\n"
    "movq %%r15, %%rax\n"
    "adcq %%rdx, %%r13\n"
    "adcq $0, %%r14\n"
    /* (r12,r13,r14) += a3 * b2 */
    "mulq %%r10\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "movq %%r15, %%rax\n"
    "adcq $0, %%r14\n"
    /* Extract l5 */
    /*"movq %%r12, 40(%%rsi)\n"*/
    /* (r13,r14) += a3 * b3 */
    "mulq %%r8\n"
    "addq %%rax, %%r13\n"
    "adcq %%rdx, %%r14\n"
    /* Extract l6 */
    /*"movq %%r13, 48(%%rsi)\n"*/
    /* Extract l7 */
    /*"movq %%r14, 56(%%rsi)\n"*/
    : "+d"(pb)
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
    

      __asm__ __volatile__(
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r12\n" */
   /*"movq 48(%%rsi), %%r13\n" */
  /*   "movq 56(%%rsi), %%r14\n" */
    "movq 0(%%rsi), %%rbx\n"  
    "movq %3, %%rax\n"
    "movq %%rax, %%r10\n"
    "xor %%ecx, %%ecx\n"  
    "xorq %%r15, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r8, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
    "addq 8(%%rsi), %%rcx\n"
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r15\n"
    "mulq %%r12\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n"
    "adcq %%r9, %%r8\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r9, %%r8\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r9, %%r8\n"
    "movq %%r10, %%rax\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r13\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "movq %4, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "movq %%r10, %%rax\n"
    "movq $0, %%r10\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r8\n"
    "adcq $0, %%r9\n"
    "addq 24(%%rsi), %%r8\n"
    "adcq $0, %%r9\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %4, %%rax\n"  
    "movq %%rax, %%rsi\n"  
    "adcq $0, %%r10\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%r8, %%r12\n" /* q3 into r12*/
    "adcq $0, %%r9\n"
    "movq $0, %%r8\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq %%r8, %%r8\n"
    "addq %%r9, %%r13\n" /*q4 into r13*/
    "adcq $0, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%r14, %%r10\n" /* q5 into r10 */
    "movq %3, %%rax\n"
    "movq %%rax, %%r9\n"
    "adcq $0, %%r8\n" /*q6 into r8*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / R12 as the connecting link
    %q9  input for second operation is %q4 output from first / R13 as the connecting link*
    %q10 input for second operation is %q5 output from first / R10 as the connecting link*
    %q11 input for second operation is %q6 output from first  / R8 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r13\n"
    "xorq %%r14, %%r14\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r14\n"
    "addq %%rcx, %%r14\n"
    "mov $0, %%ecx\n"  
    "movq %%r9, %%rax\n"
    "adcq %%r11, %%r11\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r13\n"
    "addq %%rax, %%r14\n" /* q1 output */
    "movq %%r9, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r9, %%r9\n"
    "addq %%r15, %%r11\n"
    "adcq %%r9, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r9, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r10\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r13, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r9\n"
    "addq %%r12, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r9\n"
    "mulq %%r8\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r9\n"
    "addq %%r10, %%rcx\n"    /* q3 output */
    "adcq $0, %%r9\n"
    "movq %%r15, %%rax\n"
    "addq %%r8, %%r9\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / R14 as the connecting link
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / R9 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r9\n"  
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r8\n"  /* 0(q2) output */
    "movq %%rdx, %%r12\n"
    "xorq %%r13, %%r13\n"
    "addq %%r14, %%r12\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r13, %%r13\n"
    "mulq %%r9\n"
    "addq %%rax, %%r12\n" /* 8(q2) output */
    "adcq %%rdx, %%r13\n"
    "xor %%ebx, %%ebx\n"
    "addq %%r9, %%r13\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r14\n"
    "addq %%r11, %%r13\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r8 carries ex 0(%%rdi),
       r12 carries ex 8(%%rdi),
       r13 carries ex 16(%%rdi),
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r9\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r10\n"
    "cmp   %%r14 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r13\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r12\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r9,%%r12\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r9b\n"
    "cmp  %%r10,%%r8\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r10b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r10d,%%r9d\n"
    "movzbl %%r9b,%%r9d\n"
    "and   %%r9d,%%ecx\n"
    "xor    %%r9d,%%r9d\n"
    "cmp   %%r14,%%r13\n"
    "sete  %%r9b\n"
    "xor   %%r10d,%%r10d\n"
    "and   %%r9d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r8\n"
    "adcq  %%rdx,%%r10\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r8,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r12,%%rsi\n"
    "adcq %%rdx,%%rdx\n"
    "addq %%rsi,%%r10\n"
    "movq %%r10,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r13\n"
    "adcq %%rax,%%rax\n"
    "addq %%r15,%%r13\n"
    "movq %%r13,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
    

#else
    uint64_t l[8];
    secp256k1_scalar_mul_512(l, a, b);
    secp256k1_scalar_reduce_512(r, l);
#endif  
}
Code:
static void secp256k1_scalar_sqr(secp256k1_scalar *r, const secp256k1_scalar *a) {
 #ifdef USE_ASM_X86_64
    uint64_t l[8];
    
    __asm__ __volatile__(
    /* Preload */
    "movq 0(%%rdi), %%r11\n"
    "movq 8(%%rdi), %%r12\n"
    "movq 16(%%rdi), %%rcx\n"
    "movq 24(%%rdi), %%r14\n"
    /* (rax,rdx) = a0 * a0 */
    "movq %%r11, %%rax\n"
    "mulq %%r11\n"
    /* Extract l0 (kept in rbx instead of being stored to 0(%%rsi)) */
    "movq %%rax, %%rbx\n"
    /* (r15,r9,r10) = (rdx,0,0); r8 is cleared for a later column */
    "movq %%rdx, %%r15\n"
    "xorq %%r9, %%r9\n"
    "xorq %%r10, %%r10\n"
    "xorq %%r8, %%r8\n"
    /* (r15,r9,r10) += 2 * a0 * a1 */
    "movq %%r11, %%rax\n"
    "mulq %%r12\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r15\n" /*8 rsi in r15*/
    "adcq %%rdx, %%r9\n"
    "movq %%r11, %%rax\n"
    "adcq $0, %%r10\n"
    /* Extract l1 */
   /* 8(rsi) in r15*/
    /* (r9,r10,r8) += 2 * a0 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r8\n"
    /* (r9,r10,r8) += a1 * a1 */
    "mulq %%r12\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    /* Extract l2 */
    "movq %%r9, 16(%%rsi)\n"
    "movq %%r11, %%rax\n"
    "movq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a0 * a3 */
    "adcq $0, %%r8\n"
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* (r10,r8,r9) += 2 * a1 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "adcq $0, %%r9\n"
    "addq %%rax, %%r10\n"
    "adcq %%rdx, %%r8\n"
    "movq %%r10, %%r13\n"
    "movq %%r12, %%rax\n"
    "adcq $0, %%r9\n"
    /* Extract l3 */
    /*"movq %%r10, 24(%%rsi)\n"*/

    /* (r8,r9,r10) += 2 * a1 * a3 */
    "mulq %%r14\n"
    "xorq %%r10, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "adcq $0, %%r10\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    "movq %%rcx, %%rax\n"
    "adcq $0, %%r10\n"
    /* (r8,r9,r10) += a2 * a2 */
    "mulq %%rcx\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r9\n"
    /* Extract l4 */
    /*"movq %%r8, 32(%%rsi)\n"*/
    "movq %%r8, %%r11\n"
    "movq %%rcx, %%rax\n"
    "movq $0, %%r8\n"
    /* (r9,r10,r8) += 2 * a2 * a3 */
    "adcq $0, %%r10\n"
    "mulq %%r14\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "adcq $0, %%r8\n"
    "addq %%rax, %%r9\n"
    "adcq %%rdx, %%r10\n"
    "movq %%r14, %%rax\n"
    "adcq $0, %%r8\n"
    /* Extract l5 */
    /*"movq %%r9, 40(%%rsi)\n"*/
 /*   "movq %%r9, %%r12\n"*/
    /* (r10,r8) += a3 * a3 */
    "mulq %%r14\n"
    "addq %%rax, %%r10\n"
    /* Extract l6 */
    /*"movq %%r10, 48(%%rsi)\n"*/
    /*"movq %%r10, %%rcx\n"*/
    /* Extract l7 */
    /*"movq %%r8, 56(%%rsi)\n"*/
    /*"movq %%r8, %%r14\n"*/
    :
    : "S"(l), "D"(a->d)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");
        
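    /* NOTE: the statement below expects rbx, rdx, r8-r11, r13, r15 and the carry
       flag to be exactly as the statement above left them (the a3 * a3 step is
       finished by the first adcq here). The compiler does not guarantee that
       between two separate asm statements. */
      __asm__ __volatile__(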
    /* Preload. */
  /*  "movq 32(%%rsi), %%r11\n" */
  /*  "movq 40(%%rsi), %%r9\n" */
  /*   "movq 48(%%rsi), %%r10\n" */
  /*   "movq 56(%%rsi), %%r8\n" */
  /*  "movq 0(%%rsi), %%rbx\n"  */
 /*   "movq %%rcx, %%r13\n"*/
    "movq %3, %%rax\n"
    "adcq %%rdx, %%r8\n"
    "mulq %%r11\n"
    "xor %%ecx, %%ecx\n"
    "xorq %%r12, %%r12\n"
    "xorq %%r14, %%r14\n"
    "addq %%rax, %%rbx\n" /*q0 into rbx*/
    "adcq %%rdx, %%rcx\n"
 /*   "addq 8(%%rsi), %%rcx\n" */
    "addq %%r15, %%rcx\n"
    "mov $0, %%r15d\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r15\n"
    "mulq %%r9\n"
    "addq %%rax, %%rcx\n" /*q1 stored to rcx*/
    "adcq %%rdx, %%r15\n"
    "movq %4, %%rax\n"
    "adcq %%r12, %%r14\n"
    "mulq %%r11\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r15\n"
    "adcq %%r12, %%r14\n"
    "addq 16(%%rsi), %%r15\n"
    "adcq %%r12, %%r14\n"
    "movq %3, %%rax\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r10\n"
    "movq %4, %%rsi\n"  
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r9\n"
    "addq %%rax, %%r15\n"
    "adcq %%rdx, %%r14\n"
    "adcq $0, %%r12\n"
    "movq %3, %%rax\n"
    "addq %%r11, %%r15\n" /*q2 into r15*/
    "adcq $0, %%r14\n"
    "adcq $0, %%r12\n"
    "addq %%r13, %%r14\n"
    "movq $0, %%r13\n"
    "adcq $0, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r14\n"
    "movq %%rsi, %%rax\n"  
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "mulq %%r10\n"
    "addq %%rax, %%r14\n"
    "adcq %%rdx, %%r12\n"
    "adcq $0, %%r13\n"
    "addq %%r14, %%r9\n" /* q3 into r9*/
    "adcq $0, %%r12\n"
    "movq %%rsi, %%rax\n"
    "movq $0, %%r14\n"
    "adcq $0, %%r13\n"
    "mulq %%r8\n"
    "addq %%rax, %%r12\n"
    "adcq %%rdx, %%r13\n"
    "adcq %%r14, %%r14\n"
    "addq %%r12, %%r10\n" /*q4 into r10*/
    "adcq $0, %%r13\n"
    "adcq $0, %%r14\n"
    "addq %%r8, %%r13\n" /* q5 into r13 */
    "movq %3, %%rax\n"
    "movq %%rax, %%r12\n"
    "adcq $0, %%r14\n" /*q6 into r14*/
  
/* %q5 input for second operation is %q0 output from first / RBX as the connecting link
    %q6 input for second operation is %q1 output from first / RCX as the connecting link
    %q7 input for second operation is %q2 output from first / R15 as the connecting link
    %q8 input for second operation is %q3 output from first / r9 as the connecting link
    %q9  input for second operation is %q4 output from first / r10 as the connecting link*
    %q10 input for second operation is %q5 output from first / r13 as the connecting link*
    %q11 input for second operation is %q6 output from first  / r14 as the connecting link */    
    
    /* Reduce 385 bits into 258. */

    "mulq %%r10\n"
    "xorq %%r8, %%r8\n"
    "xorq %%r11, %%r11\n"
    "addq %%rax, %%rbx\n" /* q0 output*/
    "adcq %%rdx, %%r8\n"
    "addq %%rcx, %%r8\n"
    "movq %%r12, %%rax\n"
    "mov $0, %%ecx\n"  
    "adcq %%r11, %%r11\n"
    "mulq %%r13\n"
    "addq %%rax, %%r8\n"
    "adcq %%rdx, %%r11\n"
    "movq %%rsi, %%rax\n"
    "adcq %%rcx, %%rcx\n"
    "mulq %%r10\n"
    "addq %%rax, %%r8\n" /* q1 output */
    "movq %%r12, %%rax\n"
    "adcq %%rdx, %%r11\n"
    "adcq $0, %%rcx\n"
    "xorq %%r12, %%r12\n"
    "addq %%r15, %%r11\n"
    "adcq %%r12, %%rcx\n"
    "movq %%rax, %%r15\n"
    "adcq %%r12, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r13\n"
    "addq %%rax, %%r11\n"
    "adcq %%rdx, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r10, %%r11\n" /* q2 output */
    "adcq $0, %%rcx\n"
    "adcq $0, %%r12\n"
    "addq %%r9, %%rcx\n"
    "movq %%rsi, %%rax\n"
    "adcq $0, %%r12\n"
    "mulq %%r14\n"
    "addq %%rax, %%rcx\n"
    "adcq %%rdx, %%r12\n"
    "addq %%r13, %%rcx\n"    /* q3 output */
    "adcq $0, %%r12\n"
    "movq %%r15, %%rax\n"
    "addq %%r14, %%r12\n" /* q4 output */
    
/* %q1 input for next operation is %q0 output from prior / RBX as the connecting link
    %q2 input for next operation is %q1 output from prior / r8 as the connecting link
    %q3 input for next operation is %q2 output from prior / R11 as the connecting link  
    %q4 input for next operation is %q3 output from prior / RCX as the connecting link
    %q5 input for next operation is %q4 output from prior / r12 as the connecting link   */
        
    /* Reduce 258 bits into 256. */

    "mulq %%r12\n"  
    "addq %%rbx, %%rax\n"
    "adcq $0, %%rdx\n"
    "movq %%rax, %%r14\n"  /* 0(q2) output */
    "movq %%rdx, %%r9\n"
    "xorq %%r10, %%r10\n"
    "addq %%r8, %%r9\n"
    "movq %%rsi, %%rax\n"
    "adcq %%r10, %%r10\n"
    "mulq %%r12\n"
    "addq %%rax, %%r9\n" /* 8(q2) output */
    "adcq %%rdx, %%r10\n"
    "xor %%ebx, %%ebx\n"
    "addq %%r12, %%r10\n"
    "adcq %%rbx, %%rbx\n"
    "movq $0xffffffffffffffff, %%r8\n"
    "addq %%r11, %%r10\n" /* 16(q2) output */
    "movq $0, %%r11\n"
    "adcq $0, %%rbx\n"
    "addq %%rcx, %%rbx\n"  /* 24(q2) output */
    "adcq $0, %%r11\n" /* c  output */

    
/*FINAL REDUCTION */
    
/*    r14 carries ex 0(%%rdi),
       r9 carries ex 8(%%rdi),
       r10 carries ex 16(%%rdi),
       rbx carries ex 24(%%rdi)
       r11 carries c */
    "movq $0xbaaedce6af48a03b,%%r12\n"
    "movq $0xbaaedce6af48a03a,%%rcx\n"
    "movq $0xbfd25e8cd0364140,%%r13\n"
    "cmp   %%r8 ,%%rbx\n"
    "setne %%dl\n"
    "cmp   $0xfffffffffffffffd,%%r10\n"
    "setbe %%al\n"
    "or     %%eax,%%edx\n"
    "cmp  %%rcx,%%r9\n"
    "setbe %%cl\n"
    "or     %%edx,%%ecx\n"
    "cmp  %%r12,%%r9\n"
    "movzbl %%dl,%%edx\n"
    "seta  %%r12b\n"
    "cmp  %%r13,%%r14\n"
    "movzbl %%cl,%%ecx\n"
    "seta  %%r13b\n"
    "not   %%ecx\n"
    "not   %%edx\n"
    "or     %%r13d,%%r12d\n"
    "movzbl %%r12b,%%r12d\n"
    "and   %%r12d,%%ecx\n"
    "xor    %%r12d,%%r12d\n"
    "cmp   %%r8,%%r10\n"
    "sete  %%r12b\n"
    "xor   %%r13d,%%r13d\n"
    "and   %%r12d,%%edx\n"
    "or     %%edx,%%ecx\n"
    "xor   %%edx,%%edx\n"
    "add  %%ecx,%%r11d\n"
    "imulq %%r11,%%r15\n"
    "addq  %%r15,%%r14\n"
    "adcq  %%rdx,%%r13\n"  
    "imulq %%r11,%%rsi\n"
    "xorq %%r15,%%r15\n"
    "xor   %%eax,%%eax\n"
    "movq  %%r14,0(%q2)\n"
    "xor   %%edx,%%edx\n"
    "addq %%r9,%%rsi\n"
    "adcq %%rdx,%%rdx\n"
    "addq %%rsi,%%r13\n"
    "movq %%r13,8(%q2)\n"
    "adcq %%rdx,%%r15\n"
    "addq %%r11,%%r10\n"
    "adcq %%rax,%%rax\n"
    "addq %%r15,%%r10\n"
    "movq %%r10,16(%q2)\n"
    "adcq $0,%%rax\n"
    "addq %%rbx,%%rax\n"
    "movq %%rax,24(%q2)\n"
    : "=D"(r)
    : "S"(l), "D"(r), "n"(SECP256K1_N_C_0), "n"(SECP256K1_N_C_1)
    : "rax", "rbx", "rcx", "rdx", "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15", "cc", "memory");      
    
#else
    uint64_t l[8];
    secp256k1_scalar_sqr_512(l, a);
    secp256k1_scalar_reduce_512(r, l);
#endif    
}


This, measured right now, gives

(original)
scalar_sqr: min 0.134us / avg 0.135us / max 0.136us
scalar_mul: min 0.141us / avg 0.143us / max 0.144us
scalar_inverse: min 40.5us / avg 40.6us / max 40.9us

(my hacked version - only gcc)
scalar_sqr: min 0.122us / avg 0.122us / max 0.122us
scalar_mul: min 0.126us / avg 0.127us / max 0.127us
scalar_inverse: min 36.7us / avg 36.9us / max 37.1us

The original code is very readable for maintenance (unlike my crap), but if one disassembles it, it shows something like this:

1) mul512 or sqr512 starts and then writes its output to variables

2) Then we have pops and pushes for the next function which is reduce512

3) The reduce512 function imports the data from the outputs of #1

4) Reduce512 goes in 3 stages, with each stage writing its own distinct output to variables and the next stage importing it as its input. (The three stages can be streamlined by merging them and always using registers. Distinct output and input points then become redundant: fewer moves and no need for variables.)

5) As reduce512 ends, it puts its own output to variables

6) Final reduction imports the output of (5) and processes it.
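
For readers who do not read asm comfortably, here is a minimal sketch in portable C of what step 1 computes and of what the `(c0,c1,c2) += a * b` comments in the asm stand for. It assumes GCC or Clang (for `unsigned __int128`); the names `muladd` and `mul_512_sketch` are illustrative and this is not the library's code. The asm rotates the roles of three registers from column to column, whereas this sketch shifts the accumulator down instead.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* GCC/Clang extension (assumption) */

/* (c0,c1,c2) += a * b, with (c0,c1,c2) a 192-bit accumulator, c0 lowest. */
static void muladd(uint64_t *c0, uint64_t *c1, uint64_t *c2,
                   uint64_t a, uint64_t b) {
    u128 t = (u128)a * b;
    uint64_t tl = (uint64_t)t;
    uint64_t th = (uint64_t)(t >> 64); /* at most 2^64 - 2 */
    *c0 += tl;
    th += (*c0 < tl);   /* carry out of the low limb; th cannot overflow */
    *c1 += th;
    *c2 += (*c1 < th);  /* carry out of the middle limb */
}

/* Schoolbook 4x4 limb multiply: l[0..7] = a[0..3] * b[0..3], column by column. */
static void mul_512_sketch(uint64_t l[8], const uint64_t a[4], const uint64_t b[4]) {
    uint64_t c0 = 0, c1 = 0, c2 = 0;
    for (int k = 0; k < 7; k++) {
        for (int i = 0; i < 4; i++) {
            int j = k - i;
            if (j >= 0 && j < 4) {
                muladd(&c0, &c1, &c2, a[i], b[j]);
            }
        }
        l[k] = c0;      /* "Extract l<k>" in the asm comments */
        c0 = c1;        /* shift the accumulator down by one limb */
        c1 = c2;
        c2 = 0;
    }
    assert(c1 == 0);    /* a 256x256-bit product always fits in 512 bits */
    l[7] = c0;
}
```

In the original code every `l[k]` is a store to memory. In the hacked version above, the upper limbs are left in registers for the reduction to pick up.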

My rationale was that if mul512 (or sqr512) + reduce512 + final reduction go together in one asm, one saves a lot of inputs/outputs and pops/pushes. Plus code size goes down significantly (1300 bytes => 1000 bytes), which leaves some extra L1 cache for other stuff. Reduce512/mul512/sqr512 still exist as code (altered) but they aren't really called. What gets called is the unified secp256k1_scalar_mul and secp256k1_scalar_sqr, which have everything inside them. This was a proof of concept, so to speak, because I was looking at the disassembled output and I was like "AARRRGGHHH, why can't one stage or function simply forward its results in registers, instead of all this pushing and popping and RAM and variables?".
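
To make the staged versus merged data flow concrete, here is a toy sketch in C, scaled down to a single 64-bit limb. It is an assumption-laden illustration, not the library's algorithm: the modulus `TOY_P = 2^64 - 59` and every `toy_*` name are made up for this example, and it needs GCC or Clang for `unsigned __int128`. The constant `TOY_C` plays the role that `SECP256K1_N_C_0`/`SECP256K1_N_C_1` play in the real code: since 2^64 is congruent to `TOY_C` modulo `TOY_P`, the high half can be folded into the low half by multiplying it with `TOY_C`, which is the same idea behind the "Reduce 385 bits into 258" stage.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 u128; /* GCC/Clang extension (assumption) */

/* Toy modulus P = 2^64 - C, so 2^64 is congruent to C (mod P). */
#define TOY_C 59ULL
#define TOY_P (0ULL - TOY_C)

/* One folding stage: hi * 2^64 + lo becomes hi * C + lo (same value mod P). */
static u128 toy_fold(u128 t) {
    return (u128)(uint64_t)(t >> 64) * TOY_C + (uint64_t)t;
}

/* Staged, like the original: the wide product is written out to l[]... */
static void toy_mul_wide(uint64_t l[2], uint64_t a, uint64_t b) {
    u128 t = (u128)a * b;
    l[0] = (uint64_t)t;
    l[1] = (uint64_t)(t >> 64);
}

/* ...and a second function reads it back in before reducing. */
static uint64_t toy_reduce_wide(const uint64_t l[2]) {
    u128 t = ((u128)l[1] << 64) | l[0];
    t = toy_fold(t);            /* 128 bits -> below 2^70 */
    t = toy_fold(t);            /* -> at most 2^64 + 3480 */
    t = toy_fold(t);            /* -> fits in 64 bits */
    assert((t >> 64) == 0);
    uint64_t r = (uint64_t)t;
    return r >= TOY_P ? r - TOY_P : r;  /* final reduction */
}

static uint64_t toy_mulmod_staged(uint64_t a, uint64_t b) {
    uint64_t l[2];
    toy_mul_wide(l, a, b);
    return toy_reduce_wide(l);
}

/* Merged: the same arithmetic, but intermediates never leave local variables. */
static uint64_t toy_mulmod_fused(uint64_t a, uint64_t b) {
    u128 t = toy_fold(toy_fold(toy_fold((u128)a * b)));
    uint64_t r = (uint64_t)t;
    return r >= TOY_P ? r - TOY_P : r;
}
```

Both variants return the same result. An optimizing compiler may well inline the staged C version into the merged one by itself; separate hand-written asm blocks get no such help, which is where the extra stores and loads come from.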

For example, this is the behavior between (5) and (6) in the disassembled output of the original reduce512:

  406246:   4c 89 4f 10             mov    %r9,0x10(%rdi)
  40624a:   4d 31 c9                xor    %r9,%r9
  40624d:   49 01 f0                add    %rsi,%r8
  406250:   49 83 d1 00             adc    $0x0,%r9
  406254:   4c 89 47 18             mov    %r8,0x18(%rdi)
  406258:   4c 89 cb                mov    %r9,%rbx
  40625b:   4c 8b 5f 18             mov    0x18(%rdi),%r11
  40625f:   48 8b 77 10             mov    0x10(%rdi),%rsi

My thoughts were like "OK, these RAM moves are redundant and HAVE TO GO. Why should r8 write to RAM and then get reimported from RAM to r11? Why should r9 go to RAM and get re-imported instead of going straight to rsi? Waste of time". I had the same reaction every time I spotted data going out and then getting moved back in as input instead of being used as is.

Still, the source is very readable the way it is right now, and the performance cost is not that large compared to the benefit of understanding what each thing does.
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 05:02:30 PM
#29
Why don't you want to discuss the protocol  changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.
Please explain to me: if it isn't about your ego, then what is it about?

Oh... No comments?
Well, let me guess, then...

You can't talk about it, because it happens that this specific feature is being researched by a company that you work for and signed an NDA with?

Maybe it's indeed not about your ego.
Maybe it's just about money.
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 02:41:42 PM
#28
Mind that later I spent some time with sipa asking him how to implement the BIP 37 solution for fetching the missing transactions. He came back to me the next day saying that it wasn't actually possible.
So yes, I was listening.
But at that moment it was pretty clear that in order to improve block downloading times some network protocol changes were needed, whilst the message I got from you was that you weren't willing to make any changes because you didn't seem to see block downloading times as a problem worth solving.

Just like at this moment you don't seem to see the bootstrapping time to be a problem worth solving.
Although not because it's not really worth solving, but because you want to solve it yourself.
Because you only care about progress in Bitcoin development when it goes along with indulging your ego.
If it doesn't, then it has to wait until you have time.
staff
Activity: 4326
Merit: 8951
January 14, 2017, 01:59:07 PM
#27
No - I came to you saying that I would like to work on improving the block download times.

I really suggest you read the log. That might be what you were intending on doing; but what you did was come and insist on a specific protocol change to allow downloading blocks in tiny chunks.

We explained why this was unlikely to improve performance, was likely to create vulnerabilities, and suggested some things that we thought more likely to be successful.  We also suggested you try making the changes in your own protocol so you could measure the result, and not just believe us or rely on speculation.

Quote
Why don't you want to discuss the protocol  changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.

Huh? Because it's far far offtopic and you're just switching subjects-- but I've talked about this many times here before. If you're referring to just trusting miners to give you a correct UTXO snapshot it is a _massive_ change in the security model-- and if you're happy with that change you can just use SPV.  If you're talking about static cached UTXO data, then there isn't any consensus change needed... and software implementing that just ... implements it.

Quote
And you told me off,  saying basically that it wasn't necessary. Plus also some other bullshit about BIP37 that didn't turn out to be true.

Code:
10:03 < tonikt> Like being able to download a block in fragments
10:03 < sipa> tonikt: BIP37 allows that, in a way
[...]
10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.

What you were specifically suggesting wasn't necessary or helpful. And what we told you about BIP37 was true (afaict): you could have used it to download chunks of blocks. (And though it wasn't mentioned there, Pieter had already tried it out, though mostly to eliminate already transferred txns... but the overheads made it not a useful improvement.)

Quote
But then few years later,  suddenly improving the block download times became so much desirable and you've been so proud of you delivering it to the public.
Pathetic
Don't confuse a specific proposal being unhelpful for the goal you hoped it would serve with that goal itself being unhelpful.

You were very aggressively pushing a particular change based on a flawed belief that it would help. All I was doing was pointing out why I didn't expect that particular change to be helpful (transferring data in 32KB chunks would be slower due to latency/overhead, and more vulnerable), as well as suggesting what you could do to prove out the effectiveness of the change, or to do it on your own without worrying about what we thought. It seems you only understand working on things exactly the way you think they should be done, and believe that anyone who doesn't is saying they don't care about improvement. I think you should consider some other possibilities.
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 01:49:35 PM
#26
Where did I say that compact blocks were 'my proposal'?

No - I came to you saying that I would like to work on improving the block download times.
And you told me off,  saying basically that it wasn't necessary. Plus also some other bullshit about BIP37 that didn't turn out to be true.

Code:
10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.

But then few years later,  suddenly improving the block download times became so much desirable and you've been so proud of you delivering it to the public.
Pathetic
 
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 01:37:18 PM
#25
I'm just explaining you how you are being perceived.
Whether your intentions were not to tell the guy off, but maybe only to encourage him into a more productive way of thinking - that's a different story.

And no, I'm not going to look through the IRC logs from several years ago, just to prove to myself that I haven't dreamed about something and it actually happened. Because, believe me or not, I seriously don't give a fuck whether you believe me or not :)

Why don't you want to discuss the protocol  changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.
Please explain to me: if it isn't about your ego, then what is it about?
staff
Activity: 4326
Merit: 8951
January 14, 2017, 01:12:24 PM
#24
Repeating the same unsubstantiated claim doesn't make it substantiated. You just made a claim about this very thread-- and yet when we look-- nothing behind it.

Quote
Just me, years ago, trying to talk on #bitcoin-dev about ways to speed up block downloading times. That was a no no - block propagation didn't need improving because you didn't think it was important... It almost got me banned from the channel..  And yet we are; 2016 - a new brilliant feature that the bitcoin core team is so proud of: compact fucking blocks!

The logs are all public, please point us to this conversation: http://bitcoinstats.com/irc/bitcoin-dev/logs/2011/01   If you can recall a name/phrase from the discussion, I will happily search for it.

Edit: Reading all logs from you in that channel only took a couple minutes, here is the conversation (trimming irrelevant side discussions):


10:02 < tonikt> Hi guys. I was wondering whether there have been any discussion on changes in the network protocol?
10:03 < tonikt> Like being able to download a block in fragments
10:03 < grau> tonkit: there is already bloom filter download for blocks
10:03 < tonikt> Or: see a transaction size, before downloading it
10:03 < sipa> tonikt: BIP37 allows that, in a way
10:03 < tonikt> grau: bloom filter is not good for full node
10:04 < grau> block fragments are no good for full node either
10:04 < tonikt> grau: why not?
10:04 < sipa> you can have two complementary bloom filters set on two connections
10:04 < sipa> and download half blocks from each
10:04 < tonikt> so the answer is 'no'?
10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.
10:05 < grau> tonkit: a fragment can not prove that funds are not spent in the other part
10:05 < sipa> the answer is 'partially', BIP37 supports (some way) of downloading blocks in fragments
10:05 < tonikt> I believe the network could really use being able to download one block from diffetent peers in i.e. 32KB chunks
10:05 < sipa> that'd only slow things down imho
10:05 < tonikt> gmaxwell: but I was checking these bloom filters and it wasn't possible to download a block like that
10:05 < sipa> and there is no way to prove that a transaction has a certain size without actually sending that transactions
10:06 < sipa> tonikt: it is
10:06 < gmaxwell> tonikt: it is, but as sipa and I are telling you— it's not obviously useful.

10:06 < grau> tonkit: it is already allowed by the protocol to use several peers for download. BOP actually does that by block not by block fragment
10:06 < tonikt> sipa: yes - it isnt possible to confirm it, but if you see a mismatch you can at least ban the peer
10:06 < sipa> tonikt: set nHashFunctions=1, and use random complementary bits in the filter
10:06 < tonikt> grau: I al talking about a fully synchronized node
10:06 < gmaxwell> tonikt: what are you talking about?
10:07 < tonikt> you suddenly get an INV with a new block - you want to download it ASAP
10:07 < sipa> what is the actual problem you're trying to solve?
10:07 < tonikt> ... so why not to split the work into parts and downlaod it in paralell?
10:07 < sipa> we've just told you how to do that using BIP37

10:07 < tonikt> wait, I cannot read that BIP37 cause wiki is broken
10:08 < gmaxwell> Because doing so will _slow_ transmission except to the extent that it gets you an unfair share of the channel capacity.
10:08 < tonikt> maybe you are talking about a different thing :)
10:08 < sipa> 19:06:43 < sipa> tonikt: set nHashFunctions=1, and use random complementary bits in the filter
10:08 < sipa> tonikt: in practice it'd be very hard to coordinate that, but if block sizes would grow a lot, that may be a viable strategy
10:09 < tonikt> just to be clear: BIP37 was about downloading the header and the tx hashes, followed by the actual transactions?
10:09 < sipa> BIP37 = bloom filtering
10:09 < tonikt> yes
10:09 < tonikt> so how does it help me to split a block download amoung my 50 peers?
10:09 < sipa> so you give node A a random filter that selects 50% of the transactions
10:09 < sipa> and you give node B a complementary filter that selects the other 50% of the transactions
10:10 < tonikt> yes, but I have more than 2 peers.
10:10 < sipa> ok, then give node A a filter that selects 33% of the transactions
10:10 < tonikt> your solution seems more like a work around
10:10 < sipa> it's a very neat solution, as you don't need to keep track of what to download from whom
10:10 < tonikt> ok, sorry. let me ask a question then
10:10 < sipa> they figure it out themself
10:11 < grau> tonkit: and you seem to work around a problem not really there until blocks are the size we know.
10:11 < gmaxwell> tonikt: you're just going to end up in N connections all in slow start. Plus users setting your house on fire because you use _all_ their bandwidth in a burst once the connections come out of slowstart.
10:11 < sipa> maybe this will become an actual problem if the block size limit is increased
10:11 < sipa> if it is, we'll deal with it
10:11 < tonikt> why cant there be like a command "getsomething" that would return me the length of that something, plus a list of hashes of its data split into i.e. 4KB chunks
10:11 < tonikt> ... and that would be same for txs and blocks
10:12 < gmaxwell> sipa: sure if there are larger blocks then eventually at some point it makes sense. A few hundred K is really sketchy for any benefit there.
10:12 < tonikt> gmaxwell: no - now I use their bandwidth in a burst, asking each peer for entire block
10:12 < tonikt> I'm talking about a situation when we have a new block mined
10:12 < tonikt> which is every 10 minutes
10:12 < gmaxwell> You don't ask each peer for the entire block.
10:13 < tonikt> well, you should if you care to have it ASAP :)
10:13 < tonikt> I do :)
10:13 < gmaxwell> You ask a single peer for the entire block, or as sipa points out multiple peers for mutually exclusive subsets.
10:13 < tonikt> yes, I understand that is the status
10:13 < gmaxwell> tonikt: then you're abusing the network.
10:13 < gmaxwell> and also hurting your own performance.
10:13 < tonikt> but I was talking about a possible improvement
10:13 < tonikt> that is how bittorrent works
10:14 < tonikt> yes, call me an abuser :)
10:14 < sipa> do you use bittorrent for 1 MB data?
10:14 < tonikt> sometimes
10:14 < gmaxwell> sipa: normally torrents use 4-16 mb chunks.
10:14 < sipa> gmaxwell: i know
10:14 < tonikt> no
10:14 < gmaxwell> tonikt: what you're describing there sucks, because you can't tell who's screwed you and given you invalid data. With what sipa told you to do you can tell.
10:14 < sipa> it can use smaller chunks
10:14 < tonikt> I'm sure they use smaller ones - like 32KB
10:15 < sipa> gmaxwell said "normally"
10:15 < gmaxwell> tonikt: no, yes— it can, but thats not what actually gets used normally (except for tiny files, which torrent takes much longer to transfer than http)
10:15 < tonikt> OK, I get it - someone can send me a corrupt list of block's chunks hashes
10:15 < tonikt> but then...
10:15 < sipa> this is an engineering question, and it depends on very specific data (latency, bandwidth, variation, distribution)
10:16 < sipa> once it becomes a problem
10:16 < sipa> we can find an appropriate solution
10:16 < sipa> and there are several ways to deal with it

10:16 < tonikt> why don't you add a protocol command "give me this tx from this block?"
10:16 < sipa> because that would be absolutely horrible for the peers
10:16 < sipa> they need to look up the block for you, but only give you a small piece of it

10:16 < tonikt> sipa: with all due respect, but for you to find a solution, one needs to wait 2 years on average Smiley
10:17 < tonikt> It is actually quite easy to calculate
10:17 < tonikt> how much time you need to download lets say 1MB block
10:17 < tonikt> having 1Mbps connection - about 10 seconds
10:17 < sipa> right now the problem simply doesn't exist with 1 MB blocks, unless on very slow links (mobile?) where you don't want full blocks anyway
10:18 < tonikt> 10 seconds from peer to peer (and that's not counting checking it)
10:18 < tonikt> don't you think it is already enough to try decreasing it by a few fold?
10:20 < gmaxwell> tonikt: go convince bittorrent to never use chunks larger than 100k and come back. Tongue
10:20 < tonikt> I believe all the API is there already, except that it should be able to download a tx that is already mined, though by specifying a block where it is
10:20 < sipa> if you're on a 1 Mbps connection, there's no way you'll get it faster than in 10s anyway
10:20 < tonikt> gmaxwell: but bittorrent does not care about latency
10:20 < tonikt> while bitcoin should
10:21 < sipa> if it's your peer that is limited to 1Mbps, but your connection is faster, you won't be downloading from him (as he'll be slower to announce it to you anyway)
10:21 < tonikt> ok, so you dont want to change the net protocol, before it becoming a problem
10:21 < gmaxwell> tonikt: generally parallel fetching is not great for latency.
10:21 < tonikt> fine
10:21 < gmaxwell> As you end up waiting on the most latent response before you can validate any of it.
10:22 < sipa> tonikt: i'm against changing the protocol before having clear information about the benefits
10:22 < sipa> and no, i don't think right now there is much that can be improved
10:22 < tonikt> gmaxwell: think if I could ask a node for a block and all its transaction hashes - and other nodes for transactions
10:22 < gmaxwell> tonikt: no, we're also saying that for the current maximum blocksize what you're suggesting is likely to _hurt_. Without a bunch of analysis it would be hard to know.
10:22 < sipa> in the future that can certainly change

10:22 < tonikt> I just dont see a problem with adding a block hash to inv while asking for tx data
10:23 < tonikt> it doesn't seem like a development challenge
10:23 < sipa> it's not
10:23 < tonikt> and it could solve the problem
10:23 < sipa> it's a maintenance overhead
10:23 < sipa> and a compatibility burden
10:23 < gmaxwell> We could also throw some virgins into volcanos, that might "solve the problem"
10:23 < gmaxwell> (not that you've established that there is even a problem to be solved)
10:24 < tonikt> ok, I get it - you don't see a problem
10:24 < sipa> *yet*
10:24 < tonikt> you are obviously not so much a perfectionists, as I am Tongue
10:24 < sipa> well we're not dealing with a nicely theoretical problem where the optimal solution is obvious
10:24 < sipa> everything has downsides
10:24 < gmaxwell> tonikt: I don't see evidence of perfectionism in you in this discussion. If there were you'd be doing some careful analysis to establish some tests to determine the level of improvement possible.
10:25 < gmaxwell> Instead you're just shooting from the hip with a blind guess at something that would maybe help or hurt a problem which may exist in the future.
10:25 < tonikt> gmaxwell: all I can tell you is that, if I want to download a block ASAP, I ask each of my peers for the entire one - is this perfect for you?
10:26 < gmaxwell> tonikt: that won't actually fetch you the block faster than asking a single peer in many cases.

10:26 < tonikt> if I could download it in parts/transactions - that would be perfect, unless I'd screw up my implementation
10:26 < gmaxwell> No, in fact it wouldn't be.
10:26 < tonikt> Smiley no, it would
10:27 < sipa> you'd still have to wait for the slowest one to respond
10:27 < gmaxwell> tonikt: or look at it another way— why not ask them each for one bit of it?
10:27 < gmaxwell> as sipa says, you have to wait for the slowest response.
10:27 < sipa> if the only constraint is bandwidth, and processing speed and latency don't exist, your solution is optimal
10:27 < gmaxwell> sipa: and overhead doesn't exist.
10:27 < tonikt> guys, do I really need to explain you how to implement it?
10:27 < sipa> and attackers
10:28 < tonikt> you just have a list of txs to download and your peers - whenever any of them is busy, you ask him for the next tx
10:28 < gmaxwell> Right, so in a world where the only constrain is the remote peers bandwidth, and process speed, latency, attackers, and overhead don't exist— then indeed, thats optimal.
10:28 < tonikt> it must be faster
10:28 < sipa> how do you know he's busy?
10:28 < sipa> by waiting?
10:28 < tonikt> i know he is busy because he has not responded to my previous data request yet
10:29 < gmaxwell> It's unknowable without waiting because of latency.
10:29 < tonikt> so it does not make any sense to ask him for more data
10:29 < gmaxwell> tonikt: if he's 80ms away he _cannot_ answer faster than 80ms.
10:29 < tonikt> sure - but you have 1000 txs
10:29 < sipa> and you need all of them
10:29 < tonikt> eventually he will answer and you will have 900+ left anyway
10:29 < gmaxwell> so you're going to fetch them one at a time without pipelining? lol good luck with that.
10:29 < tonikt> and then you will ask him for the 913th one
10:29 < bmcgee> … I get the impression now's not a good time for asking potentially stupid questions …
10:30 < tonikt> gmaxwell: it's at least 200+ bytes
10:30 < gmaxwell> bmcgee: well, your question doesn't have a neat answer.
10:30 < sipa> tonikt: that is all very sensible, given the right bandwidth/latency tradeoffs
10:30 < tonikt> but youre right, it would be better to ask for several tx at the same time
10:30 < sipa> tonikt: it's something you certainly want to do at the ~megabyte level
10:30 < tonikt> except that some of them may be 100kb big
10:30 < gmaxwell> tonikt: lol. just the TCP overhead from the request is going to instantly give you 50% overhead on 200 bytes.
10:30 < sipa> (and we don't by the way, so let's fix that first)
10:31 < tonikt> gmaxwell: now it gives you much more than 50%
10:31 < tonikt> most of the txs you have already anyway
10:31 < gmaxwell> tonikt: no, it doesn't the overhead on transfering a block is about 2%.
10:31 < sipa> tonikt: again BIP37 to the rescue (it doesn't send transactions it knows you already have, without extra latency)
10:32 < gmaxwell> okay, ignoring that. But as sipa says bip37 takes care of that.
10:33 < tonikt> gmaxwell: but using bip37 is working around a solution. if I need this tx from this block - why do I need to bother with bloom filters and statistics?
10:33 < sipa> tonikt: because it allows you to do things without extra latency
10:33 < sipa> tonikt: you don't have to be told about the list of transactions first, and you don't have to reply with which transactions you want
10:34 < gmaxwell> tonikt: statistics?! you set a series of complementary bits, and don't have to take another 80ms of round trip time to send extra tiny requests with redundant hashes.
10:34 < tonikt> sipa: but the biggest latency comes not from the ping - it comes from the data that needs to be finished, before they can be used
10:34 < gmaxwell> @#@*$(@#
10:34 < sipa> tonikt: on a 10 Mbit/s link, a not outrageous 200ms ping time (meaning 400ms extra for a roundtrip) means 400 kilobyte that could have been downloaded while they just waited for you to answer
10:34 < gmaxwell> tonikt: go look at the bandwidth numbers you gave before! on a 1mbit connection the data to transfer a few hundred bytes is way less than typical latency.
10:35 < gmaxwell> er the time to transfer.
10:35 < sipa> that's more than an average block
10:36 < tonikt> ok guys, whatever. I see you have your world and don't really want to notice  mine.  I guess I will have to wait for it to become a problem Smiley

10:36 < tonikt> but if I might add something, not as a question, but as a proposal
10:37 < gmaxwell> Yes, my world has latency in it. Not sure where you get one that doesn't, but I'd like one of those. Tongue
10:37 < tonikt> 1) allow to ask for a size of a transaction/block before downloading it (so you can ban anyone who is trying to send you more)
10:37 < sipa> tonikt: what if the peer lies?
10:38 < gmaxwell> sipa: there are no attackers in tonikt's world.
10:38 < tonikt> 2) imagine that you are connected to 30 peers and a new 1MB block have just been mined: what is the fastest way to download it from your peers?
10:38 < sipa> depends on your bandwidth and latency
10:38 < gmaxwell> tonikt: if you think you can do better, just write a second transfer protocol. If it's better it should be easy to demonstrate. You'll probably learn something in the process.

10:38 < tonikt> sipa: if the peer lies, than you will find it out and ban it
10:38 < sipa> with very low bandwidth and low latency at the same time, downloading in parallel will certainly be faster
10:39 < tonikt> gmaxwell: believe me, I can write a protocol, but it would be quite silly to test it in my home network
10:39 < gmaxwell> tonikt: then use a network simulator.
10:39 < gmaxwell> It's pretty straight forward to simulate actual network behavior.
10:39 < sipa> tonikt: well to deploy it, we'd first need a way to download *anything* in parallel first

10:40 < tonikt> gmaxwell: but I dont need to simulate it to know that downloading a block in parts from several peers at the same time will be faster
10:40 < gmaxwell> tonikt: But you're incorrect. The way you're describing that involves lots of round trips will actually be _slower_ in a case where there is considerable latency.
10:41 < gmaxwell> The exact balance depends on a number of factors.
10:41 < gmaxwell> Basically you never want to make a request that is smaller than the bandwidth delay product.

10:41 < tonikt> gmaxwell: no, because shorter transactions could go in bulk (like 20 * 200+ bytes)
10:41 < tonikt> .. thats why you need to have a way to find out tx size before downloading it
10:42 < sipa> and just finding out that size means an extra round-trip, which (in some cases) may be slower than just downloading the whole block
10:42 < gmaxwell> Now you're transmitting a bunch of data to make those decisions, and then you can do nothing until it shows up. During that time you could have sent a whole block.
10:43 < tonikt> I think we should be talking numbers here, otherwise it's just baseless accusations
10:43 < gmaxwell> There are numbers above.
10:44 < tonikt> Can we agree that an average node would have 1mbps upload speed?
10:44 < tonikt> download is bigger - I know
10:44 < gmaxwell> Then when you get bad data it's impossible to tell who is giving you bad data until you have the whole block... which means that you have to fetch it from one peer if some peer is giving you bad data.. pretty cheap dos attack.

10:45 < tonikt> gmaxwell: of course you can say who sent you bad data, because you ask for transactions, which hashes you know
10:45 < gmaxwell> tonikt: e.g. I give you the wrong hashes for the block.
10:45 < tonikt> gmaxwell: with a proper difficulty? Smiley
10:45 < gmaxwell> huh?!
10:46 < tonikt> I can live with that
10:46 < gmaxwell> ...
10:46 < tonikt> I will compare the hashes against the merkle from the block
10:46 < sipa> which you don't have yet?
10:46 < gmaxwell> So now you have to fetch the whole merkle tree first. keep adding overhead.. (you'll end up with bip37 in a few more minutes)

10:46 < tonikt> The header is 80 bytes long
10:47 < sipa> the average transaction is 250 bytes or so
10:47 < sipa> a txid is 32
10:47 < tonikt> the minimal transaction is 250 or so Smiley
10:47 < sipa> that means you have to download 1/8 of the block's size before you can make any decision
10:48 < tonikt> I can live with that as well
10:48 < gmaxwell> tonikt: and again, you don't need to change the p2p protocol to experiment; you can just use an alternative p2p protocol, and simulate actual internet conditions.

10:48 < tonikt> its still 8:1 compression Smiley
10:48 < tonikt> I know what I can experiment with, guys
10:49 < tonikt> its just that I dont need to experiment to know that it would be a good thing to do
10:49 < gmaxwell> There are some bandwidth delay mixtures where some strategies are better than other ones, and different strategies are better in other conditions.
10:49 < tonikt> like this thing recently, that they fixed
10:49 < tonikt> a peer sends you a longer tx than it should be
10:50 < sipa> that was actually an intentional design decision
10:50 < tonikt> you should ban it - but you cant, because it is likely a legit client, with a bug Smiley
10:50 < gmaxwell> tonikt: imagine, for a moment, that your peers are on mars with a 40 minute latency, and your bandwidth to each peer is 1gbit/sec, and you pay 1 BTC per megabyte transferred in aggregate. What is the optimal strategy?
10:50 < tonikt> gmaxwell: in this case I get your point
10:51 < tonikt> ... but I thought that we were on Earth
10:51 < phantomcircuit> gmaxwell, put your btc node on earth and send it instructions
10:51 < gmaxwell> tonikt: I used an extreme example because you keep rejecting the idea that different situations require different tradeoffs and you keep suggesting ones which have additional round trips, when it's possible to do this _without_ adding them.. so clearly you're not thinking about _something_.
10:51 < sipa> tonikt: banning would be a very bad idea - not because there are buggy clients that add random junk to transactions, but because you'd be hurting the ones forwarding an attacker's junk, not the attacker themself
10:52 < tonikt> so really, please, add an option to ask for tx/block size without a need to download it and allow to do getdata for tx giving a block hash as a reference - that's all I ask Smiley
10:52 < gmaxwell> banning on transitive behavior is a superfantastic way to convert mining dos attacks into network partitioning.
10:52 < gmaxwell> tonikt: We will not add that. Sorry.
10:53 < tonikt> gmaxwell: I know Smiley
10:53 < tonikt> but dont tell me later, that I did not suggest it Tongue
10:53 < sipa> tonikt: the first is unauthenticated data (someone can just lie, and you can claim you protect against it, but your optimal behaviour still depends on them being honest - i really prefer solutions that do not need such an assumption)
10:53 < gmaxwell> tonikt: Please create an alternative transport; in the process you'll learn something about the evils of roundtrips for performance, and perhaps come up with a better proposal which is potentially useful.
10:54 < gmaxwell> Even a protocol that depends on honesty would be not the end of the world as an alternative transport: just use it between friends.

10:54 < tonikt> sipa: but if someone lies, you will find it out and you will ban it - that is the whole point
10:54 < gmaxwell> but it's not something that makes a lot of sense as the standard p2p protocol.
10:54 < sipa> tonikt: and you'll still have lost time doing so
10:54 < sipa> tonikt: something the attacker may not care about, but you do
10:55 < tonikt> sipa: yes, but you will pay this time to ban the bastard. now you have the same problem, but you cannot ban the bastard
10:55 < gmaxwell> tonikt: IPs are cheap, we regularly get trolls on IRC with access to thousands of IPs. Someone doing that could force all your nodes into wasting a ton of bandwidth and fall back to single peer fetching.
10:55 < gmaxwell> (and make you take many times longer to fetch the block)
10:55 < gmaxwell> I fully endorse my mining competition adopting such a protocol. Tongue
10:55 < tonikt> gmaxwell: 99kb is probably cheaper than in IP
10:55 < tonikt> an*
10:56 < tonikt> so I can download 99kb from the IP,  just to ban it for being wrong
10:56 < tonikt> especially if I know the size up front and I do not download anything bigger than 10kb
10:57 < gmaxwell> or, you could, you know, use a protocol which doesn't depend on unauthenticated data and which doesn't require extra round trips.. and still fetches in parallel (if the bandwidth/latency ratios make it profitable to do so)
10:57 < tonikt> .. and if I see the 10001th byte - I can ban it already at that moment

10:58 < gmaxwell> And, in fact, BIP37 already gives us that, along with automatic (zero round trip) elimination of known-already-sent data.
10:58 < tonikt> bip37 was a nice invention. the only problem with it is that nobody wants to use it
10:59 < tonikt> why wont you once invent something that people would want to use it, for a change? Wink
10:59 < gmaxwell> what are you talking about??
10:59 < gmaxwell> lol
10:59 < gmaxwell> every peer connected to me at the moment supports bip37.
10:59 < tonikt> supports, but does not take advantage of it
10:59 < gmaxwell> Now I think you're just trolling.
11:00 < tonikt> I guess you can always find someone who'd kick me out Smiley
11:00 < sipa> between satoshi clients, he's right
11:00 <@gmaxwell> I suppose I could. Tongue
11:00 <@gmaxwell> but seriously, what the heck.
11:00 < sipa> but my cell phone just loves bip37

11:01 < gmaxwell> tonikt: all the bitcoinj clients happily use it— but we don't think parallel fetching is currently useful in the satoshi client.
11:01 < phantomcircuit> iirc the fetch queue ends up pulling more blocks even before the queue has been processed right
11:01 < gmaxwell> After a bunch of architectural changes it might be useful.

11:02 < sipa> it's definitely useful at the block level
11:02 < phantomcircuit> so the pipeline stays full
11:02 < sipa> and we don't even do it there
11:02 < sipa> and we absolutely should
11:02 < gmaxwell> sipa: ::nods::
11:03 < sipa> phantomcircuit: yes, you can have up to 500 queued getdata requests
11:03 < sipa> which may take minutes to download
11:03 < gmaxwell> Doesn't involve the unauthenticated data / latency tradeoffs. And couple hundred k blocks are large enough that there isn't an overhead tax from doing that, beyond fetching the headers separately.
11:11 < tonikt> so anyway guys, to wrap up, I did not mean to be mean, just to indicate my needs. and I appreciate your advice, but I am not going to make a network simulation just to convince you Smiley


So, you proposed instead requesting blocks 32KB at a time using more round trips-- basically demanding protocol changes without doing any testing or analysis to determine the benefit. We invited you to try out the protocol to observe the results, not merely telling you why we expected in common cases it would harm performance.

What you proposed is basically the _opposite_ of compact blocks (which eliminates round trips, and avoids transmitting most of the block at all). Sad Disappointing that you are posting here claiming it was your proposal.
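
To put rough numbers on the round-trip argument in the log above, here is a minimal C sketch of the bandwidth-delay product and of a naive fetch-time estimate. The function names `bdp_bytes` and `fetch_time_ms` are made up for illustration, the model ignores TCP slow start, processing time and protocol overhead, and the inputs are the example figures from the chat, not measurements.

```c
#include <stdint.h>

/* Bandwidth-delay product: bytes that could have been in flight
 * during one round trip. Integer math; hypothetical helper. */
uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms) {
    return bits_per_sec * rtt_ms / 8000;   /* /8 for bits, /1000 for ms */
}

/* Naive estimate (assumption): every sequential round trip costs one
 * full RTT, and the payload is limited only by link bandwidth. */
uint64_t fetch_time_ms(uint64_t size_bytes, uint64_t bits_per_sec,
                       uint64_t rtt_ms, uint64_t round_trips) {
    return round_trips * rtt_ms + size_bytes * 8000 / bits_per_sec;
}
```

With the 10 Mbit/s link and 400 ms round trip from the log, `bdp_bytes(10000000, 400)` comes out at 500000 bytes, the same ballpark as the 400 kilobyte quoted there. For a 1 MB block on a 1 Mbit/s link with an 80 ms round trip, `fetch_time_ms(1000000, 1000000, 80, 1)` is 8080 ms, while 50 sequential round trips push the same transfer to 12000 ms under this model.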
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 12:37:38 PM
#23
Seriously man, this happens all the time.
Whenever a person from outside the circle comes with an idea, you either tell him that the idea is stupid or just not worth working on.

I've seen it so many times that I'm sick of it already.
Just me, years ago, trying to talk on #bitcoin-dev about ways to speed up block downloading times. That was a no no - block propagation didn't need improving because you didn't think it was important... It almost got me banned from the channel. And yet here we are; 2016 - a new brilliant feature that the bitcoin core team is so proud of: compact fucking blocks!

Or: How many times have I tried to discuss a way of solving the bootstrapping issue once and for all by extending the protocol to allow a secured distribution of the utxo db?
No fucking way to discuss it, because first you don't find it important, then you are 'way ahead' of me with the design, and then you still don't know how to do the hashing of the fucking records - and that took you like 4 years to realize....
Just like I had said: in the end it's still going to be done, except that it will take at least 10 years, because you're too busy now with other stuff and this specific thing is too big of a deal for your ego to let anyone from outside claim credit for solving it.
staff
Activity: 4326
Merit: 8951
January 14, 2017, 11:49:28 AM
#22
Just go back to the top of this topic.
Guys came with some ideas to optimize the lib - maybe not the most brilliant ones, but also definitely not stupid ones...
And what was your reaction?
You basically told them off.
You do it all the time.

Wtf? Someone asked about two publications, asking if each would be helpful. I responded that one would likely not be, the other would be somewhat, and I pointed out it would be fairly easy to try. I certainly didn't tell them off!

I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production level - just fooling around* with the data flow so that the data get from one end to the other faster, with a smaller code footprint. I've never had them run on anything besides my Q8200, and I'm wondering about the behavior of modern cpus. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) + these 2, and perhaps a time ./tests as a more real-world performance measure.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism down to 57.2 seconds on my underclocked Q8200 @ 1.86 GHz), although the memory version seems faster in the benchmarks. I have a theory on
Neat. You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance, and they spend a lot of their time on random things. Compile with --enable-benchmark and use the benchmarks. Smiley
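
For anyone wanting to time a single routine outside the test binary, here is a minimal C sketch of a harness that reports min / avg / max per call in microseconds, the same shape as the figures posted in this thread. It is not the library's benchmark code: `bench_run`, `bench_stats` and `demo_work` are made-up names, and `clock_gettime` with `CLOCK_MONOTONIC` is assumed to be available (POSIX).

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef struct { double min_us, avg_us, max_us; } bench_stats;

/* Monotonic wall-clock time in microseconds. */
static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Take `count` samples; each sample calls fn(arg) `iters` times and
 * records the average time per call for that sample. */
void bench_run(void (*fn)(void *), void *arg, int count, int iters,
               bench_stats *out) {
    double sum = 0.0;
    out->min_us = 1e300;
    out->max_us = 0.0;
    for (int i = 0; i < count; i++) {
        double begin = now_us();
        for (int j = 0; j < iters; j++) fn(arg);
        double per_call = (now_us() - begin) / iters;
        if (per_call < out->min_us) out->min_us = per_call;
        if (per_call > out->max_us) out->max_us = per_call;
        sum += per_call;
    }
    out->avg_us = sum / count;
}

/* Placeholder workload (hypothetical): bumps a counter. */
void demo_work(void *arg) {
    volatile uint64_t *n = (volatile uint64_t *)arg;
    (*n)++;
}
```

Swap `demo_work` for a wrapper around the function under test and print the three fields with `printf`. Taking several samples and looking at the minimum helps filter out scheduler noise, which matters when the differences being compared are around 1%.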


A quick check on i7-4600U doesn't give a really clear result:


Before:
field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us
field_mul: min 0.116us / avg 0.116us / max 0.117us
field_inverse: min 25.2us / avg 25.7us / max 28.5us
field_inverse_var: min 13.8us / avg 13.9us / max 14.0us
field_sqrt: min 24.9us / avg 25.0us / max 25.2us
ecdsa_verify: min 238us / avg 238us / max 239us

After (v1):
field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us
field_mul: min 0.117us / avg 0.117us / max 0.117us
field_inverse: min 25.4us / avg 25.5us / max 25.9us
field_inverse_var: min 13.7us / avg 13.7us / max 14.0us
field_sqrt: min 25.1us / avg 25.3us / max 26.1us
ecdsa_verify: min 237us / avg 237us / max 237us

After (v2):
field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us
field_mul: min 0.118us / avg 0.118us / max 0.119us
field_inverse: min 25.9us / avg 26.0us / max 26.4us
field_inverse_var: min 13.6us / avg 13.7us / max 13.8us
field_sqrt: min 25.6us / avg 25.9us / max 27.8us
ecdsa_verify: min 243us / avg 244us / max 246us


legendary
Activity: 2058
Merit: 1416
aka tonikt
January 14, 2017, 11:33:23 AM
#21

Quote
Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.
Which is just something you're purely making up. And it's unfortunate because if other people read it and don't know that you're extrapolating based on your imagination and fears they might not contribute where they otherwise might. That does the world a disservice.

Man, how long have I been here?

What I've said has zero to do with fears and all to do with my observations and experience.

Just go back to the top of this topic.
Guys came with some ideas to optimize the lib - maybe not the most brilliant ones, but also definitely not stupid ones...
And what was your reaction?
You basically told them off.
You do it all the time.

If there was a ranking of people scaring newcomers away from contributing to the code, you'd be on top of it.
Because you always know better. Some other 'core devs' have quite similar characters. Just a few, but they are enough to scare new people away from contributing. Especially talented, brilliant people won't be willing to put up with this shit that you guys throw at them.
legendary
Activity: 1708
Merit: 1049
January 14, 2017, 10:54:56 AM
#20
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvements (within measurement tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.

Rico


I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production level - just fooling around* with the data flow so that the data get from one end to the other faster, with a smaller code footprint. I've never had them run on anything besides my Q8200, and I'm wondering about the behavior of modern cpus. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) + these 2, and perhaps a time ./tests as a more real-world performance measure.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism down to 57.2 seconds on my underclocked Q8200 @ 1.86 GHz), although the memory version seems faster in the benchmarks. I have a theory on why the xmm version sucks in benchmarks (OS context switches being more expensive for also saving the xmm reg set?) but the bottom line is it seems faster than baseline when doing a timed test run (more real-world application)... Security-wise, I wouldn't want to leave data hanging around in the XMM registers though.

(*What I wanted to do is to reduce opcode size, instruction count and memory accesses by reducing the number of temporary variables from 3 to 2 or 1, while interleaving muls with adds).
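
For readers who find the asm hard to follow, here is a portable C sketch of the same 5x52 data flow, using `unsigned __int128` (a GCC/Clang extension). It follows the step comments in the asm (c, d, t3, t4, tx, u0, M, R) and, as far as I recall, the structure of the library's C fallback, but it is written for illustration only: the name `fe_mul_sketch` is made up, the input asserts are a simplification, and nothing here has been checked for constant-time behavior.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;

/* r = a * b mod p, with p = 2^256 - 0x1000003D1 and five 52-bit limbs. */
void fe_mul_sketch(uint64_t *r, const uint64_t *a, const uint64_t *b) {
    const uint64_t M = 0xFFFFFFFFFFFFFULL;  /* 52-bit limb mask */
    const uint64_t R = 0x1000003D10ULL;     /* 2^260 mod p */
    uint64_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4];
    uint64_t t3, t4, tx, u0;
    uint128_t c, d;

    /* Simplified magnitude check (assumption): limbs leave headroom. */
    for (int i = 0; i < 4; i++) assert((a[i] >> 56) == 0 && (b[i] >> 56) == 0);
    assert((a[4] >> 52) == 0 && (b[4] >> 52) == 0);

    /* Column 3, plus the low part of column 8 folded in via R. */
    d  = (uint128_t)a0 * b[3] + (uint128_t)a1 * b[2]
       + (uint128_t)a2 * b[1] + (uint128_t)a3 * b[0];
    c  = (uint128_t)a4 * b[4];
    d += (c & M) * R; c >>= 52;
    t3 = (uint64_t)(d & M); d >>= 52;

    /* Column 4, plus the rest of column 8. */
    d += (uint128_t)a0 * b[4] + (uint128_t)a1 * b[3] + (uint128_t)a2 * b[2]
       + (uint128_t)a3 * b[1] + (uint128_t)a4 * b[0];
    d += c * R;
    t4 = (uint64_t)(d & M); d >>= 52;
    tx = t4 >> 48; t4 &= (M >> 4);

    /* Column 0, with column 5 and the top bits of t4 folded in. */
    c  = (uint128_t)a0 * b[0];
    d += (uint128_t)a1 * b[4] + (uint128_t)a2 * b[3]
       + (uint128_t)a3 * b[2] + (uint128_t)a4 * b[1];
    u0 = (uint64_t)(d & M); d >>= 52;
    u0 = (u0 << 4) | tx;
    c += (uint128_t)u0 * (R >> 4);
    r[0] = (uint64_t)(c & M); c >>= 52;

    /* Column 1 with column 6 folded in. */
    c += (uint128_t)a0 * b[1] + (uint128_t)a1 * b[0];
    d += (uint128_t)a2 * b[4] + (uint128_t)a3 * b[3] + (uint128_t)a4 * b[2];
    c += (d & M) * R; d >>= 52;
    r[1] = (uint64_t)(c & M); c >>= 52;

    /* Column 2 with column 7 folded in. */
    c += (uint128_t)a0 * b[2] + (uint128_t)a1 * b[1] + (uint128_t)a2 * b[0];
    d += (uint128_t)a3 * b[4] + (uint128_t)a4 * b[3];
    c += (d & M) * R; d >>= 52;
    r[2] = (uint64_t)(c & M); c >>= 52;

    /* Remaining carry of d, then the saved t3 and t4. */
    c += d * R + t3;
    r[3] = (uint64_t)(c & M); c >>= 52;
    c += t4;
    r[4] = (uint64_t)c;
}
```

Each `d += ...` or `c += ...` line here corresponds to one `mulq` / `addq` / `adcq` group in the asm; the hand-written versions mainly differ in which registers hold c, d and the temporaries, and in how the loads are interleaved with the multiplies.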


Version 1 - normal/memory:

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            rcx:r8  = c (rcx holds R until c becomes a0 * b0)
 *            rsi:r9  = d (rsi holds a on entry)
 *            r10-r14 = a0-a4
 *            r15     = M (temporarily tx)
 *            rbx     = b
 *            rdi     = r
 */
  uint64_t tmp1, tmp2;
__asm__ __volatile__(
    "movq 24(%%rsi),%%r13\n"
    "movq 0(%%rbx),%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d = a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq %%rax,%%r9\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "movq $0x1000003d10,%%rcx\n"
    "movq $0xfffffffffffff,%%r15\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "mulq %%rcx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%q1\n"  
    /* d >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%rax) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (%%r15) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%q2\n"
    /*"movq %q2,%%r15\n" */
    "movq 0(%%rbx),%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%rdx); d >>= 52; reload M into %%r15 */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"

    /* u0 = (u0 << 4) | tx (%%rdx) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %q1,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%r11\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %q2,%%rsi\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "xor %%ecx,%%ecx\n"
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    /* c += d * R */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1), "=m"(tmp2)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1a;
__asm__ __volatile__(
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 became rsi*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "movq $0x1000003d10,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%r8\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%r15\n"
    "shrq $48,%%r15\n" /*Q3=R15*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%q1\n"/*Q2 OUT - renamed to q1*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq $0xfffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%r15,%%rdx\n" /*Q3 - R15 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "movq $0xfffffffffffff,%%r15\n" /*R15 back in its place*/
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %q1,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%r13\n"
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1*/
    /* c += d * R */
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif
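
The comments in the assembly above keep referring to M and R. As a minimal sketch in plain C (assuming a compiler with `unsigned __int128`, such as GCC or Clang), this is what "c += (d & M) * R" and "r[i] = c & M; c >>= 52" amount to. The helper names are made up for this illustration and do not exist in the library. R is 2^260 mod p: five 52-bit limbs cover 260 bits, and p = 2^256 - 0x1000003D1, so 2^260 folds back as 0x1000003D1 << 4.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

#define LIMB_MASK 0xFFFFFFFFFFFFFULL  /* M: the low 52 bits */
#define FOLD_R    0x1000003D10ULL     /* R: 2^260 mod p = 0x1000003D1 << 4 */

/* Hypothetical helper: "c += (d & M) * R".
 * Folds the lowest limb of the high accumulator d into the low accumulator c. */
static u128 fold_low_limb(u128 c, u128 d) {
    return c + (u128)((uint64_t)d & LIMB_MASK) * FOLD_R;
}

/* Hypothetical helper: "r[i] = c & M" followed by "c >>= 52".
 * Stores one finished limb and returns the remaining carry. */
static u128 emit_limb(u128 c, uint64_t *out) {
    *out = (uint64_t)c & LIMB_MASK;
    return c >> 52;
}
```

In the assembly the 128-bit accumulators live in register pairs, and `shrdq $52` produces the low 64 bits of the shifted value.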


Version 2 - more xmm reg use

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
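/**
 * Registers: rdx:rax = multiplication accumulator
 *            rcx:r8  = c (rcx holds b3 until b3 is parked in xmm5)
 *            rsi:r9  = d (rsi holds a until a0-a4 are loaded)
 *            r10-r14 = a0-a4
 *            rdi, rbp, rsp, rcx, rbx = b0-b4
 *            r15     = M (0xfffffffffffff), briefly tx
 *            xmm0 = t3 (q1), xmm6 = t4 (q2), xmm4 = b0, xmm5 = b3
 *            xmm1-xmm3 = saved rsp, rbp, rdi (r)
 * NOTE: rsp holds data (b2) for most of this block, so nothing may touch
 *       the stack (signal handlers included) until it is restored from xmm1.
 */
/* This has 17 memory accesses + 17 xmm uses vs. 35 memory accesses and no xmm use. */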

__asm__ __volatile__(
    "push %%rbx\n"
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq %%rdi, %%xmm3\n"
    "movq 0(%%rbx),%%rdi\n"
    "movq 8(%%rbx),%%rbp\n"
    "movq 16(%%rbx),%%rsp\n"
    "movq %%rdi,%%xmm4\n"
    
    "movq 24(%%rsi),%%r13\n"
    "movq %%rdi,%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d = a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq %%rax,%%r9\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "movq 24(%%rbx),%%rcx\n"
    "movq 32(%%rbx),%%rbx\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq $0xfffffffffffff,%%r15\n"
    "movq %%rax,%%r8\n"
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%xmm0\n"  
    /* d >>= 52 */
    "movq %%rdi,%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    /* d += c * R */
    "movq $0x1000003d10,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%%xmm6\n"
    /*"movq %q2,%%r15\n" */
    "movq %%rdi,%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rcx,%%xmm5\n"
    "movq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"

    "movq %%r15,%%rax\n"  /* Q3 (tx) transferred to %%rax */

    /* u0 = d & M (%%rdx); d >>= 52; %%r15 is reloaded with M */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"

    /* u0 = (u0 << 4) | tx (%%rdx) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M (kept in %%rdx until %%rdi = r is restored from xmm3) */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%rdx\n"
    /* c >>= 52 */
    "movq %%rdi,%%rax\n"
    "movq %%xmm3, %%rdi\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    "movq %%rdx,0(%%rdi)\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%xmm4,%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %%xmm0,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%rbx\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rbx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += d * R */
    "mulq %%rbx\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "movq %%xmm6,%%rsi\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
    "pop %%rbx\n"
: "+S"(a)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "%xmm0", "%xmm1", "%xmm2", "%xmm3", "%xmm4", "%xmm5", "%xmm6", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
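/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a, then t3
 *            rsp     = R (0x1000003d10), rbp = tx
 *            xmm0    = t4 (replaces tmp1a); xmm1, xmm2 = saved rsp, rbp
 * NOTE: rsp holds data (R) for most of this block, so nothing may touch
 *       the stack (signal handlers included) until it is restored from xmm1.
 */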
__asm__ __volatile__(
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rsp\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 OUT*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%rbp\n"
    "shrq $48,%%rbp\n" /*Q3 OUT*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%%xmm0\n"/*Q2 OUT*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq %%r15,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%rbp,%%rdx\n" /*Q3 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%xmm0,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1 RETURNS*/
    /* c += d * R */
    "mulq %%rsp\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"

: "+S"(a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "%xmm0", "%xmm1", "%xmm2", "cc", "memory");
}

#endif
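
The squaring routine above saves multiplications by doubling one operand of each cross term, as in "d = (a0*2) * a3; d += (a1*2) * a2". A minimal sketch of why that is equivalent to the general multiplication, written in plain C with `unsigned __int128` (assumed available, as in GCC/Clang). The function names are invented for this example, and limbs are assumed to be small enough (52 bits plus a few carry bits) that doubling one of them stays within 64 bits.

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Column 3 of a general product a*b:
 * a0*b3 + a1*b2 + a2*b1 + a3*b0 (four multiplications). */
static u128 mul_column3(const uint64_t a[5], const uint64_t b[5]) {
    return (u128)a[0] * b[3] + (u128)a[1] * b[2]
         + (u128)a[2] * b[1] + (u128)a[3] * b[0];
}

/* The same column when b == a: the cross terms pair up, so
 * (a0*2)*a3 + (a1*2)*a2 needs only two multiplications. */
static u128 sqr_column3(const uint64_t a[5]) {
    return (u128)(a[0] * 2) * a[3] + (u128)(a[1] * 2) * a[2];
}
```

The other columns follow the same pattern, with an undoubled middle term (a2*a2, a3*a3, ...) wherever an index pairs with itself.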

staff
Activity: 4326
Merit: 8951
January 14, 2017, 10:14:27 AM
#19
Quote
Strange. I was under the impression that

allowed me a tiny little bit of moaning.
It's a trivial optimization that already existed in three other places in the code. Thanks for noticing that it hadn't been performed there, and for providing code for one of the two places where it needed to be improved, but come on-- you're just making yourself look foolish here with the rude attitude while you're clearly fairly ignorant about what you're talking about overall. Case in point:

The endomorphism makes verification and ECDH significantly faster. It doesn't do anything else beyond additional endomorphism-related tests.

It does not make pubkey generation faster and really can't (well, with some effort it could be used to halve the size of the in-memory tables at a small performance penalty).
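
A toy illustration of the endomorphism being discussed, as a minimal sketch: on a curve y^2 = x^3 + 7 over a prime field that has a nontrivial cube root of unity beta, the map (x, y) -> (beta*x, y) sends curve points to curve points, because (beta*x)^3 = x^3. The prime 13 and beta = 3 below are toy values chosen for readability, not secp256k1 parameters, and the function names are made up for this example. How the library uses the map to shorten the scalar multiplication in verification is not shown here.

```c
#include <stdint.h>

#define TOY_P    13u  /* toy prime with p % 3 == 1; NOT the secp256k1 prime */
#define TOY_BETA 3u   /* 3^3 = 27 = 1 (mod 13): a nontrivial cube root of unity */

/* Is (x, y) an affine point on y^2 = x^3 + 7 over F_13? */
static int on_toy_curve(uint32_t x, uint32_t y) {
    return (y * y) % TOY_P == (x * x * x + 7u) % TOY_P;
}

/* Checks that (x, y) -> (beta*x, y) keeps every affine point on the curve.
 * Returns the number of points checked, or -1 if some image leaves the curve. */
static int check_toy_endomorphism(void) {
    int count = 0;
    for (uint32_t x = 0; x < TOY_P; x++) {
        for (uint32_t y = 0; y < TOY_P; y++) {
            if (!on_toy_curve(x, y)) continue;
            if (!on_toy_curve((TOY_BETA * x) % TOY_P, y)) return -1;
            count++;
        }
    }
    return count;
}
```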

It's a little absurd that you insult a ~27% performance increase to verification while bragging about an under-half-percent change to verification performance. It seems to me that you're trying to compensate for ignorance by insulting a lot; it might fool a few people who just don't know much of anything-- but not anyone else. And it prevents you from learning. I do think wanna-be applies, in spades, and if you keep with that attitude it will probably continue to do so.

Quote
It wasn't 'the collective'.
Sipa wrote the entire library, from scratch, all by himself.
That is far from an accurate history, but it doesn't matter-- Pieter did do the lion's share of the work but he didn't do it in isolation. What is less fortunate is where your imagination takes over and you write:

Quote
Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.
Which is just something you're purely making up. And it's unfortunate, because if other people read it and don't know that you're extrapolating based on your imagination and fears, they might not contribute where they otherwise might. That does the world a disservice.

Quote
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it then no document is going to help you anyway.

But what is strange is that it is _extensively_ documented.

E.g. above rico666 commented on secp256k1_fe_cmov -- even if someone were ignorant enough of the subject and the conventions used to not immediately know that this function performs a conditional move of a field element, or ignorant enough of programming to not know what a conditional move is, there is documentation (in this case apparently added by me):

Code:
/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

... even though this is purely internal code and is not accessible to an end user of the library.
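
For readers who have not met the idiom, here is a minimal sketch of a branch-free conditional move over five 64-bit limbs. The type and function names are stand-ins invented for this example and are not the library's internals; the sketch also assumes that flag is exactly 0 or 1 and that the compiler does not turn the masking back into a branch.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 5x52 field element. */
typedef struct { uint64_t n[5]; } fe5x52;

/* If flag is 1, copy *a into *r; if flag is 0, leave *r unchanged.
 * Both cases execute the same instructions, so timing does not depend on flag. */
static void fe5x52_cmov(fe5x52 *r, const fe5x52 *a, int flag) {
    uint64_t mask_keep = (uint64_t)flag - 1;  /* flag=0: all ones, flag=1: zero */
    uint64_t mask_take = ~mask_keep;          /* flag=0: zero, flag=1: all ones */
    for (int i = 0; i < 5; i++) {
        r->n[i] = (r->n[i] & mask_keep) | (a->n[i] & mask_take);
    }
}
```

The point of the masks is that no `if` on a secret-dependent flag appears anywhere in the data path.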
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 14, 2017, 03:21:41 AM
#18
Quote
There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And just moaning isn't very professional.

Strange. I was under the impression that

Code:
field_get_b32: min 0.647us / avg 0.666us / max 0.751us
field_set_b32: min 0.551us / avg 0.571us / max 0.624us

becomes

field_get_b32: min 0us / avg 0.0000000477us / max 0.000000238us
field_set_b32: min 0us / avg 0.0000000238us / max 0.000000238us

allowed me a tiny little bit of moaning.


Rico
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 13, 2017, 05:12:09 PM
#17
There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And just moaning isn't very professional.

Sipa didn't go to the openssl forum saying how shitty their implementation was - he just made a better one, to prove the point. And he proved it so well that now nobody bothers to make an effort to beat him. :)
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 13, 2017, 04:46:57 PM
#16

Quote
Sipa you say? Why did he abandon the project? Was it just some proof of work?

A proof of work that has already found so many applications, including one in your project. I guess you can call it whatever you like.

I don't know him and can't speak for him, but I wouldn't say he abandoned it. Rather, he decided it was complete enough and moved on with his life, like we do all the time. Even satoshi did that with his big project.
And you're also not going to work on a single project all your life, are you?
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 01:35:04 PM
#15
Quote
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvements (within measurement tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.


Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 01:13:06 PM
#14
Quote
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.

Depends on the coder - I guess. As long as I have interest in my hobby project, I want it to be perfect. I for one have no problem admitting that my LBC project still sucks badly in many places right now. I intend to improve documentation, ease of use, and speed. One goal is to get decent EC performance on a GPU - that's why I am looking into this at all.

Quote
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it then no document is going to help you anyway.

Well - for me the names are pretty bad. "secp256k1_fe_cmov" *thumbs up* But it's not only that. The data structures are pretty sideways too. Actually I understand the math pretty well, that's why I am so puzzled about what I see - still unsure if the lib is seriously doing what I think it's doing. I don't think the person who wrote this really cared about performance - he probably just wanted something that sucked less.

Sipa you say? Why did he abandon the project? Was it just some proof of work?


Rico
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 13, 2017, 08:34:34 AM
#13
It wasn't 'the collective'.
Sipa wrote the entire library, from scratch, all by himself.
The guys just took his code, adding some pretty useless checks and a heavy build system around it, and now it's 'officially' hosted at bitcoin/secp256k1 as the 'community project'. Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.

But if you check the history of sipa/secp256k1 you can see that it used to be quite easy to commit optimizations into that code.

It was all done by one person as his personal, partially experimental project and I personally admire the work.
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions? The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it then no document is going to help you anyway.
legendary
Activity: 1708
Merit: 1049
January 13, 2017, 07:30:47 AM
#12
Quote
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=skylake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=skylake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=skylake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico


There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 06:54:48 AM
#11
Quote
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=skylake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=skylake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=skylake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).
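
To put the six runs side by side, the relative differences can be computed from the "user" times. This is only a rough sketch: the array and function names are made up for illustration, and each figure is a single run of the test binary, so the percentages are indicative at best.

```c
#include <assert.h>

/* "user" times in seconds for runs 1..6, copied from the list above. */
static const double user_s[6] = { 14.357, 13.547, 15.660, 16.137, 14.847, 14.517 };

/* Percentage by which t is slower (+) or faster (-) than base. */
static double pct_vs(double base, double t) {
    assert(base > 0.0);
    return (t - base) / base * 100.0;
}

/* Cost of USE_ENDOMORPHISM at -O3 -march=skylake (run 3 vs run 2). */
static double endo_cost_o3(void) { return pct_vs(user_s[1], user_s[2]); }

/* Cost of USE_ENDOMORPHISM at -g -O2 (run 4 vs run 1). */
static double endo_cost_o2(void) { return pct_vs(user_s[0], user_s[3]); }

/* Cost of dropping the x86_64 asm at -O3 (run 6 vs run 2). */
static double noasm_cost_o3(void) { return pct_vs(user_s[1], user_s[5]); }

/* Effect of -O3 -march=skylake over -g -O2 (run 2 vs run 1). */
static double o3_gain(void) { return pct_vs(user_s[0], user_s[1]); }
```

With the numbers above, endomorphism comes out roughly 16% slower at -O3 and roughly 12% slower at -O2, dropping the asm costs roughly 7% at -O3, and -O3 -march=skylake is roughly 6% faster than -g -O2.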

Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 04:55:40 AM
#10
Can you show another library for the same application within a factor of _five_ of the performance?  Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Good. We seem to live in different worlds then, each other's world seeming absurd to the other one.

I for one, find your "argument" of requesting to be shown another lib doing the same thing with 5 perf 1/5 docs quite absurd. It's like claiming that anything in the world by definition cannot suck when there is nothing else that sucks less. Really absurd.

Quote
You might have noticed that the constant unpacking macro does effectively the same thing: https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52.h#L22 but a word at a time.

No I haven't, because reading the secp256k1 code is a major PITA, but thanks for the pointer. That wheel would probably have been the next thing I'd have reinvented.
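
For readers who have not opened that header: "a word at a time" means building the five 52-bit limbs from eight 32-bit words instead of from 32 single bytes. The following is a sketch of that idea only; the struct and function names are invented here and this is not the library's macro.

```c
#include <assert.h>
#include <stdint.h>

/* Five 52-bit limbs, least significant first; n[4] only holds 48 bits. */
typedef struct { uint64_t n[5]; } fe5x52;

/* Pack eight 32-bit words (d7 most significant, d0 least significant)
 * into five 52-bit limbs. Hypothetical helper for illustration. */
static fe5x52 fe_from_words(uint32_t d7, uint32_t d6, uint32_t d5, uint32_t d4,
                            uint32_t d3, uint32_t d2, uint32_t d1, uint32_t d0) {
    fe5x52 r;
    r.n[0] = (uint64_t)d0 | ((uint64_t)(d1 & 0xFFFFFu) << 32);
    r.n[1] = ((uint64_t)d1 >> 20) | ((uint64_t)d2 << 12)
           | ((uint64_t)(d3 & 0xFFu) << 44);
    r.n[2] = ((uint64_t)d3 >> 8) | ((uint64_t)(d4 & 0xFFFFFFFu) << 24);
    r.n[3] = ((uint64_t)d4 >> 28) | ((uint64_t)d5 << 4)
           | ((uint64_t)(d6 & 0xFFFFu) << 36);
    r.n[4] = ((uint64_t)d6 >> 16) | ((uint64_t)d7 << 16);
    return r;
}

/* Sanity check: the value 2^52 must land in bit 0 of limb 1. */
static void fe_from_words_check(void) {
    fe5x52 r = fe_from_words(0, 0, 0, 0, 0, 0, 1u << 20, 0);
    assert(r.n[0] == 0 && r.n[1] == 1);
}
```

Each limb takes whole words where it can and masks off the partial word at either end, so the packing needs five expressions rather than one per byte.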

Quote
If you open a PR on the function you should add it to the bench_internal benchmarks as you go.

What's a PR? Press release? Ah, pull request, I suppose. You're assuming too much. E.g. that the whole world runs on - or is - git.

Quote
Quote
Is there any place where R&D discussion about further development is ongoing? Before I'd start re-implementing that mess from scratch I'd prefer to participate in some "official endeavor".

same place it's always been https://github.com/bitcoin-core/secp256k1/

If that is the "discussion" place, no wonder I didn't see it. Srsly?

Quote
Disinterest in spoon feeding people who sound like wanna-be thieves whose thefts would scare ordinary people away from Bitcoin and who can't even be bothered to RTFM (there has been fairly detailed documentation for the library for years) is not evidence of a lack of interest in further development.

My dear young gmaxwell: Evidently, after you did the diligent work opening a PR  Smiley, benchmarked the code in the process, reformatting it for the worse, but reordering it for the better, found out and stated that my code is about 6-7 orders of magnitude faster than the original code (and I am not primarily a C hacker) ... you have the guts to use the term "wanna-be" when addressing any of your texts towards me?

Let alone the fact that there was such a tremendously suboptimal code for such a long time should teach you something. Let me assure you, your perception of the situation here is skewed at best. From what I see in the secp256k1 lib, the collective who did the job are good programmers with potential. Motivated, young, inexperienced, but potential.

If it wasn't for the LBC hobby project of mine, I would never have looked into the mess that is the secp256k1 library. I did and I commented. You don't like the comment, maybe feel it is insulting, puzzling or absurd. Fine. I have high hopes that before you are in your mid-40s you will understand what my comment was about. As I said: "potential".

Rationalizing the poor state of an open source project is something I have come across in the Linux world since the 90s. To me, your statements are neither new nor original. So should you - some day - come to the conclusion that you could attract a certain kind of programmer when the open source software reaches a certain level of quality, it'd be swell. Let me tell you it's not about "spoon feeding people who sound like wanna-be thieves". It's about preparing the ground for people who may have twice your experience and otherwise scoff at the project.

No offence - ofc.

Rico
staff
Activity: 4326
Merit: 8951
January 09, 2017, 06:50:01 AM
#9
I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

Can you show another library for the same application within a factor of _five_ of the performance?  Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, about the code used by the core,  he'll probably tell you that it's being maintained by other people now.
Uh, what?


In 5x52 asm I found 2-3% by simply doing what the cpu scheduler should be doing and issuing together the adds + muls (the cpu integer unit typically has one add and one mul unit and ideally we want to be using them at the same time).
Awesome! PRs welcome!

One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
It is potentially patent encumbered for another couple years, restricting it to experimental use.

Quote
So the adcq+mulq are issued together to the mul and add units of the cpu respectively. I'm baffled on why the cpu scheduler wasn't already doing this but then again I do have an older cpu (core2 quad / 45nm) to play with - it might not be an issue with modern ones.
Generally newer cpus work better, but the performance is also more important on older (and in-order cpus like the atoms) since they're slower to begin with.
legendary
Activity: 2058
Merit: 1416
aka tonikt
January 08, 2017, 08:14:55 AM
#8
The original secp256k1 lib is a great piece of work,  but it surely can be improved even more.

Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, about the code used by the core,  he'll probably tell you that it's being maintained by other people now.

Otherwise, just publish your improvements  here - even if not Bitcoin Core,  someone else will use them, trust me,  because time is money Smiley
legendary
Activity: 1708
Merit: 1049
January 08, 2017, 06:41:37 AM
#7
I am also interested in a faster secp256k1.

I've been tampering with the asm a bit and had some luck, especially in scalar performance, by merging the various stages into one function. Note, however, that it's gcc only: clang had some issues with rsi/rdi use in the manually inlined function, and to work around it one has to move rsi or rdi to some xmm register and restore it before the function ends, which adds a few cycles.

https://github.com/Alex-GR/secp256k1/blob/master/src/scalar_4x64_impl.h
https://github.com/Alex-GR/secp256k1/blob/master/src/field_5x52_asm_impl.h

In 5x52 asm I found 2-3% by simply doing what the cpu scheduler should be doing and issuing together the adds + muls (the cpu integer unit typically has one add and one mul unit and ideally we want to be using them at the same time).

For example:

    /* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "movq 16(%%rbx),%%rax\n"   /* <=== this must be moved upwards */
    "mulq %%r12\n"

becomes:

    /* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"

So the adcq+mulq are issued together to the add and mul units of the CPU respectively. (Edit: apparently the mul+add units are in the SIMD unit; the integer part just has 3 integer units waiting to do parallel work of any kind, except load/store, which is 2 at a time.) I'm baffled as to why the CPU scheduler wasn't already doing this, but then again I have an older CPU (Core 2 Quad / 45nm) to play with; it might not be an issue with modern ones.
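
As for why moving the load is harmless: after the addq the old value of rax is dead, a movq does not touch the flags, and the adcq only reads rdx and the carry. A small C model of one "d += a * b" step may make that easier to see. It is a sketch with invented names, it relies on the GCC/Clang `unsigned __int128` extension, and it only models the arithmetic, not the scheduling.

```c
#include <assert.h>
#include <stdint.h>

typedef unsigned __int128 uint128_t;   /* GCC/Clang extension, not ISO C */

/* 128-bit accumulator kept as two 64-bit halves, like the rcx/r15 pair. */
typedef struct { uint64_t lo, hi; } acc128;

/* d += a * b, written the way the asm sequence does it. */
static void acc_muladd(acc128 *d, uint64_t a, uint64_t b) {
    uint128_t p = (uint128_t)a * b;     /* mulq: rdx:rax = a * b            */
    uint64_t rax = (uint64_t)p;
    uint64_t rdx = (uint64_t)(p >> 64);
    uint64_t old = d->lo;
    d->lo += rax;                       /* addq %rax,%rcx (produces carry)  */
    /* rax is no longer needed here, so the next operand load can be
     * placed at this point without changing the result. */
    d->hi += rdx + (d->lo < old);       /* adcq %rdx,%r15 (consumes carry)  */
}

/* Recombine the halves into one 128-bit value. */
static uint128_t acc_value(const acc128 *d) {
    return ((uint128_t)d->hi << 64) | d->lo;
}

/* Cross-check against a plain 128-bit add on operands that force a carry. */
static void acc_selfcheck(void) {
    acc128 d = { UINT64_MAX, 0 };
    uint128_t ref = (uint128_t)UINT64_MAX + (uint128_t)UINT64_MAX * UINT64_MAX;
    acc_muladd(&d, UINT64_MAX, UINT64_MAX);
    assert(acc_value(&d) == ref);
}
```

Whether the reordering actually buys anything depends on the CPU's out-of-order engine; the model only shows that it cannot change the value of d.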

Also, of the three temp variables used for temporary storage in the field asm, some can be eliminated by rearranging the code a bit, which saves a couple of memory accesses.

(ps. I don't claim the code is safe - I've only used it for benchmarks and the builtin test)

One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 08, 2017, 05:22:21 AM
#6
I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

I have done some hacking to some pretty basic functions in there

https://bitcointalksearch.org/topic/m.17365068

- which alone made the LBC generator about 10% faster overall -
and later also changing the field_5x52 code in secp256k1_fe_set_b32 to

Code:
    r->n[0] = (uint64_t)a[31]
            | (uint64_t)a[30] << 8
            | (uint64_t)a[29] << 16
            | (uint64_t)a[28] << 24
            | (uint64_t)a[27] << 32
            | (uint64_t)a[26] << 40
            | (uint64_t)(a[25] & 0xF)  << 48;

    r->n[1] = (uint64_t)((a[25] >> 4) & 0xF)
            | (uint64_t)a[24] << 4
            | (uint64_t)a[23] << 12
            | (uint64_t)a[22] << 20
            | (uint64_t)a[21] << 28
            | (uint64_t)a[20] << 36
            | (uint64_t)a[19] << 44;

    r->n[2] = (uint64_t)a[18]
            | (uint64_t)a[17] << 8
            | (uint64_t)a[16] << 16
            | (uint64_t)a[15] << 24
            | (uint64_t)a[14] << 32
            | (uint64_t)a[13] << 40
            | (uint64_t)(a[12] & 0xF) << 48;

    r->n[3] = (uint64_t)((a[12] >> 4) & 0xF)
            | (uint64_t)a[11] << 4
            | (uint64_t)a[10] << 12
            | (uint64_t)a[9]  << 20
            | (uint64_t)a[8]  << 28
            | (uint64_t)a[7]  << 36
            | (uint64_t)a[6]  << 44;

    r->n[4] = (uint64_t)a[5]
            | (uint64_t)a[4] << 8
            | (uint64_t)a[3] << 16
            | (uint64_t)a[2] << 24
            | (uint64_t)a[1] << 32
            | (uint64_t)a[0] << 40;
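
A cheap way to gain confidence in an unrolled rewrite like this is to compare it against a plain nibble-by-nibble loop. The sketch below is standalone: the type name `fe5x52`, the function names and the reference loop are made up for illustration and are not taken from the library.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Five 52-bit limbs, least significant first; n[4] only holds 48 bits. */
typedef struct { uint64_t n[5]; } fe5x52;

/* Reference loop (illustrative): place one 4-bit nibble at a time.
 * 52 is a multiple of 4, so a nibble never straddles two limbs. */
static void fe_set_b32_ref(fe5x52 *r, const unsigned char *a) {
    int i, j;
    memset(r, 0, sizeof(*r));
    for (i = 0; i < 32; i++) {            /* byte index, counted from the LSB */
        for (j = 0; j < 2; j++) {         /* 0 = low nibble, 1 = high nibble  */
            int bit = 8 * i + 4 * j;      /* absolute bit position            */
            r->n[bit / 52] |= (uint64_t)((a[31 - i] >> (4 * j)) & 0xF) << (bit % 52);
        }
    }
}

/* Unrolled variant, using the expressions from the post above. */
static void fe_set_b32_unrolled(fe5x52 *r, const unsigned char *a) {
    r->n[0] = (uint64_t)a[31] | (uint64_t)a[30] << 8 | (uint64_t)a[29] << 16
            | (uint64_t)a[28] << 24 | (uint64_t)a[27] << 32 | (uint64_t)a[26] << 40
            | (uint64_t)(a[25] & 0xF) << 48;
    r->n[1] = (uint64_t)((a[25] >> 4) & 0xF) | (uint64_t)a[24] << 4
            | (uint64_t)a[23] << 12 | (uint64_t)a[22] << 20 | (uint64_t)a[21] << 28
            | (uint64_t)a[20] << 36 | (uint64_t)a[19] << 44;
    r->n[2] = (uint64_t)a[18] | (uint64_t)a[17] << 8 | (uint64_t)a[16] << 16
            | (uint64_t)a[15] << 24 | (uint64_t)a[14] << 32 | (uint64_t)a[13] << 40
            | (uint64_t)(a[12] & 0xF) << 48;
    r->n[3] = (uint64_t)((a[12] >> 4) & 0xF) | (uint64_t)a[11] << 4
            | (uint64_t)a[10] << 12 | (uint64_t)a[9] << 20 | (uint64_t)a[8] << 28
            | (uint64_t)a[7] << 36 | (uint64_t)a[6] << 44;
    r->n[4] = (uint64_t)a[5] | (uint64_t)a[4] << 8 | (uint64_t)a[3] << 16
            | (uint64_t)a[2] << 24 | (uint64_t)a[1] << 32 | (uint64_t)a[0] << 40;
}

/* Returns 1 if both packers produce identical limbs for the 32-byte input. */
static int fe_packers_agree(const unsigned char *a) {
    fe5x52 x, y;
    assert(a != NULL);
    fe_set_b32_ref(&x, a);
    fe_set_b32_unrolled(&y, a);
    return memcmp(x.n, y.n, sizeof(x.n)) == 0;
}
```

Running `fe_packers_agree` over a few thousand random 32-byte inputs would be a reasonable smoke test before timing anything.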

I have been pointed to https://github.com/llamasoft/secp256k1_fast_unsafe but am not sure if that is still maintained/developed.
Is there any place where R&D discussion about further development is ongoing? Before I'd start reimplementing that mess from scratch I'd prefer to participate in some "official endeavor".

However, I have little hope that will make sense:

Quote
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.

Together with other statements from gmaxwell @ github ("this is alpha, don't expect other docs than source.." - something like that, from memory), I see there seems to be not much motivation for further development from the "official side".


Rico
staff
Activity: 4326
Merit: 8951
January 06, 2017, 06:36:11 AM
#5
I thought signature verification was the biggest limiting factor in IBD process (obviously with slow CPU). 
On arm, which doesn't yet use the libsecp256k1 assembly optimizations by default-- perhaps.  Otherwise, no, not since libsecp256k1 made signature validation many times faster. Maybe on some really unbalanced system with a slow single core cpu and huge dbcache and fast network and SSD it might be a majority.
legendary
Activity: 1948
Merit: 2097
January 06, 2017, 06:27:40 AM
#4
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.

I thought signature verification was the biggest limiting factor in IBD process (obviously with slow CPU). 
staff
Activity: 4326
Merit: 8951
January 06, 2017, 04:41:44 AM
#3
The content in your first link would likely not be helpful.

The content in your second link-- the  would be beneficial, though potentially not by much because the dependency chain in the arithmetic is hand constructed to reduce conflicts.  If someone would like to try them out, it shouldn't be very hard.

In the context of Bitcoin development (thus achows' response), signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.
staff
Activity: 3458
Merit: 6793
Just writing some code
January 05, 2017, 06:41:23 PM
#2
Probably not.