
Topic: secp256k1 library and Intel cpu - page 2.

legendary
Activity: 2053
Merit: 1356
aka tonikt
January 14, 2017, 01:37:18 PM
#25
I'm just explaining to you how you are being perceived.
Whether your intention was not to tell the guy off, but only to encourage him into a more productive way of thinking - that's a different story.

And no, I'm not going to dig through IRC logs from several years ago just to prove to myself that I haven't dreamed up something that actually happened. Because, believe it or not, I seriously don't give a fuck whether you believe me :)

Why don't you want to discuss the protocol changes needed for bootstrapping clients with utxo snapshots?
Because you obviously don't.
Please explain to me: if it isn't about your ego, then what is it about?
staff
Activity: 4284
Merit: 8808
January 14, 2017, 01:12:24 PM
#24
Repeating the same unsubstantiated claim doesn't make it substantiated. You just made a claim about this very thread - and yet when we look, there's nothing behind it.

Quote
Just me, years ago, trying to talk on #bitcoin-dev about ways to speed up block download times. That was a no-no - block propagation didn't need improving because you didn't think it was important... It almost got me banned from the channel. And yet here we are in 2016, with a new brilliant feature that the bitcoin core team is so proud of: compact fucking blocks!

The logs are all public, please point us to this conversation: http://bitcoinstats.com/irc/bitcoin-dev/logs/2011/01   If you can recall a name/phrase from the discussion, I will happily search for it.

Edit: Reading all of your logs in that channel took only a couple of minutes; here is the conversation (trimming irrelevant side discussions):


10:02 < tonikt> Hi guys. I was wondering whether there have been any discussion on changes in the network protocol?
10:03 < tonikt> Like being able to download a block in fragments
10:03 < grau> tonkit: there is already bloom filter download for blocks
10:03 < tonikt> Or: see a transaction size, before downloading it
10:03 < sipa> tonikt: BIP37 allows that, in a way
10:03 < tonikt> grau: bloom filter is not good for full node
10:04 < grau> block fragments are no good for full node either
10:04 < tonikt> grau: why not?
10:04 < sipa> you can have two complementary bloom filters set on two connections
10:04 < sipa> and download half blocks from each
10:04 < tonikt> so the answer is 'no'?
10:05 < gmaxwell> tonikt: Not a very worthwhile thing in my opinion, and as sipa points out its already possible.
10:05 < grau> tonkit: a fragment can not prove that funds are not spent in the other part
10:05 < sipa> the answer is 'partially', BIP37 supports (some way) of downloading blocks in fragments
10:05 < tonikt> I believe the network could really use being able to download one block from diffetent peers in i.e. 32KB chunks
10:05 < sipa> that'd only slow things down imho
10:05 < tonikt> gmaxwell: but I was checking these bloom filters and it wasn't possible to download a block like that
10:05 < sipa> and there is no way to prove that a transaction has a certain size without actually sending that transactions
10:06 < sipa> tonikt: it is
10:06 < gmaxwell> tonikt: it is, but as sipa and I are telling you— it's not obviously useful.

10:06 < grau> tonkit: it is already allowed by the protocol to use several peers for download. BOP actually does that by block not by block fragment
10:06 < tonikt> sipa: yes - it isnt possible to confirm it, but if you see a mismatch you can at least ban the peer
10:06 < sipa> tonikt: set nHashFunctions=1, and use random complementary bits in the filter
10:06 < tonikt> grau: I al talking about a fully synchronized node
10:06 < gmaxwell> tonikt: what are you talking about?
10:07 < tonikt> you suddenly get an INV with a new block - you want to download it ASAP
10:07 < sipa> what is the actual problem you're trying to solve?
10:07 < tonikt> ... so why not to split the work into parts and downlaod it in paralell?
10:07 < sipa> we've just told you how to do that using BIP37

10:07 < tonikt> wait, I cannot read that BIP37 cause wiki is broken
10:08 < gmaxwell> Because doing so will _slow_ transmission except to the extent that it gets you an unfair share of the channel capacity.
10:08 < tonikt> maybe you are talking about a different thing :)
10:08 < sipa> 19:06:43 < sipa> tonikt: set nHashFunctions=1, and use random complementary bits in the filter
10:08 < sipa> tonikt: in practice it'd be very hard to coordinate that, but if block sizes would grow a lot, that may be a viable strategy
10:09 < tonikt> just to be clear: BIP37 was about downloading the header and the tx hashes, followed by the actual transactions?
10:09 < sipa> BIP37 = bloom filtering
10:09 < tonikt> yes
10:09 < tonikt> so how does it help me to split a block download amoung my 50 peers?
10:09 < sipa> so you give node A a random filter that selects 50% of the transactions
10:09 < sipa> and you give node B a complementary filter that selects the other 50% of the transactions
10:10 < tonikt> yes, but I have more than 2 peers.
10:10 < sipa> ok, then give node A a filter that selects 33% of the transactions
10:10 < tonikt> your solution seems more like a work around
10:10 < sipa> it's a very neat solution, as you don't need to keep track of what to download from whom
10:10 < tonikt> ok, sorry. let me ask a question then
10:10 < sipa> they figure it out themself
10:11 < grau> tonkit: and you seem to work around a problem not really there until blocks are the size we know.
10:11 < gmaxwell> tonikt: you're just going to end up in N connections all in slow start. Plus users setting your house on fire because you use _all_ their bandwidth in a burst once the connections come out of slowstart.
10:11 < sipa> maybe this will become an actual problem if the block size limit is increased
10:11 < sipa> if it is, we'll deal with it
10:11 < tonikt> why cant there be like a command "getsomething" that would return me the length of that something, plus a list of hashes of its data split into i.e. 4KB chunks
10:11 < tonikt> ... and that would be same for txs and blocks
10:12 < gmaxwell> sipa: sure if there are larger blocks then eventually at some point it makes sense. A few hundred K is really sketchy for any benefit there.
10:12 < tonikt> gmaxwell: no - now I use their bandwidth in a burst, asking each peer for entire block
10:12 < tonikt> I'm talking about a situation when we have a new block mined
10:12 < tonikt> which is every 10 minutes
10:12 < gmaxwell> You don't ask each peer for the entire block.
10:13 < tonikt> well, you should if you care to have it ASAP :)
10:13 < tonikt> I do :)
10:13 < gmaxwell> You ask a single peer for the entire block, or as sipa points out multiple peers for mutually exclusive subsets.
10:13 < tonikt> yes, I understand that is the status
10:13 < gmaxwell> tonikt: then you're abusing the network.
10:13 < gmaxwell> and also hurting your own performance.
10:13 < tonikt> but I was talking about a possible improvement
10:13 < tonikt> that is how bittorrent works
10:14 < tonikt> yes, call me an abuser :)
10:14 < sipa> do you use bittorrent for 1 MB data?
10:14 < tonikt> sometimes
10:14 < gmaxwell> sipa: normally torrents use 4-16 mb chunks.
10:14 < sipa> gmaxwell: i know
10:14 < tonikt> no
10:14 < gmaxwell> tonikt: what you're describing there sucks, because you can't tell who's screwed you and given you invalid data. With what sipa told you to do you can tell.
10:14 < sipa> it can use smaller chunks
10:14 < tonikt> I'm sure they use smaller ones - like 32KB
10:15 < sipa> gmaxwell said "normally"
10:15 < gmaxwell> tonikt: no, yes— it can, but thats not what actually gets used normally (except for tiny files, which torrent takes much longer to transfer than http)
10:15 < tonikt> OK, I get it - someone can send me a corrupt list of block's chunks hashes
10:15 < tonikt> but then...
10:15 < sipa> this is an engineering question, and it depends on very specific data (latency, bandwidth, variation, distribution)
10:16 < sipa> once it becomes a problem
10:16 < sipa> we can find an appropriate solution
10:16 < sipa> and there are several ways to deal with it

10:16 < tonikt> why don't you add a protocol command "give me this tx from this block?"
10:16 < sipa> because that would be absolutely horrible for the peers
10:16 < sipa> they need to look up the block for you, but only give you a small piece of it

10:16 < tonikt> sipa: with all due respect, but for you to find a solution, one needs to wait 2 years in average Smiley
10:17 < tonikt> It is actually quite easy to calculate
10:17 < tonikt> how much time you need to download lets say 1MB block
10:17 < tonikt> having 1Mbps connection - about 10 seconds
10:17 < sipa> right now the problem simply doesn't exist with 1 MB blocks, unless on very slow links (mobile?) where you don't want full blocks anyway
10:18 < tonikt> 10 senconds from peer to peer (and that's not counting checking it)
10:18 < tonikt> dont you think it is already enough to try decreasing it be a few folds?
10:20 < gmaxwell> tonikt: go convince bittorrent to never use chunks larger than 100k and come back. :P
10:20 < tonikt> I believe all the API is there already, except that it should be able to download a tx that is already mined, though by specifying a block where it is
10:20 < sipa> if you're on a 1 Mbps connection, there's no way you'll get it faster than in 10s anyway
10:20 < tonikt> gmaxwell: but bittorrent does not care about latency
10:20 < tonikt> while bitcoins hould
10:21 < sipa> if it's your peer that is limited to 1Mbps, but your connection is faster, you won't be downloading from him (as he'll be slower to announce it to you anyway)
10:21 < tonikt> ok, so you dont want to change the net protocol, before it becoming a problem
10:21 < gmaxwell> tonikt: generally parallel fetching is not great for latency.
10:21 < tonikt> fine
10:21 < gmaxwell> As you end up waiting on the most latent response before you can validate any of it.
10:22 < sipa> tonikt: i'm against changing the protocol before having clear information about the benefits
10:22 < sipa> and no, i don't think right now there is much that can be improved
10:22 < tonikt> gmaxwell: think if I could ask a node for a block and all its transaction hashes - and other nodes for transactions
10:22 < gmaxwell> tonikt: no, we're also saying that for the current maximum blocksize what you're suggesting is likely to _hurt_. Without a bunch of analysis it would be hard to know.
10:22 < sipa> in the future that can certainly change

10:22 < tonikt> I just dont see a problem with adding a block hash to inv while asking for tx data
10:23 < tonikt> it doesn't seem like a development challenge
10:23 < sipa> it's not
10:23 < tonikt> and it could solve the problem
10:23 < sipa> it's a maintainance overhead
10:23 < sipa> and a compatibility burden
10:23 < gmaxwell> We could also throw some virgins into volcanos, that might "solve the problem"
10:23 < gmaxwell> (not that you've established that there is even a problem to be solved)
10:24 < tonikt> ok, I get it - you don't see a problem
10:24 < sipa> *yet*
10:24 < tonikt> you are obviously not so much a perfectionists, as I am :P
10:24 < sipa> well we're not dealing with a nicely theoretical problem where the optimal soluton is obvious
10:24 < sipa> everything has downsides
10:24 < gmaxwell> tonikt: I don't see evidence of perfectionism in you in this discussion. If there were you'd be doing some careful analysis to establish some tests to determine the level of improvement possible.
10:25 < gmaxwell> Instead you're just shooting from the hip with a blind guess at something that would maybe help or hurt a problem which may exist in the future.
10:25 < tonikt> gmaxwell: all I can tell you is that, if I want to download a block ASAP, I ask each of my peers for the entire one - is this perfect for you?
10:26 < gmaxwell> tonikt: that won't actually fetch you the block faster than asking a single peer in many cases.

10:26 < tonikt> if I could download it in parts/transactions - that would be perfect, unless I'd screw up my implementation
10:26 < gmaxwell> No, in fact it wouldn't be.
10:26 < tonikt> :) no, it would
10:27 < sipa> you'd still have to wait for the slowest one to respond
10:27 < gmaxwell> tonikt: or look at it another way— why not ask them each for one bit of it?
10:27 < gmaxwell> as sipa says, you have to wait for the slowest response.
10:27 < sipa> if the only constraint is bandwidth, and processing speed and latency don't exist, your solution is optimal
10:27 < gmaxwell> sipa: and overhead doesn't exist.
10:27 < tonikt> guys, do I really need to explain you how to implement it?
10:27 < sipa> and attackers
10:28 < tonikt> you just have a list of txs to download and your peers - whenever any of them is busy, you ask him for the next ts
10:28 < gmaxwell> Right, so in a world where the only constrain is the remote peers bandwidth, and process speed, latency, attackers, and overhead don't exist— then indeed, thats optimal.
10:28 < tonikt> it must be faster
10:28 < sipa> how do you know he's busy?
10:28 < sipa> by waiting?
10:28 < tonikt> i know he is busy because he has not responded to my previous data request yet
10:29 < gmaxwell> It's unknowable without waiting because of latency.
10:29 < tonikt> so it does not make any sense to ask him for more data
10:29 < gmaxwell> tonikt: if he's 80ms away he _cannot_ answer faster than 80ms.
10:29 < tonikt> sure - but you have 1000 txs
10:29 < sipa> and you need all of them
10:29 < tonikt> eventually he will answer and you will have 900+ left anyway
10:29 < gmaxwell> so you're going to fetch them one at a time without pipelining? lol good luck with that.
10:29 < tonikt> and then you will ask him for the 913th one
10:29 < bmcgee> … I get the impression now's not a good time for asking potentially stupid questions …
10:30 < tonikt> gmaxwell: it's at least 200+ bytes
10:30 < gmaxwell> bmcgee: well, your question doesn't have a neat answer.
10:30 < sipa> tonikt: that is all very sensible, given the right bandwidth/latency tradeoffs
10:30 < tonikt> but youre right, it would be better to ask for several tx at the same time
10:30 < sipa> tonikt: it's something you certainly want to do at the ~megabyte level
10:30 < tonikt> except that some of thme may be 100kb big
10:30 < gmaxwell> tonikt: lol. just the TCP overhead from the request is going to instantly give you 50% overhead on 200 bytes.
10:30 < sipa> (and we don't by the way, so let's fix that first)
10:31 < tonikt> gmaxwell: now it gives you much more than 50%
10:31 < tonikt> most of the txs you have already anyway
10:31 < gmaxwell> tonikt: no, it doesn't the overhead on transfering a block is about 2%.
10:31 < sipa> tonikt: again BIP37 to the rescue (it doesn't send transactions it knows you already have, without extra latency)
10:32 < gmaxwell> okay, ignoring that. But as sipa says bip37 takes care of that.
10:33 < tonikt> gmaxwell: but using bip37 is working around a solution. if I need this tx from this block - why do I need to bother with bloom filters and statistics?
10:33 < sipa> tonikt: because it allows you to do things without extra latency
10:33 < sipa> tonikt: you don't have to be told about the list of transactions first, and you don't have to reply with which transactions you want
10:34 < gmaxwell> tonikt: statistics?! you set a series of complemetary bits. and don't have to taken another 80ms of round trip time to send extra tiny requests with redundant hashes.
10:34 < tonikt> sipa: but the biggest latency comes not from the ping - it somes from the data that needs to be finished, before they can be used
10:34 < gmaxwell> @#@*$(@#
10:34 < sipa> tonikt: on a 10 Mbit/s link, a not outrageous 200ms ping time (meaning 400ms extra for a roundtrip) means 400 kilobyte that could have been downloaded while they just waited for you to answer
10:34 < gmaxwell> tonikt: go look at the bandwidth numbers you gave before! on a 1mbit connection the data to transfer a fee hundred bytes is way less than typical latency.
10:35 < gmaxwell> er the time to transfer.
10:35 < sipa> that's more than an average block
10:36 < tonikt> ok guys, whatever. I see you have your world and don't really want to notice mine. I guess I will have to wait for it to become a problem :)

10:36 < tonikt> but if I might add something, not as a question, but as a proposal
10:37 < gmaxwell> Yes, my world has latency in it. Not sure where you get one that doesn't, but I'd like one of those. :P
10:37 < tonikt> 1) allow to ask for a size of a transaction/block before downloading it (so you can ban anyone who is trying to send you more)
10:37 < sipa> tonikt: what if the peer lies?
10:38 < gmaxwell> sipa: there are no attackers in tonikt's world.
10:38 < tonikt> 2) imagine that you are connected to 30 peers and a new 1MB block have just been mined: what is the fastest way to download it from your peers?
10:38 < sipa> depends on your bandwidth and latency
10:38 < gmaxwell> tonikt: if you think you can do better, just write a second transfer protocol. If it's better it should be easy to demonstrate. You'll probably learn something in the process.

10:38 < tonikt> sipa: if the peer lies, than you will find it out and ban it
10:38 < sipa> with very low bandwidth and low latency at the same time, downloading in parallel will certainly be faster
10:39 < tonikt> gmaxwell: believe me, I can write a protocol, but it would be quite silly to test it in my home network
10:39 < gmaxwell> tonikt: then use a network simulator.
10:39 < gmaxwell> It's pretty straight forward to simulate actual network behavior.
10:39 < sipa> tonikt: well to deploy it, we'd first need a way to download *anything* in parallel first

10:40 < tonikt> gmaxwell: but I dont need to simulate it to know that downloadin a block in parts from several peers at the same time will be faster
10:40 < gmaxwell> tonikt: But you're incorrect. The way you're describing that involves lots of round trips will actually be _slower_ in a case where there is considerable latency.
10:41 < gmaxwell> The exact balance depends on a number of factors.
10:41 < gmaxwell> Basically you never want to make a request that is smaller than the bandwidth delay product.

10:41 < tonikt> gmaxwell: no, becasue shorter transactions could go in bulk (like 20 * 200+ bytes)
10:41 < tonikt> .. thats why you need to have a way to find out tx size before downloadin it
10:42 < sipa> and just finding out that size means an extra round-trip, which (in some cases) may be slower than just downloading the whole block
10:42 < gmaxwell> Now you're transmitting a bunch of data to make those decisions, and then you can do nothing until it shows up. During that time you could have sent a whole block.
10:43 < tonikt> I think we should be talking numbers here, otherwise it's just baseless accusations
10:43 < gmaxwell> There are numbers above.
10:44 < tonikt> Can we agree that an average node would have 1mbps upload speed?
10:44 < tonikt> download is bigger - I know
10:44 < gmaxwell> Then when you get bad data it's impossible to tell who is giving you bad data until you have the whole block... which means that you have to fetch it from one peer if some peers is giving you bad data.. pretty cheap dos attack.

10:45 < tonikt> gmaxwell: of course you can say who sent you bad data, because you ask for transactions, which hashes you know
10:45 < gmaxwell> tonikt: e.g. I give you the wrong hashes for the block.
10:45 < tonikt> gmaxwell: with a proper difficulty? :)
10:45 < gmaxwell> huh?!
10:46 < tonikt> I can live with that
10:46 < gmaxwell> ...
10:46 < tonikt> I will compare the hashes against the merkle from the block
10:46 < sipa> which you don't have yet?
10:46 < gmaxwell> So now you have to fetch the whole merkle tree first. keep adding overhead.. (you'll end up with bip37 in a few more minutes)

10:46 < tonikt> The header is 80 bytes long
10:47 < sipa> the average transaction is 250 bytes or so
10:47 < sipa> a txid is 32
10:47 < tonikt> the minimal transaction is 250 or so :)
10:47 < sipa> that means you have to download 1/8 of the block's size before you can make any decision
10:48 < tonikt> I can live with that as well
10:48 < gmaxwell> tonikt: and again, you don't need to change the p2p protocol to expirement— you can just use an alternative p2p protocol, and simulate actual internet conditions.

10:48 < tonikt> its still 8:1 compression :)
10:48 < tonikt> I know what I can experiment with, guys
10:49 < tonikt> its just that I dont need to experiment to know that it would be a good thing to do
10:49 < gmaxwell> There are some bandwidth delay mixtures where some strategies are better than other ones, and different strategies are better in other conditions.
10:49 < tonikt> like this think recently, that they fixed
10:49 < tonikt> a peer sends you a longer tx than it should be
10:50 < sipa> that was actually an intentional design decision
10:50 < tonikt> you should ban it - but you cant, because it is likely a legit client, with a bug :)
10:50 < gmaxwell> tonikt: imagine— for a moment— that you peers are on mars with a 40 minute latency, and your bandwidth to each peer is 1gbit/sec, and your pay 1 BTC per megabyte transfered in aggregate. What is the optimal strategy?
10:50 < tonikt> gmaxwell: in this case I get your point
10:51 < tonikt> ... but I thought that we were on Earth
10:51 < phantomcircuit> gmaxwell, put your btc node on earth and send it instructions
10:51 < gmaxwell> tonikt: I used an extreme example because you keep rejecting the idea that different situations require different tradeoffs and you keep suggesting ones which have additional round trips, when it's possible to do this _without_ adding them.. so clearly you're not thinking about _something_.
10:51 < sipa> tonikt: banning would be a very bad idea - not because there are buggy clients that add random junk to transactions, but because you'd be hurting the ones forwarding an attacker's junk, not the attacker themself
10:52 < tonikt> so really, please, add an option to ask for tx/block size without a need to download it and allow to do getdata for tx giving a block hash as a reference - that's all I ask :)
10:52 < gmaxwell> banning on transitive behavior is a superfantastic way to convert mining dos attacks into network partitioning.
10:52 < gmaxwell> tonikt: We will not add that. Sorry.
10:53 < tonikt> gmaxwell: I know :)
10:53 < tonikt> but dont tell me later, that I did not suggest it :P
10:53 < sipa> tonikt: the first is unauthenticated data (someone can just lie, and you can claim you protect against it, but your optimal behaviour still depends on them being honest - i really prefer solutions that do not need such an assumption)
10:53 < gmaxwell> tonikt: Please create an alternative transport— in the process you'll learn something about the evils of roundtrips for performance, come up with a better proposal which is potentially useful.
10:54 < gmaxwell> Even a protocol that depends on honesty would be not the end of the world as an alternative transport: just use it between friends.

10:54 < tonikt> sipa: but if someone lies, you will find it out and you will ban it - that is the whole point
10:54 < gmaxwell> but it's not something that makes a lot of sense as the standard p2p protocol.
10:54 < sipa> tonikt: and you'll still have lost time doing so
10:54 < sipa> tonikt: something the attacker may not care about, but you do
10:55 < tonikt> sipa: yes, but you will pay this time to ban the bastard. now you have the same problem, but you cannot ban the bastared
10:55 < gmaxwell> tonikt: IPs are cheap, we regularly get trolls on IRC with access to thousands of IPs. Someone doing that could force all your nodes into wasting a ton of bandwidth and fall back to single peer fetching.
10:55 < gmaxwell> (and make you take many times longer to fetch the block)
10:55 < gmaxwell> I fully endorse my mining competition adopting such a protocol. :P
10:55 < tonikt> gmaxwell: 99kb is probably cheaper than in IP
10:55 < tonikt> an*
10:56 < tonikt> so I can download 99kb from the IP,  just to ban it for being wrong
10:56 < tonikt> especially if I know the size up front and I do not donload anything bigger than 10kb
10:57 < gmaxwell> or, you could, you know, use a protocol which doesn't depend on unauthenticated data and which doesn't require extra round trips.. and still fetches in parallel (if the bandwidth/latency ratios make it profitable to do so)
10:57 < tonikt> .. and if I see the 10001th byte - I can it already at that moment

10:58 < gmaxwell> And, in fact, BIP37 already gives us that, along with automatic (zero round trip) elimiating of known-already-sent data.
10:58 < tonikt> bip37 was a nice invention. the only problem with it is that nobody wants to use it
10:59 < tonikt> why wont you once invent something that people would want to use it, for a change? ;)
10:59 < gmaxwell> what are you talking about??
10:59 < gmaxwell> lol
10:59 < gmaxwell> every peer connected to me at the moment supports bip37.
10:59 < tonikt> supports, but does not get adventage of it
10:59 < gmaxwell> Now I think you're just trolling.
11:00 < tonikt> I guess you can always find someone who'd kick me out :)
11:00 < sipa> between satoahi clients, he's right
11:00 <@gmaxwell> I suppose I could. :P
11:00 <@gmaxwell> but seriously, what the heck.
11:00 < sipa> but my cell phone just loves bip37

11:01 < gmaxwell> tonikt: all the bitcoinj clients happily use it— but we don't think parallel fetching is currently useful in the satoshi client.
11:01 < phantomcircuit> iirc the fetch queue ends up pulling more blocks even before the queue has been processed right
11:01 < gmaxwell> After a bunch of archectural changes it might be useful.

11:02 < sipa> it's definitely useful at the block level
11:02 < phantomcircuit> so the pipeline stays full
11:02 < sipa> and we don't even do it there
11:02 < sipa> and we absolutely should
11:02 < gmaxwell> sipa: ::nods::
11:03 < sipa> phantomcircuit: yes, you can have up to 500 queued getdata requests
11:03 < sipa> whuch may take minutes to download
11:03 < gmaxwell> Doesn't involve the unauthenticated data / latency tradeoffs. And couple hundred k blocks are large enough that there isn't an overhead tax from doing that— beyond fetching the headers seperately.
11:11 < tonikt> so anyway guys, to wrap up, I did not mean to be mean, just to indicate my needs. and I appreciate your advise, but I am not going to make a network simulation just to convince you :)
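The complementary-filter trick sipa describes in the log can be sketched in a few lines (a toy model only: `FILTER_BITS` is tiny and `bit_index` is a stand-in for BIP37's murmur3 hashing, not the real wire format). With nHashFunctions=1, each transaction maps to a single filter bit, so peers given complementary bit patterns serve disjoint subsets of a block with no per-transaction coordination and no extra round trips.

```python
# Toy sketch of splitting a block download across two peers using
# complementary BIP37-style filters with nHashFunctions=1.
import hashlib

FILTER_BITS = 64  # toy size; a real BIP37 filter can be far larger

def bit_index(txid: bytes) -> int:
    # Stand-in for BIP37's murmur3 with nHashFunctions=1: one bit per tx.
    return int.from_bytes(hashlib.sha256(txid).digest()[:4], "little") % FILTER_BITS

def make_complementary_filters():
    # Peer A gets the even bits, peer B the odd bits.
    filter_a = set(range(0, FILTER_BITS, 2))
    filter_b = set(range(FILTER_BITS)) - filter_a
    return filter_a, filter_b

def matches(filt: set, txid: bytes) -> bool:
    return bit_index(txid) in filt

txids = [bytes([i]) * 32 for i in range(100)]  # fake 32-byte txids
fa, fb = make_complementary_filters()
from_a = [t for t in txids if matches(fa, t)]
from_b = [t for t in txids if matches(fb, t)]

# Every transaction comes from exactly one of the two peers.
assert set(from_a) | set(from_b) == set(txids)
assert not (set(from_a) & set(from_b))
```

The same idea extends to N peers by partitioning the bit positions N ways, which is why sipa notes the peers "figure it out themself" with no bookkeeping on the downloader's side.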


So, you proposed instead requesting blocks 32KB at a time using more round trips - basically demanding protocol changes without doing any testing or analysis to determine the benefit. We invited you to try out the protocol and observe the results, rather than merely telling you why we expected it would harm performance in common cases.

What you proposed is basically the _opposite_ of compact blocks (which eliminates round trips, and avoids transmitting most of the block at all). :( Disappointing that you are posting here claiming it was your proposal.
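The round-trip cost argued in the log can be made concrete. This is a sketch of gmaxwell's rule of thumb (never request less than the bandwidth-delay product) applied to sipa's example figures of a 10 Mbit/s link and a 400 ms request/response round trip; the exact result at those numbers is 500 kB, the same ballpark as sipa's ~400-kilobyte estimate.

```python
# Bandwidth-delay product: bytes that could have been in flight during
# one request/response round trip on an otherwise busy link.
def bandwidth_delay_product(bandwidth_bps: float, rtt_s: float) -> float:
    """Bytes 'wasted' per extra round trip."""
    return bandwidth_bps / 8.0 * rtt_s

# sipa's example: 10 Mbit/s link, 200 ms ping => 400 ms per round trip.
wasted = bandwidth_delay_product(10e6, 0.4)
print(f"{wasted / 1000:.0f} kB per extra round trip")  # 500 kB
```

At 1 MB-era block sizes, a single extra round trip therefore costs as much transfer time as roughly half a block, which is why chunked fetching with per-chunk requests loses on latent links.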
legendary
Activity: 2053
Merit: 1356
aka tonikt
January 14, 2017, 12:37:38 PM
#23
Seriously man, this happens all the time.
Whenever a person from outside the circle comes with an idea, you either tell him the idea is stupid or that it's just not worth working on.

I've seen it so many times that I'm sick of it already.
Just me, years ago, trying to talk on #bitcoin-dev about ways to speed up block download times. That was a no-no - block propagation didn't need improving because you didn't think it was important... It almost got me banned from the channel. And yet here we are in 2016, with a new brilliant feature that the bitcoin core team is so proud of: compact fucking blocks!

Or: how many times have I tried to discuss a way of solving the bootstrapping issue once and for all, by extending the protocol to allow a secured distribution of the utxo db?
No fucking way to discuss it, because first you don't find it important, then you are 'way ahead' of me with the design, and then you still don't know how to do the hashing of the fucking records - and that took you like 4 years to realize...
Just like I said: in the end it's still going to be done, except that it will take at least 10 years, because you're too busy now with other stuff and this specific thing is too big of a deal for your ego to let anyone from outside claim credit for solving it.
staff
Activity: 4284
Merit: 8808
January 14, 2017, 11:49:28 AM
#22
Just go back to the top of this topic.
Guys came with some ideas to optimize the lib - maybe not the most brilliant ones, but definitely not stupid ones...
And what was your reaction?
You basically told them off.
You do it all the time.

Wtf? Someone asked about two publications, asking if each would be helpful. I responded that one would likely not be and the other would be somewhat, and I pointed out it would be fairly easy to try. I certainly didn't tell them off!

I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production level - I was just fooling around* with the data flow so that the data gets from one end to the other faster, with a smaller code footprint. I've never run them on anything besides my Q8200, and I'm wondering about the behavior of modern CPUs. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) + these 2, and perhaps a time ./tests as a more real-world performance check.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism, down to 57.2 seconds on my underclocked Q8200 @ 1.86GHz), although the memory version seems faster in the benchmarks. I have a theory on
Neat. You shouldn't benchmark using the tests: they're full of debugging instrumentation that distorts the performance, and they spend a lot of their time on random things. Compile with --enable-benchmark and use the benchmarks. :)


A quick check on an i7-4600U doesn't give a really clear result:


Before:
field_sqr: min 0.0915us / avg 0.0917us / max 0.0928us
field_mul: min 0.116us / avg 0.116us / max 0.117us
field_inverse: min 25.2us / avg 25.7us / max 28.5us
field_inverse_var: min 13.8us / avg 13.9us / max 14.0us
field_sqrt: min 24.9us / avg 25.0us / max 25.2us
ecdsa_verify: min 238us / avg 238us / max 239us

After (v1):
field_sqr: min 0.0924us / avg 0.0924us / max 0.0928us
field_mul: min 0.117us / avg 0.117us / max 0.117us
field_inverse: min 25.4us / avg 25.5us / max 25.9us
field_inverse_var: min 13.7us / avg 13.7us / max 14.0us
field_sqrt: min 25.1us / avg 25.3us / max 26.1us
ecdsa_verify: min 237us / avg 237us / max 237us

After (v2):
field_sqr: min 0.0942us / avg 0.0942us / max 0.0944us
field_mul: min 0.118us / avg 0.118us / max 0.119us
field_inverse: min 25.9us / avg 26.0us / max 26.4us
field_inverse_var: min 13.6us / avg 13.7us / max 13.8us
field_sqrt: min 25.6us / avg 25.9us / max 27.8us
ecdsa_verify: min 243us / avg 244us / max 246us
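Read as relative changes, those avg numbers show why the result is "not really clear": v1 is within noise of baseline, while v2 is a couple of percent slower on ecdsa_verify. A quick comparison sketch (values copied from the runs above, in microseconds):

```python
# Relative change of each patched version vs. baseline (avg timings, us).
baseline = {"field_sqr": 0.0917, "field_mul": 0.116, "ecdsa_verify": 238}
v1 = {"field_sqr": 0.0924, "field_mul": 0.117, "ecdsa_verify": 237}
v2 = {"field_sqr": 0.0942, "field_mul": 0.118, "ecdsa_verify": 244}

for name in baseline:
    for label, run in (("v1", v1), ("v2", v2)):
        delta = (run[name] - baseline[name]) / baseline[name] * 100
        print(f"{name} {label}: {delta:+.1f}%")
```

All deltas land in the low single digits, small enough that run-to-run variance on a laptop CPU (turbo, thermal throttling) can dominate.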


legendary
Activity: 2053
Merit: 1356
aka tonikt
January 14, 2017, 11:33:23 AM
#21

Quote
Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.
Which is just something you're purely making up. And it's unfortunate, because if other people read it and don't know that you're extrapolating based on your imagination and fears, they might not contribute where they otherwise would. That does the world a disservice.

Man, how long have I been here?

What I've said has zero to do with fears and everything to do with my observations and experience.

Just go back to the top of this topic.
Guys came with some ideas to optimize the lib - maybe not the most brilliant ones, but definitely not stupid ones...
And what was your reaction?
You basically told them off.
You do it all the time.

If there was a ranking of people scaring newcomers away from contributing into the code,  you'd be on top of it.
Because you always know better. Some  others 'core devs' have quite similar characters. Just a few,  but they are enough to scare new people from contributing. Especially a talented, brilliant people won't be willing to put up with this shit,  that you guys throw at them.
legendary
Activity: 1708
Merit: 1049
January 14, 2017, 10:54:56 AM
#20
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvement (within measurement tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.

Rico
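
For anyone curious why the endomorphism roughly halves the scalar-multiplication work in verification: a minimal sketch (plain Python using the standard published GLV constants for secp256k1, not this library's code) checking that the cheap map phi(x, y) = (beta*x, y) really computes lambda*P, which is what lets a scalar be split into two half-length pieces:

```python
# secp256k1 curve parameters and the published GLV endomorphism constants.
p  = 2**256 - 0x1000003D1   # field prime
n  = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141  # group order
Gx = 0x79BE667EF9DCBBAC55A06295CE870B07029BFCDB2DCE28D959F2815B16F81798
Gy = 0x483ADA7726A3C4655DA4FBFC0E1108A8FD17B448A68554199C47D08FFB10D4B8
beta = 0x7AE96A2B657C07106E64479EAC3434E99CF0497512F58995C1396C28719501EE
lam  = 0x5363AD4CC05C30E0A5261C028812645A122E22EA20816678DF02967C1B23BD72

def add(P, Q):
    """Affine point addition on y^2 = x^3 + 7 over GF(p); None is infinity."""
    if P is None: return Q
    if Q is None: return P
    if P[0] == Q[0] and (P[1] + Q[1]) % p == 0: return None
    if P == Q:
        s = (3 * P[0] * P[0]) * pow(2 * P[1], -1, p) % p
    else:
        s = (Q[1] - P[1]) * pow(Q[0] - P[0], -1, p) % p
    x = (s * s - P[0] - Q[0]) % p
    return (x, (s * (P[0] - x) - P[1]) % p)

def mul(k, P):
    """Simple double-and-add scalar multiplication."""
    R = None
    while k:
        if k & 1: R = add(R, P)
        P = add(P, P)
        k >>= 1
    return R

# beta and lambda are cube roots of unity mod p and mod n respectively
assert pow(beta, 3, p) == 1
assert pow(lam, 3, n) == 1
# lambda*G computed the slow way equals the one-multiplication map (beta*Gx, Gy)
assert mul(lam, (Gx, Gy)) == (beta * Gx % p, Gy)
```

Since multiplying by lambda costs only one field multiplication, k*P can be rewritten as k1*P + k2*phi(P) with k1, k2 about half the bit length of k, cutting the doubling work roughly in half.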


I'll upload here 2 versions of /src/field_5x52_asm_impl.h that I've kind of hacked, one using memory, the other xmm registers.

The commentary is not good because it's not production level - just fooling around* with the data flow so that the data get from one end to the other faster, with a smaller code footprint. I've never had them run on anything besides my Q8200, and I'm wondering about the behavior of modern cpus. I'd appreciate it if you (or anyone else) could run a benchmark (baseline) + these 2, and perhaps a time ./tests as a more real-world performance measure.

If I do a time ./tests, both run faster by a second (58.2 seconds baseline with endomorphism, down to 57.2 seconds on my underclocked Q8200 @ 1.86GHz), although the memory version seems faster in the benchmarks. I have a theory on why the xmm version sucks in benchmarks (OS context switches being more expensive because they also have to save the xmm register set?), but the bottom line is it seems faster than baseline when doing a timed test run (a more real-world application)... Security-wise, I wouldn't want to leave data hanging around in the XMM registers though.

(*What I wanted to do was to reduce opcode size, instruction count and memory accesses by cutting the number of temporary variables from 3 to 2 or 1, while interleaving muls with adds.)
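
The magic constants 0xfffffffffffff and 0x1000003d10 that recur in the listings below come from the 5x52 field representation. A quick sanity check of where they come from (plain Python, independent of the library's code):

```python
# secp256k1's field prime: p = 2^256 - 0x1000003D1
p = 2**256 - 0x1000003D1

# Field elements are stored as five 52-bit limbs, so a full 10-limb product
# spans up to bit 2^260-ish.  Folding the high part back down uses 2^260 mod p.
M = 0xFFFFFFFFFFFFF   # 52-bit limb mask, seen as "andq %%r15" in the asm
R = 0x1000003D10      # the "d += (c & M) * R" reduction constant in the asm

assert M == 2**52 - 1
# 2^256 == 0x1000003D1 (mod p), hence 2^260 = 16 * 2^256 == 0x1000003D10 (mod p)
assert pow(2, 260, p) == R

# Toy check: replacing a contribution at 2^260 by a contribution times R
# preserves the value mod p, which is exactly what the carry folding does.
hi, lo = 0xDEADBEEF, 12345
assert (hi * 2**260 + lo) % p == (hi * R + lo) % p
```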


Version 1 - normal/memory:

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            r15:rcx = d
 *            r10-r14 = a0-a4
 *            rbx     = b
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1, tmp2;
__asm__ __volatile__(
    "movq 24(%%rsi),%%r13\n"
    "movq 0(%%rbx),%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d += a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq %%rax,%%r9\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "movq $0x1000003d10,%%rcx\n"
    "movq $0xfffffffffffff,%%r15\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d = a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "mulq %%rcx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%q1\n"  
    /* d >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%q2\n"
    /*"movq %q2,%%r15\n" */
    "movq 0(%%rbx),%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%r15) */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */

    /* u0 = (u0 << 4) | tx (%%r15) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq 0(%%rbx),%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq 8(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq 24(%%rbx),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %q1,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq 32(%%rbx),%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%r11\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %q2,%%rsi\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "xor %%ecx,%%ecx\n"
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    /* c += d * R */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1), "=m"(tmp2)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
  uint64_t tmp1a;
__asm__ __volatile__(
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 became rsi*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "movq $0x1000003d10,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%r8\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%r15\n"
    "shrq $48,%%r15\n" /*Q3=R15*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%q1\n"/*Q2 OUT - renamed to q1*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq $0xfffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%r15,%%rdx\n" /*Q3 - R15 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "movq $0xfffffffffffff,%%r15\n" /*R15 back in its place*/
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %q1,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq $0x1000003d10,%%r13\n"
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1*/
    /* c += d * R */
    "mulq %%r13\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
: "+S"(a), "=m"(tmp1a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif


Version 2 - more xmm reg use

Code:
/**********************************************************************
 * Copyright (c) 2013-2014 Diederik Huys, Pieter Wuille               *
 * Distributed under the MIT software license, see the accompanying   *
 * file COPYING or http://www.opensource.org/licenses/mit-license.php.*
 **********************************************************************/

/**
 * Changelog:
 * - March 2013, Diederik Huys:    original version
 * - November 2014, Pieter Wuille: updated to use Peter Dettman's parallel multiplication algorithm
 * - December 2014, Pieter Wuille: converted from YASM to GCC inline assembly
 */

#ifndef _SECP256K1_FIELD_INNER5X52_IMPL_H_
#define _SECP256K1_FIELD_INNER5X52_IMPL_H_

SECP256K1_INLINE static void secp256k1_fe_mul_inner(uint64_t *r, const uint64_t *a, const uint64_t * SECP256K1_RESTRICT b) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            r15:rcx = d
 *            r10-r14 = a0-a4
 *            rbx     = b
 *            rdi     = r
 *            rsi     = a / t?
 */
/* xmm0 = q1 xmm6=q2    */
/* This has 17 mem accesses + 17 xmm uses vs 35 mem access and no xmm use*/

__asm__ __volatile__(
    "push %%rbx\n"
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq %%rdi, %%xmm3\n"
    "movq 0(%%rbx),%%rdi\n"
    "movq 8(%%rbx),%%rbp\n"
    "movq 16(%%rbx),%%rsp\n"
    "movq %%rdi,%%xmm4\n"
    
    "movq 24(%%rsi),%%r13\n"
    "movq %%rdi,%%rax\n"
    "movq 32(%%rsi),%%r14\n"
    /* d += a3 * b0 */
    "mulq %%r13\n"
    "movq 0(%%rsi),%%r10\n"
    "movq %%rax,%%r9\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rsi\n"
    /* d += a2 * b1 */
    "mulq %%r12\n"
    "movq 24(%%rbx),%%rcx\n"
    "movq 32(%%rbx),%%rbx\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b2 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d = a0 * b3 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* c = a4 * b4 */
    "mulq %%r14\n"
    "movq $0xfffffffffffff,%%r15\n"
    "movq %%rax,%%r8\n"
    /* d += (c & M) * R */
    "andq %%r15,%%rax\n"
    "shrdq $52,%%rdx,%%r8\n"     /* c >>= 52 (%%r8 only) */
    "movq $0x1000003d10,%%rdx\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t3 (tmp1) = d & M */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%xmm0\n"  
    /* d >>= 52 */
    "movq %%rdi,%%rax\n"
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* d += a4 * b0 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b1 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rcx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b3 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a0 * b4 */
    "mulq %%r10\n"
    "addq %%rax,%%r9\n"
     /* d += c * R */
    "movq $0x1000003d10,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    "mulq %%r8\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* t4 = d & M (%%r15) */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    "xor %%esi,%%esi\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rax,%%r15\n"
    "shrq $48,%%r15\n" /*Q3*/
    
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rdx\n"
    "andq %%rdx,%%rax\n"
    "movq %%rax,%%xmm6\n"
    /*"movq %q2,%%r15\n" */
    "movq %%rdi,%%rax\n"
    /* c = a0 * b0 */
    "mulq %%r10\n"
    "movq %%rcx,%%xmm5\n"
    "movq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += a4 * b1 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b2 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b3 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a1 * b4 */
    "mulq %%r11\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    
    "movq %%r15,%%rax\n"  /*Q3 transfered*/
    
    /* u0 = d & M (%%r15) */
    "movq %%r9,%%rdx\n"
    "shrdq $52,%%rsi,%%r9\n"
    "movq $0xfffffffffffff,%%r15\n"
    "xor %%esi, %%esi\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */

    /* u0 = (u0 << 4) | tx (%%r15) */
    "shlq $4,%%rdx\n"
    "orq %%rax,%%rdx\n"
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,%%rdx\n"
        /* c >>= 52 */
    "movq %%rdi,%%rax\n"
    "movq %%xmm3, %%rdi\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    "movq %%rdx,0(%%rdi)\n"
    /* c += a1 * b0 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b1 */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b2 */
    "mulq %%r14\n"
    "addq %%rax,%%r9\n"
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b3 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a2 * b4 */
    "mulq %%r12\n"
    "addq %%rax,%%r9\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "movq $0x1000003d10,%%rdx\n"
    "andq %%r15,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%xmm4,%%rax\n"
    "shrdq $52,%%rcx,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += a2 * b0 */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%rbp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a1 * b1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%rsp,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c += a0 * b2 (last use of %%r10 = a0) */
    "mulq %%r10\n"
    "addq %%rax,%%r8\n"
    /* fetch t3 (%%r10, overwrites a0), t4 (%%r15) */
    "movq %%xmm5,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a4 * b3 */
    "mulq %%r14\n"
    "movq %%xmm0,%%r10\n"
    "xor %%esi, %%esi\n"
    "addq %%rax,%%r9\n"
    "movq %%rbx,%%rax\n"
    "adcq %%rdx,%%rsi\n"
    /* d += a3 * b4 */
    "mulq %%r13\n"
    "addq %%rax,%%r9\n"
    "movq $0x1000003d10,%%rbx\n"
    "adcq %%rdx,%%rsi\n"
    /* c += (d & M) * R */
    "movq %%r9,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rbx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* d >>= 52 (%%r9 only) */
    "shrdq $52,%%rsi,%%r9\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t3 */
    "movq %%r9,%%rax\n"
    "addq %%r10,%%r8\n"
    "xor %%ecx,%%ecx\n"
    /* c += d * R */
    "mulq %%rbx\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "movq %%xmm6,%%rsi\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%rcx\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%rcx,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%rsi,%%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"
    "pop %%rbx\n"
: "+S"(a)
: "b"(b), "D"(r)
: "%rax", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

SECP256K1_INLINE static void secp256k1_fe_sqr_inner(uint64_t *r, const uint64_t *a) {
/**
 * Registers: rdx:rax = multiplication accumulator
 *            r9:r8   = c
 *            rcx:rbx = d
 *            r10-r14 = a0-a4
 *            r15     = M (0xfffffffffffff)
 *            rdi     = r
 *            rsi     = a / t?
 */
/* tmp1a = xmm0 */
__asm__ __volatile__(
    "movq %%rsp, %%xmm1\n"
    "movq %%rbp, %%xmm2\n"
    "movq 0(%%rsi),%%r10\n"
    "movq 8(%%rsi),%%r11\n"
    "movq 16(%%rsi),%%r12\n"
    "movq 24(%%rsi),%%r13\n"
    "movq 32(%%rsi),%%r14\n"
    "leaq (%%r10,%%r10,1),%%rax\n"
    "movq $0xfffffffffffff,%%r15\n"
    /* d = (a0*2) * a3 */
    "mulq %%r13\n"
    "movq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "movq %%rdx,%%rcx\n"
    /* d += (a1*2) * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"
    "movq %%r14,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* c = a4 * a4 */
    "mulq %%r14\n"
    "movq %%rax,%%r8\n"
    "movq %%rdx,%%r9\n"
    /* d += (c & M) * R */
    "movq $0x1000003d10,%%rsp\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r9,%%r8\n"
    /* t3 (tmp1) = d & M */
    "movq %%rbx,%%rsi\n"
    "andq %%r15,%%rsi\n" /*Q1 OUT*/
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    /* a4 *= 2 */
    "movq %%r10,%%rax\n"
    "addq %%r14,%%r14\n"
    /* d += a0 * a4 */
    "mulq %%r14\n"
    "xor %%ecx,%%ecx\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r11,%%r11,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d+= (a1*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a2 * a2 */
    "mulq %%r12\n"
    "addq %%rax,%%rbx\n"

    /* d += c * R */
    "movq %%r8,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    "mulq %%rsp\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* t4 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "andq %%r15,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* tx = t4 >> 48 (tmp3) */
    "movq %%rdx,%%rbp\n"
    "shrq $48,%%rbp\n" /*Q3 OUT*/
    /* t4 &= (M >> 4) (tmp2) */
    "movq $0xffffffffffff,%%rax\n"
    "andq %%rax,%%rdx\n"
    "movq %%rdx,%%xmm0\n"/*Q2 OUT*/
    /* c = a0 * a0 */
    "movq %%r10,%%rax\n"
    "mulq %%r10\n"
    "movq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%rdx,%%r9\n"
    /* d += a1 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "leaq (%%r12,%%r12,1),%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += (a2*2) * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* u0 = d & M (%%rsi) */
    "movq %%rbx,%%rdx\n"
    "movq %%r15,%%rax\n"
    "andq %%rax,%%rdx\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* u0 = (u0 << 4) | tx (%%rsi) */
    "shlq $4,%%rdx\n"
    "orq %%rbp,%%rdx\n" /*Q3 RETURNS*/
    /* c += u0 * (R >> 4) */
    "movq $0x1000003d1,%%rax\n"
    "mulq %%rdx\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"    
    /* r[0] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,0(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* a0 *= 2 */
    "addq %%r10,%%r10\n"
    /* c += a0 * a1 */
    "movq %%r10,%%rax\n"
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r12,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a2 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%rcx\n"
    /* d += a3 * a3 */
    "mulq %%r13\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 */
    "shrdq $52,%%rcx,%%rbx\n"
    "xor %%ecx,%%ecx\n"
    /* r[1] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,8(%%rdi)\n"
    /* c >>= 52 */
    "movq %%r10,%%rax\n"
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r9,%%r9\n"
    /* c += a0 * a2 (last use of %%r10) */
    "mulq %%r12\n"
    "addq %%rax,%%r8\n"
    "movq %%r11,%%rax\n"
    "movq %%xmm0,%%r12\n" /*Q2 RETURNS*/
    "adcq %%rdx,%%r9\n"
    /* fetch t3 (%%r10, overwrites a0),t4 (%%rsi) */
    /*"movq %q1,%%r10\n" */
    /* c += a1 * a1 */
    "mulq %%r11\n"
    "addq %%rax,%%r8\n"
    "movq %%r13,%%rax\n"
    "adcq %%rdx,%%r9\n"
    /* d += a3 * a4 */
    "mulq %%r14\n"
    "addq %%rax,%%rbx\n"
    "adcq %%rdx,%%rcx\n"
    /* c += (d & M) * R */
    "movq %%rbx,%%rax\n"
    "andq %%r15,%%rax\n"
    "mulq %%rsp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r9\n"
    /* d >>= 52 (%%rbx only) */
    "shrdq $52,%%rcx,%%rbx\n"
    /* r[2] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,16(%%rdi)\n"
    /* c >>= 52 */
    "shrdq $52,%%r9,%%r8\n"
    "xorq %%r14,%%r14\n"
    /* c += t3 */
    "movq %%rbx,%%rax\n"
    "addq %%rsi,%%r8\n" /*RSI = Q1 RETURNS*/
    /* c += d * R */
    "mulq %%rsp\n"
    "movq %%xmm1, %%rsp\n"
    "movq %%xmm2, %%rbp\n"
    "addq %%rax,%%r8\n"
    "adcq %%rdx,%%r14\n"
    /* r[3] = c & M */
    "movq %%r8,%%rax\n"
    "andq %%r15,%%rax\n"
    "movq %%rax,24(%%rdi)\n"
    /* c >>= 52 (%%r8 only) */
    "shrdq $52,%%r14,%%r8\n"
    /* c += t4 (%%r8 only) */
    "addq %%r12, %%r8\n"
    /* r[4] = c */
    "movq %%r8,32(%%rdi)\n"

: "+S"(a)
: "D"(r)
: "%rax", "%rbx", "%rcx", "%rdx", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "%r14", "%r15", "cc", "memory");
}

#endif

staff
Activity: 4284
Merit: 8808
January 14, 2017, 10:14:27 AM
#19
Strange. I was under the impression that

allowed me a tiny little bit of moaning.
It's a trivial optimization that already existed in three other places in the code. Thanks for noticing that it hadn't been performed there, and for providing code for one of the two places where it needed to be improved, but come on-- you're just making yourself look foolish here with the rude attitude while you're clearly fairly ignorant about what you're talking about overall. Case in point:

The endomorphism makes verification and ECDH significantly faster. It doesn't do anything else beyond additional endomorphism-related tests.

It does not make pubkey generation faster, and really can't (well, it could be used with some effort to halve the size of the in-memory tables, at a small performance penalty).

It's a little absurd that you insult a ~27% performance increase to verification while bragging about an under-half-percent change to verification performance.  It seems to me that you're trying to compensate for ignorance by insulting a lot; it might fool a few people who just don't know much of anything-- but not anyone else.  And it prevents you from learning. I do think wanna-be applies, in spades, and if you keep up that attitude it will probably continue to do so.

It wasn't 'the collective'.
Sipa wrote the entire library,  from scratch,  all by himself.
That is far from an accurate history, but it doesn't matter-- Pieter did do the lion's share of the work, but he didn't do it in isolation. Less fortunate is where your internal imagination continues, as you write:

Quote
Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.
Which is just something you're purely making up. And it's unfortunate because if other people read it and don't know that you're extrapolating based on your imagination and fears, they might not contribute where they otherwise might. That does the world a disservice.

Quote
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions?  The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it, then no document is going to help you anyway.

But what is strange is that it is _extensively_ documented.

E.g. above, rico666 commented on secp256k1_fe_cmov -- even if someone were ignorant enough of the subject and the conventions used not to immediately know that this function performs a conditional move of a field element, or of programming enough not to know what a conditional move is, there is documentation (in this case apparently added by me):

Code:
/** If flag is true, set *r equal to *a; otherwise leave it. Constant-time. */
static void secp256k1_fe_cmov(secp256k1_fe *r, const secp256k1_fe *a, int flag);

... even though this is purely internal code and is not accessible to an end user of the library.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 14, 2017, 03:21:41 AM
#18
There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And just moaning isn't very professional.

Strange. I was under the impression that

Code:
field_get_b32: min 0.647us / avg 0.666us / max 0.751us
field_set_b32: min 0.551us / avg 0.571us / max 0.624us

becomes

field_get_b32: min 0us / avg 0.0000000477us / max 0.000000238us
field_set_b32: min 0us / avg 0.0000000238us / max 0.000000238us

allowed me a tiny little bit of moaning.


Rico
legendary
Activity: 2053
Merit: 1356
aka tonikt
January 13, 2017, 05:12:09 PM
#17
There is no question that at this moment sipa's secp256k1 lib is the fastest solution on the market.

And you can complain all you want about messy coding or poor documentation, but unless you provide an alternative to prove that it can be done so much better... well, then it's just going to be moaning.

And just moaning isn't very professional.

Sipa didn't go to openssl forum saying how shitty their implementation was - he just made a better one,  to prove the point. And he proved it so well that now nobody bothers to make an effort in beating him. Smiley
legendary
Activity: 2053
Merit: 1356
aka tonikt
January 13, 2017, 04:46:57 PM
#16

Sipa you say? Why did he abandon the project? Was it just some proof of work?

Proof of work that has already found so many applications, including one in your project. I guess you can call it whatever you like.

I don't know him and can't speak for him,  but I wouldn't say he abandoned it. Rather, he decided it was complete enough and moved on with his life, like we do all the time.  Even satoshi did that with his big project.
And you're also not going to work on a single project all your life,  are you?
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 01:35:04 PM
#15
There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.

Ok - I did.

bench_verify shows speedup with endomorphism

ecdsa_verify: min 42.0us / avg 42.2us / max 43.0us  (with)
ecdsa_verify: min 57.7us / avg 57.8us / max 58.4us  (without)

bench_internal shows no improvements (within measure tolerance) except one:

wnaf_const: min 0.0887us / avg 0.0920us / max 0.102us (with)
wnaf_const: min 0.155us / avg 0.161us / max 0.171us     (without)

I doubt this would cause the speedup from above.
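For context, a sketch of why endomorphism helps verification specifically and barely shows up in the internal benchmarks (this is my summary of the GLV method, not something stated in the thread):

```latex
% secp256k1 admits an efficiently computable endomorphism
\phi(x,y) = (\beta x \bmod p,\; y), \qquad \beta^3 \equiv 1 \pmod{p},
% which acts on the group as scalar multiplication by \lambda with
\lambda^3 \equiv 1 \pmod{n}.
% The verifier splits the scalar
k \equiv k_1 + k_2\lambda \pmod{n}, \qquad |k_1|, |k_2| \approx \sqrt{n},
% turning one 256-bit scalar multiplication into two ~128-bit ones that are
% evaluated jointly. That is why ecdsa_verify (and the scalar-splitting helper
% wnaf_const) improves, while field-level primitives are untouched.
```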


Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 01:13:06 PM
#14
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.

Depends on the coder - I guess. As long as I have interest in my hobby project, I want it to be perfect. I for one have no problem admitting that my LBC project still sucks badly in many places right now. I intend to improve documentation, ease of use and speed. One of my goals is to get decent EC performance on a GPU - that's why I am looking into this at all.
 
Quote
What kind of documentation would you even expect from a lib that provides simple EC math functions?  The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it then no document is going to help you anyway.

Well - for me the names are pretty bad.  "secp256k1_fe_cmov" *thumbs up* But it's not only that. The data structures are pretty sideways too. Actually I understand the math pretty well, that's why I am so puzzled about what I see - still unsure if the lib is seriously doing what I think it's doing. I don't think the person who wrote this really cared about performance - he probably just wanted something that sucked less.

Sipa you say? Why did he abandon the project? Was it just some proof of work?


Rico
legendary
Activity: 2053
Merit: 1356
aka tonikt
January 13, 2017, 08:34:34 AM
#13
It wasn't 'the collective'.
Sipa wrote the entire library,  from scratch,  all by himself.
The guys just took his code, added some pretty useless checks and a heavy build system around it, and now it's 'officially' hosted at bitcoin/secp256k1 as the 'community project'. Which means that if you want to change any bit there, you have to play the bitcoin celebrities PR game, which is mostly about indulging the big egos of a few funny characters.

But if you check the history of sipa/secp256k1 you can see that it used to be quite easy to commit optimizations into that code.

It was all done by one person as his personal,  partially experimental  project and I personally admire the work.
And you can't seriously expect a coder to work on his personal hobby project and then deliver it with industry-standard documentation.
What kind of documentation would you even expect from a lib that provides simple EC math functions?  The function names are descriptive enough if you understand the operations they provide. And if you don't understand the math behind it then no document is going to help you anyway.
legendary
Activity: 1708
Merit: 1049
January 13, 2017, 07:30:47 AM
#12
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=skylake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=skylake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=skylake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico


There are 3 benchmarks

bench_internal
bench_verify
bench_sign

which are built by ./configure --enable-benchmark

As for the difference in test speed, it might have to do with some lines in tests.c which indicate a different number of rounds (plus tests for endomorphism) if endomorphism is on.

Build your program with endomorphism (./configure --enable-endomorphism) and report back with the results; it should be faster.
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 06:54:48 AM
#11
One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?

I don't think patents are any problem with the endomorphism code. The code itself is the problem. Not sure which benchmarks you are referring to, but if I take a (very coarse) look at benchmarks on my system, USE_ENDOMORPHISM is nothing you'd like to enable:

Code:
Times for tests:

gcc version 6.3.1 20170109 (GCC)

1) CFLAGS -g -O2
real    0m14.365s
user    0m14.357s
sys     0m0.007s

2) CFLAGS -O3 -march=skylake
real    0m13.549s
user    0m13.547s
sys     0m0.000s

3) CFLAGS -O3 -march=skylake & USE_ENDOMORPHISM 1
real    0m15.660s
user    0m15.660s
sys     0m0.000s

4) CFLAGS -g -O2 & USE_ENDOMORPHISM 1
real    0m16.139s
user    0m16.137s
sys     0m0.000s

5) CFLAGS -g -O2 & undef USE_ASM_X86_64
real    0m14.849s
user    0m14.847s
sys     0m0.000s

6) CFLAGS -O3 -march=skylake & undef USE_ASM_X86_64
real    0m14.520s
user    0m14.517s
sys     0m0.000s

So yes, the beef seems to be in better assembler code and ditching endomorphism.
On modern CPUs, ditch that old gcc too and use -O3 (forget what you've heard about it in the past years).

Rico
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 13, 2017, 04:55:40 AM
#10
Can you show another library for the same application within a factor of _five_ of the performance?  Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Good. We seem to live in different worlds then, each others world seeming absurd to the other one.

I for one find your "argument" of requesting to be shown another lib doing the same thing (within a factor of five of the performance, with 1/5th the documentation) quite absurd. It's like claiming that anything in the world by definition cannot suck when there is nothing else that sucks less. Really absurd.

Quote
You might have noticed that the constant unpacking macro does effectively the same thing: https://github.com/bitcoin-core/secp256k1/blob/master/src/field_5x52.h#L22 but a word at a time.

No I haven't, because reading the secp256k1 code is a major PITA, but thanks for the pointer. That wheel would probably be the next thing I'd have reinvented.

Quote
If you open a PR on the function you should add it to the bench_internal benchmarks as you go.

What's a PR? Press release? Ah. pull request I suppose. You're assuming too much. E.g. that the whole world runs on - or is - git.

Quote
Quote
Is there any place where R&D discussion about further development is ongoing? Before I'd start re-implementing that mess from scratch I'd prefer to participate in some "official endeavor".

same place it's always been https://github.com/bitcoin-core/secp256k1/

If that is the "discussion" place, no wonder I didn't see it. Srsly?

Quote
Disinterest in spoon feeding people who sound like wanna-be thieves whose thefts would scare ordinary people away from Bitcoin, and who can't even be bothered to RTFM (there has been fairly detailed documentation for the library for years), is not evidence of a lack of interest in further development.

My dear young gmaxwell: Evidently, after you did the diligent work of opening a PR  Smiley, benchmarked the code in the process (reformatting it for the worse, but reordering it for the better), and found out and stated that my code is about 6-7 orders of magnitude faster than the original code (and I am not primarily a C hacker) ... you have the guts to use the term "wanna-be" when addressing any of your texts towards me?

The fact alone that such tremendously suboptimal code was there for such a long time should teach you something. Let me assure you, your perception of the situation here is skewed at best. From what I see in the secp256k1 lib, the collective who did the job are good programmers with potential. Motivated, young, inexperienced, but with potential.

If it wasn't for the LBC hobby project of mine, I would never have looked into the mess that is the secp256k1 library. I did, and I commented. You don't like the comment, maybe feel it is insulting, puzzling or absurd. Fine. I have high hopes that before you are in your mid-40s you will understand what my comment was about. As I said: "potential".

Rationalizing the poor state of an open source project is something I have come across in the Linux world since the '90s. To me, your statements are neither new nor original. So should you - some day - come to the conclusion that you could attract a certain kind of programmer when the open source software reaches a certain kind of quality, it'd be swell. Let me tell you it's not about "spoon feeding people who sound like wanna-be thieves". It's about preparing the ground for people who may have twice your experience and otherwise scoff at the project.

No offence - ofc.

Rico
staff
Activity: 4284
Merit: 8808
January 09, 2017, 06:50:01 AM
#9
I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

Can you show another library for the same application within a factor of _five_ of the performance?  Or with more than 1/5th the documentation?

I'm just ... beside myself at your comment, it's not even insulting: it's just too absurd.

Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, about the code used by the core,  he'll probably tell you that it's being maintained by other people now.
Uh, what?


In 5x52 asm I found 2-3% by simply doing what the cpu scheduler should be doing and issuing together the adds + muls (the cpu integer unit typically has one add and one mul unit and ideally we want to be using them at the same time).
Awesome! PRs welcome!

One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
It is potentially patent encumbered for another couple years, restricting it to experimental use.

Quote
So the adcq+mulq are issued together to the mul and add units of the cpu respectively. I'm baffled on why the cpu scheduler wasn't already doing this but then again I do have an older cpu (core2 quad / 45nm) to play with - it might not be an issue with modern ones.
Generally newer cpus work better, but the performance is also more important on older (and in-order cpus like the atoms) since they're slower to begin with.
legendary
Activity: 2053
Merit: 1356
aka tonikt
January 08, 2017, 08:14:55 AM
#8
The original secp256k1 lib is a great piece of work,  but it surely can be improved even more.

Any optimizations are very welcome - I'm sure sipa (the original author) would agree. He's a very nice guy and it's easy to contact him directly (probably best via IRC). Although, about the code used by the core,  he'll probably tell you that it's being maintained by other people now.

Otherwise, just publish your improvements  here - even if not Bitcoin Core,  someone else will use them, trust me,  because time is money Smiley
legendary
Activity: 1708
Merit: 1049
January 08, 2017, 06:41:37 AM
#7
I am also interested in a faster secp256k1.

I've been tampering with the asm a bit and had some luck, especially in scalar performance by merging the various stages into one function. However, do note that it's gcc-only (clang had some issues with rsi/rdi use in the manually inlined function; to work around it one has to move rsi or rdi to some xmm register and restore it before the function ends, which adds a few cycles).

https://github.com/Alex-GR/secp256k1/blob/master/src/scalar_4x64_impl.h
https://github.com/Alex-GR/secp256k1/blob/master/src/field_5x52_asm_impl.h

In 5x52 asm I found 2-3% by simply doing what the cpu scheduler should be doing and issuing together the adds + muls (the cpu integer unit typically has one add and one mul unit and ideally we want to be using them at the same time).

For example:

/* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "movq 16(%%rbx),%%rax\n" <=== this must be moved upwards
    "mulq %%r12\n"

becomes:

/* d += a3 * b1 */
    "movq 8(%%rbx),%%rax\n"
    "mulq %%r13\n"
    "addq %%rax,%%rcx\n"
    "movq 16(%%rbx),%%rax\n"
    "adcq %%rdx,%%r15\n"
    /* d += a2 * b2 */
    "mulq %%r12\n"

So the adcq+mulq are issued together to the mul and add units of the cpu respectively. (edit: apparently the mul+add are in the SIMD unit; in the integer part it's just 3 integer units waiting to do parallel work of any kind, except load/store, which is 2 at a time.) I'm baffled why the cpu scheduler wasn't already doing this, but then again I do have an older cpu (core2 quad / 45nm) to play with - it might not be an issue with modern ones.

Also, of the three temp variables used for scratch storage in the field asm, some can be eliminated by rearranging the code a bit, thus saving a couple of memory accesses.

(ps. I don't claim the code is safe - I've only used it for benchmarks and the builtin test)

One question I have regarding secp256k1 is whether endomorphism is safe, and if yes, shouldn't it be enabled in bitcoin builds if it's faster (benchmarks show that it is)?
legendary
Activity: 1120
Merit: 1037
฿ → ∞
January 08, 2017, 05:22:21 AM
#6
I am also interested in a faster secp256k1.

Unfortunately, it seems Pieter Wuille et al. were neither serious about providing a fast secp256k1 nor about documenting it well.  Roll Eyes

I have done some hacking to some pretty basic functions in there

https://bitcointalksearch.org/topic/m.17365068

- which alone made the LBC generator about 10% faster overall -
and later also changing the field_5x52 code in secp256k1_fe_set_b32 to

Code:
    r->n[0] = (uint64_t)a[31]
            | (uint64_t)a[30] << 8
            | (uint64_t)a[29] << 16
            | (uint64_t)a[28] << 24
            | (uint64_t)a[27] << 32
            | (uint64_t)a[26] << 40
            | (uint64_t)(a[25] & 0xF)  << 48;

    r->n[1] = (uint64_t)((a[25] >> 4) & 0xF)
            | (uint64_t)a[24] << 4
            | (uint64_t)a[23] << 12
            | (uint64_t)a[22] << 20
            | (uint64_t)a[21] << 28
            | (uint64_t)a[20] << 36
            | (uint64_t)a[19] << 44;

    r->n[2] = (uint64_t)a[18]
            | (uint64_t)a[17] << 8
            | (uint64_t)a[16] << 16
            | (uint64_t)a[15] << 24
            | (uint64_t)a[14] << 32
            | (uint64_t)a[13] << 40
            | (uint64_t)(a[12] & 0xF) << 48;

    r->n[3] = (uint64_t)((a[12] >> 4) & 0xF)
            | (uint64_t)a[11] << 4
            | (uint64_t)a[10] << 12
            | (uint64_t)a[9]  << 20
            | (uint64_t)a[8]  << 28
            | (uint64_t)a[7]  << 36
            | (uint64_t)a[6]  << 44;

    r->n[4] = (uint64_t)a[5]
            | (uint64_t)a[4] << 8
            | (uint64_t)a[3] << 16
            | (uint64_t)a[2] << 24
            | (uint64_t)a[1] << 32
            | (uint64_t)a[0] << 40;

I have been pointed to https://github.com/llamasoft/secp256k1_fast_unsafe but am not sure if that is still maintained/developed.
Is there any place where R&D discussion about further development is ongoing? Before I'd start reimplementing that mess from scratch I'd prefer to participate in some "official endeavor".

However, I have little hope that will make sense:

Quote
Signature verification isn't really the limiting factor in Bitcoin Core performance anymore in any case.

Together with other statements from gmaxwell @ github ("this is alpha, don't expect other docs than the source" - something like that, from memory) I see there seems to be not much motivation for further development from the "official side".


Rico