
Topic: Q: Hardware Acceleration, Memory Demand

hero member
Activity: 728
Merit: 500
165YUuQUWhBz3d27iXKxRiazQnjEtJNG9g
April 09, 2013, 04:15:31 AM
#11
Quote
Please ignore most of my previous posting, I think I found the "missing link".

My mistake was to assume that a solution (of the form hash < target) must necessarily exist within the address space spanned by iterating through all possible values of the nonce in the block header. Of course, it is absolutely possible (and even likely) that no solution exists within that address space, so other parts of the block header need to be altered.

Correct.  Once you run out of nonce values, you bump the extranonce, generate a new input hash, and feed it to the miners.

Quote
Question: is there a mathematical proof that double-SHA256(block_header) will output every possible value in the 2**256 hash address space, across all 2**640 possible block header input values?

No.  It's actually believed that there will be both gaps and duplicates in the 2**256 output space, even for a 2**256 input space.  Overall the distribution is believed to be good, but that's difficult (impossible?) to prove.

The term you're looking for:  https://en.wikipedia.org/wiki/Random_oracle
Lots more on that, including some commentary on SHA256 specifically: http://crypto.stackexchange.com/questions/879/what-is-the-random-oracle-model-and-why-is-it-controversial
newbie
Activity: 12
Merit: 0
April 07, 2013, 05:54:54 PM
#10
Please ignore most of my previous posting, I think I found the "missing link".

My mistake was to assume that a solution (of the form hash < target) must necessarily exist within the address space spanned by iterating through all possible values of the nonce in the block header. Of course, it is absolutely possible (and even likely) that no solution exists within that address space, so other parts of the block header need to be altered.

Question: is there a mathematical proof that double-SHA256(block_header) will output every possible value in the 2**256 hash address space, across all 2**640 possible block header input values?
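
For reference, here's what "double-SHA256(block_header)" means concretely -- a minimal Python sketch hashing the well-known 80-byte (640-bit) genesis block header:

Code:
import hashlib

def double_sha256(data: bytes) -> bytes:
    # The hash used for Bitcoin block headers: SHA-256 applied twice.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

# The 80-byte genesis block header, serialized in little-endian order:
# version | previous block hash | merkle root | timestamp | bits | nonce
header = bytes.fromhex(
    "01000000"                                                          # version
    "0000000000000000000000000000000000000000000000000000000000000000"  # prev hash
    "3ba3edfd7a7b12b27ac72c3e67768f617fc81bc3888a51323a9fb8aa4b1e5e4a"  # merkle root
    "29ab5f49"                                                          # timestamp
    "ffff001d"                                                          # bits (target)
    "1dac2b7c"                                                          # nonce
)

# Block hashes are conventionally displayed byte-reversed:
print(double_sha256(header)[::-1].hex())
# -> 000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f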

Please send your comments, let me know if I'm right or wrong.
newbie
Activity: 12
Merit: 0
April 07, 2013, 10:44:37 AM
#9
Wow, thanks bunches for this explanation! I guess I have to ponder over it for a while, to make sure that I fully understand it.

What I'd like to know:

> The coinbase output contains a public key of the miner, thus, the coinbase hash will always be different for different miners.

Am I right to assume that by "miner" you mean a running instance of a mining client (which would perhaps in most cases be just a single one)? ("instance" == image of the mining application in memory, as opposed to a thread spawned by the mining application.) This would mean that in a parallel system architecture, where every (hardware) execution pipeline would be assigned a separate miner thread (well, that's the way I'd do it), all execution pipelines would still share one common coinbase hash, hence the same extraNonce.

> When the header nonce rolls over, the extranonce is incremented.

This is one of the parts where I'm lost. If I start hashing on block header x (which remains static until a new getwork request is issued) with nonce 0, and keep hashing block header x until the nonce reaches its maximum value (2**32 - 1), the nonce will never roll over ... or would it? (If so, when, and why?)

Putting this and the part about the extraNonce together, it would mean that in a parallel system architecture, the extraNonce would never change without an explicit new getwork request. However, this would leave the entire address space, which is opened up by introducing the extraNonce, unused by that one particular miner client.

Alternatively, in such a system architecture, explicit measures would need to be taken in the code to allow for rollover of the nonce.

Please, let me know what you think, and correct me. I guess I'm still somewhat confused about how all the details of the algorithm work together.
legendary
Activity: 1526
Merit: 1134
April 07, 2013, 09:13:15 AM
#8
The extraNonce is a piece of data in the input of the coinbase transaction (that's an area where it's possible to stuff arbitrary data). When the header nonce rolls over, the extranonce is incremented. This changes the hash of the coinbase transaction which in turn changes the merkle root, which is in the header and thus the hash of the header changes as well. You can now scan the same nonce space again and get different hashes. The coinbase output contains a public key of the miner, thus, the coinbase hash will always be different for different miners.
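
A rough Python sketch of that chain of effects (the coinbase template, transaction hashes, and header pieces below are placeholders invented for illustration; real coinbase serialization has more structure than this):

Code:
import hashlib
from itertools import count

def dsha(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(hashes):
    # Pair up and hash until one hash remains (odd levels duplicate the last entry).
    while len(hashes) > 1:
        if len(hashes) % 2:
            hashes.append(hashes[-1])
        hashes = [dsha(hashes[i] + hashes[i + 1]) for i in range(0, len(hashes), 2)]
    return hashes[0]

# Placeholders for illustration only:
coinbase_template = b"...coinbase fields incl. the miner's pubkey..."
other_txids = [dsha(b"tx1"), dsha(b"tx2")]
header_prefix = b"...version + previous block hash..."  # 36 bytes in reality
header_suffix = b"...timestamp + bits..."               # 8 bytes in reality
TARGET = 2**224                                         # placeholder target

for extranonce in count():
    # Bumping the extranonce changes the coinbase transaction's hash ...
    coinbase_txid = dsha(coinbase_template + extranonce.to_bytes(4, "little"))
    # ... which changes the merkle root ...
    root = merkle_root([coinbase_txid] + other_txids)
    # ... which changes the header, so the whole nonce space yields fresh hashes.
    for nonce in range(2**32):
        header = header_prefix + root + header_suffix + nonce.to_bytes(4, "little")
        if int.from_bytes(dsha(header), "little") < TARGET:
            print("solved:", extranonce, nonce)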
newbie
Activity: 12
Merit: 0
April 07, 2013, 06:13:59 AM
#7
Again, thanks!

> It's a large enough space that you're unlikely to collide with yourself

This is exactly the part where I still don't get it -- sorry, maybe I don't see the forest for the trees, but ...

... after all, the nonce is only a 32-bit unsigned integer, i.e. the largest value it can take on is 4,294,967,295 (decimal). In a massively parallel architecture, this address space gets easily broken down into parts which don't look that intimidating at all. But even in a reasonably performant non-parallel architecture, it doesn't look that big.

In my understanding (which could be wrong, and probably is wrong), the values of the fields which can change (other than the nonce) are the timestamp, the coinbase transaction, and the list of transactions included in the block. However, these fields usually don't get updated while still working on the same block (header) of data (i.e., no new getwork request during that time). So, the only variable in the calculation while working on one block header is the nonce, while the values of all other fields can be treated as static. Is this correct?

Maybe I should add that I still don't fully understand the part about the extraNonce. To my knowledge, there's no such thing as an "extraNonce" field, so what's the role/influence of the extraNonce while a device is hashing away through the nonce's address space? Some education on that would be highly welcome :)

Now, let's assume we have a hashing device with a true hashing performance of 400 MH/s, i.e. one that actually does 400 million complete double-SHA256 hash calculations (including minor stuff like incrementing the nonce, comparing the hash against target, etc.) in hardware _every_second_. Since the data in the block header doesn't change, the maximum number of calculations which need to be performed remains the same: 4,294,967,296, because that's the number of possible values in the nonce's address space. (In more statistical language: the static portion of the block header doesn't add any degrees of freedom, so the total number of distinct inputs equals the number of possible nonce values.) Consequently, that device would be able to try all 4,294,967,296 possible nonce values in a little over 10 seconds.
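
In numbers (a quick Python sanity check):

Code:
nonce_space = 2**32            # 4,294,967,296 possible nonce values
hashrate = 400e6               # 400 MH/s
print(nonce_space / hashrate)  # -> ~10.74 seconds to sweep the whole nonce space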

And this looks so so so completely wrong to me ... it just can't be, because 400 MH/s is nothing special; even existing GPU miners deliver this performance quite easily.

So, please enlighten me -- where's the error in my reasoning?
hero member
Activity: 728
Merit: 500
165YUuQUWhBz3d27iXKxRiazQnjEtJNG9g
April 06, 2013, 07:15:25 PM
#6
You can start from a random nonce.  It's a large enough space that you're unlikely to collide with yourself unless you control a significant percentage of the global hashpower.

Yes, you can divide the start nonces evenly between your ASICs if you want to try to squeeze out the last fraction of a percent.  :)
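
For illustration, carving the nonce space into per-ASIC slices could look like this (just a sketch; a real driver would simply hand each chip a different start nonce):

Code:
NONCE_SPACE = 2**32

def nonce_slices(num_asics: int):
    # Split the 32-bit nonce space into contiguous [start, end) ranges.
    step = NONCE_SPACE // num_asics
    return [(i * step, NONCE_SPACE if i == num_asics - 1 else (i + 1) * step)
            for i in range(num_asics)]

print(nonce_slices(4))
# -> [(0, 1073741824), (1073741824, 2147483648),
#     (2147483648, 3221225472), (3221225472, 4294967296)]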
newbie
Activity: 12
Merit: 0
April 06, 2013, 06:57:18 PM
#5
Thanks again,

I'm really kind of baffled -- if it's really THAT easy ...

Anyway, one more question:

> * get the data and target with getwork
> * upload it to the ASIC over SPI

Wouldn't it be necessary (or at least a good idea) to also send the nonce to the ASIC? I mean, in a parallel architecture you wouldn't want all ASICs to start their work with the same initial value of the nonce; rather, you'd want to split the "address space" (well, kind of, for lack of a better term) of the nonce into as many slices as you have ASICs in your system, and tell each ASIC to work on its own particular slice.

Or am I mistaken?
hero member
Activity: 728
Merit: 500
165YUuQUWhBz3d27iXKxRiazQnjEtJNG9g
April 06, 2013, 06:13:16 PM
#4
It's actually easier than that.  Your software will just:

* get the data and target with getwork
* upload it to the ASIC over SPI
* wait for an interrupt
* read out the nonce over SPI
* return it with submitblock

Bitcoind does all the rest of the work: talking to the network, creating and distributing blocks, everything involving wallets, etc.
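
A sketch of that loop in Python -- the getwork call mirrors bitcoind's (since removed) getwork RPC, while spi_send, wait_for_irq, spi_read_nonce, and insert_nonce are hypothetical stand-ins for whatever your driver provides; HTTP basic auth is omitted for brevity:

Code:
import json, urllib.request

RPC_URL = "http://127.0.0.1:8332"  # bitcoind's RPC port; a real setup also needs basic auth

def rpc(method, *params):
    # Minimal JSON-RPC client for talking to bitcoind.
    body = json.dumps({"method": method, "params": list(params), "id": 1}).encode()
    req = urllib.request.Request(RPC_URL, body, {"Content-Type": "application/json"})
    return json.load(urllib.request.urlopen(req))["result"]

while True:
    work = rpc("getwork")                       # {"data": ..., "target": ...}
    spi_send(work["data"], work["target"])      # hypothetical: upload over SPI
    wait_for_irq()                              # hypothetical: block until the ASIC flags success
    nonce = spi_read_nonce()                    # hypothetical: read the winning nonce back
    solved = insert_nonce(work["data"], nonce)  # hypothetical: splice the nonce into the data
    rpc("getwork", solved)                      # historically, solved work went back via getwork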
newbie
Activity: 12
Merit: 0
April 06, 2013, 05:30:15 PM
#3
Thanks a lot, this is very interesting information.

1) With regard to "Mining tools are specifically written to speak the protocol the ASIC units expect", how would that "usually" be done?

Since many mining clients appear to be written in Python (well, this is my personal impression), let's stick to this as an example. Let's also assume we have an ASIC which "speaks" the UART (or I2C, or SPI, or ...) protocol, and is designed in such a way that all it does is read the 640-bit (80-byte) block header, current target, and nonce from a "load register", do its "double-SHA256 magic", and compare the result (the computed hash) against the current target. If result > current target, the nonce is incremented by 1, and the double-SHA256 magic is done again. If result < current target, the result is written to a "store register" and the ASIC signals some kind of success flag.

In the given example, am I right to assume that it would be sufficient to enhance the code of the mining client so that it can communicate with the ASIC using said UART (I2C/SPI) protocol? If so, it's a trivial task, because this can be easily done with e.g. pyserial (or the MPSSE wrapper in case of I2C/SPI).

It would boil down to something like:
* mining client gets a block from the network (getwork or getblocktemplate)
* mining client strips the block header and the target from the block (plus some little/big endian stuff)
* mining client writes the block header, the target, and the desired start value of the nonce to the ASIC's load register
* mining client signals the ASIC to start processing
* mining client waits for the ASIC to signal success
* mining client reads the result from the ASIC's store register, and sends it to the network

OK, I left out some exception handling stuff like stale block handling and such, but in principle ...
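
With pyserial, the device-facing part of that list can indeed be tiny. A sketch, where the port name, the framing, and the prepared buffers (header_bytes, target_bytes) are all invented for the example -- a real ASIC's datasheet would define its own protocol:

Code:
import serial  # pyserial

# header_bytes: the 76 header bytes excluding the nonce field; target_bytes: 32 bytes.
# Both are assumed to have been prepared (endianness included) by the mining client.
with serial.Serial("/dev/ttyUSB0", 115200, timeout=60) as dev:
    dev.write(header_bytes)                      # load register: block header (minus nonce)
    dev.write(target_bytes)                      # load register: current target
    dev.write((0).to_bytes(4, "little"))         # desired start value of the nonce
    reply = dev.read(4)                          # blocks until success is signaled (or timeout)
    if reply:
        nonce = int.from_bytes(reply, "little")  # the winning nonce, ready to submit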

My problem: it's hard for me to believe that it's really that simple, so the question is, where am I wrong, and what am I missing?

2) full verification vs. simplified verification

> although existing implementations expect to be able to load the wallet into RAM

OK, do I understand it right: it's sufficient to keep a single instance of the wallet in RAM -- it's not required to keep an instance of the wallet in RAM _per_SHA256_worker_thread_ -- is this correct? If so, it shouldn't be a problem at all, since gigabytes of memory are a dime a dozen these days (well, almost ;) )

Again, thanks a lot in advance for your help!
legendary
Activity: 1526
Merit: 1134
April 06, 2013, 03:41:26 PM
#2
Mining ASICs/FPGAs only handle the double SHA256 (and a few other things, like incrementing the nonce so they have data to hash).  Mining tools are specifically written to speak the protocol the ASIC units expect.
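
Behaviorally, that's the whole job of the chip. A software model of the inner loop (this is what the silicon implements, not how it's organized internally):

Code:
import hashlib

def dsha(data: bytes) -> bytes:
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def asic_core(header_prefix: bytes, target: int, start_nonce: int = 0):
    # header_prefix: the 76 header bytes that stay fixed while the chip works.
    for nonce in range(start_nonce, 2**32):
        digest = dsha(header_prefix + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "little") < target:
            return nonce  # success: report the winning nonce to the host
    return None           # nonce space exhausted: the host must supply a new header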

"The Bitcoin algorithm" is a large and complex set of algorithms that work together. Memory requirements depend on what exactly you're doing - you can use Bitcoin in two modes, full verification and simplified verification. In the latter memory requirements are low enough to run on low end smartphones, although existing implementations expect to be able to load the wallet into RAM, so if you receive lots of tiny payments your memory usage can end up being too high.
newbie
Activity: 12
Merit: 0
April 06, 2013, 03:30:59 PM
#1
Hi all,

I'm trying to delve into some technical aspects of Bitcoin, and there's still a lot that ranges anywhere from vague to completely unclear. I'm learning, so, if my questions have already been asked (and answered) before, seem stupid, or don't make any sense to you, please enlighten and bear with me.

1) How does integration of hardware acceleration (like e.g. FPGA or ASIC devices) into the "mining process" work? How does the mining client become "aware" of such hardware acceleration? In other words, what kind of "mechanism" would a hardware acceleration device need to provide to be recognized by a mining client?

2) What kind of functionality (with regard to the Bitcoin algorithm) do such hardware acceleration devices usually provide? Do they only speed up the double-SHA256 calculation, or do they (need to) implement other steps in the Bitcoin algorithm as well? If so, which ones? (It goes without saying that only functionality which is supposed to never ever change should be done in hardware.)

3) Which factors influence the memory demand of the Bitcoin algorithm? Does, for instance, the continuously growing size of the blockchain have an effect on the amount of memory (RAM, not disk) that is required to run the Bitcoin algorithm? Is it possible to calculate (or at least guesstimate) the maximum amount of memory (i.e. an upper threshold) that would never be exceeded when running the Bitcoin algorithm?

I really hope all this doesn't sound just confusing. In advance, thank you all so much for your help.