Regarding the theoretical maximum Performance of GPUs

CanaryInTheMine

donator

Activity: 2366

Merit: 1060

between a rock and a block!

Quote from: ArtForz on August 20, 2011, 01:59:55 PM

You realize a bitcoin hash is *2* sha256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H), and lose part of the previous round (you only need the E output of the 4th-to-last)
Initial rounds can be optimized as well, as the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there as well.
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a bit of constants.
Register access is basically free on GPUs (they mask reg r/w by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.

ArtForz, you are quite the legend on these forums... Glad to see you here!

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Well, thanks for the pointers, and excuse my noobish rants. I'll be back once I understand what is being said.

ArtForz

sr. member

Activity: 406

Merit: 257

You realize a bitcoin hash is *2* sha256 block operations, right?
Well, not exactly 2, thanks to some possible optimizations:
you can drop the last 3 rounds completely (they don't change H), and lose part of the previous round (you only need the E output of the 4th-to-last)
Initial rounds can be optimized as well, as the last DWORD of hMerkleRoot and nTime/nBits don't change between loops, so you can drop the equivalent of ~3 rounds there as well.
Same thing goes for optimizing/precalculating parts of the W mangling, as we're feeding in quite a bit of constants.
Register access is basically free on GPUs (they mask reg r/w by pipelining 4 "threads" on the shader pipeline).
Ch() can be done in 1 cycle, and Maj() in 2.
Also, what LUT accesses? Just hardcode the K constants in the instruction stream.

So while you came up with a somewhat reasonable result, you did so by pure chance using invalid assumptions and numbers.
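
Two of the points above can be checked with a short Python sketch. It is only an illustration, not any miner's actual kernel: the helper names (bitselect, ch_fast, maj_fast, sha256_block_count, double_sha256) are made up for this example, and the assumption that "Ch() in 1 cycle, Maj() in 2" refers to a bit-select style instruction is a guess rather than something stated in the post. The sketch shows that an 80-byte header needs 2 SHA-256 blocks and the 32-byte second hash needs 1, and that Ch and Maj can each be written with one bit-select (plus one xor for Maj).

```python
import hashlib

MASK32 = 0xFFFFFFFF  # SHA-256 works on 32-bit words


def bitselect(mask, x, y):
    """Take bits of x where mask is 1 and bits of y where mask is 0.

    Assumption: stands in for a single hardware bit-select instruction.
    """
    return ((mask & x) | (~mask & y)) & MASK32


def ch_reference(e, f, g):
    """Ch as defined in the SHA-256 specification."""
    return ((e & f) ^ (~e & g)) & MASK32


def maj_reference(a, b, c):
    """Maj as defined in the SHA-256 specification."""
    return ((a & b) ^ (a & c) ^ (b & c)) & MASK32


def ch_fast(e, f, g):
    """Ch is exactly one bit-select: e chooses between f and g."""
    return bitselect(e, f, g)


def maj_fast(a, b, c):
    """Maj as one xor plus one bit-select.

    Where a and b agree, the majority is b; where they differ, c decides.
    """
    return bitselect(a ^ b, c, b)


def sha256_block_count(message_len):
    """Number of 64-byte blocks SHA-256 processes for a message of this length.

    Padding adds one 0x80 byte and an 8-byte length field, rounded up to 64.
    """
    return (message_len + 1 + 8 + 63) // 64


def double_sha256(data):
    """SHA-256 applied twice, as used for Bitcoin block headers."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


if __name__ == "__main__":
    print("blocks for 80-byte header:", sha256_block_count(80))
    print("blocks for 32-byte digest:", sha256_block_count(32))
    print("double hash of 80 zero bytes:", double_sha256(bytes(80)).hex())
```

The nonce sits in the last 4 bytes of the 80-byte header, so the first 64-byte block does not change while the nonce is varied and its result can be computed once. That leaves the header's second block plus the single block of the second hash, which is where the figure of 2 block operations per attempt comes from.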

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

Quote from: CanaryInTheMine on August 20, 2011, 01:24:27 PM

take a look at:

https://bitcointalksearch.org/topic/theoretical-limit-on-hashing-speed-33817

Thanks, that's nearly the same result; obviously I forgot some things that are architecture-specific to the cards.

CanaryInTheMine

donator

Activity: 2366

Merit: 1060

between a rock and a block!

take a look at:

https://bitcointalksearch.org/topic/theoretical-limit-on-hashing-speed-33817

ElectricMucus

legendary

Activity: 1666

Merit: 1057

Marketing manager - GO MP

I am interested in how effectively the ALUs on a card can be utilized, so I did some calculations:

Considering the SHA256 loop, there are the following things in there:

Operations:
1 not, 5 and, 7 xor
6 rotations by 2, 13, 22 and 6, 11, 25

makes 19

32-bit words (register accesses):
5 A, 2 B, 5 E, 1 F, 1 G

makes 14

4 additions, 2 LUT accesses
8 Memory accesses, 2 extra additions

makes 16
---------
49 total
run 64 times
------
3136 cycles

5970 with 3200 ALUs at 725 MHz:

3200 * 725 / 3136 = 739.795918 Mhash/s

Is this calculation correct or is there more/less done on the gpu?

Because according to this, the code utilization would be nearly optimal, which makes claims of awesome optimizations dubious (ArtForz's entry on the wiki, for example).
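
The arithmetic in this post can be re-run with a small Python sketch. It only reproduces the post's own counts and its implicit assumptions (every counted item costs one cycle, each ALU retires one operation per clock, and a hash is a single 64-round pass); it is not a validated model of the hardware. The function and variable names are made up for this example.

```python
# Per-round counts exactly as listed in the post.
LOGIC_OPS = 1 + 5 + 7 + 6          # not, and, xor, rotations -> 19
REGISTER_ACCESSES = 5 + 2 + 5 + 1 + 1  # A, B, E, F, G -> 14
ADD_AND_MEMORY = 4 + 2 + 8 + 2     # additions, LUT, memory, extra additions -> 16

ROUNDS = 64  # rounds in one SHA-256 block operation


def cycles_per_hash(rounds=ROUNDS):
    """Total cycles per hash under the post's one-cycle-per-item assumption."""
    per_round = LOGIC_OPS + REGISTER_ACCESSES + ADD_AND_MEMORY
    return per_round * rounds


def estimated_mhash(alus, clock_mhz, cycles):
    """Hash rate in Mhash/s if every ALU does one operation per clock."""
    return alus * clock_mhz / cycles


if __name__ == "__main__":
    cycles = cycles_per_hash()
    print("cycles per hash:", cycles)
    print("estimate for 3200 ALUs at 725 MHz:", estimated_mhash(3200, 725, cycles))
```

Changing any of the counts or the number of rounds shows how sensitive the final figure is to the per-round cost.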
