Topic: limits of ZEC mining (Read 10069 times)

sr. member
Activity: 588
Merit: 251
November 05, 2017, 09:38:40 AM
#62
Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.

I don't think this is correct.  See the diagram below from the JEDEC GDDR5 specification showing gapless reads from a single row; half the command slots are NOPs (particularly at time=T1).  These slots can be used to send ACTIVATE commands to other banks, so as long as you distribute your workload across the chip's banks (and observe that annoying tFAW or find a chip that doesn't really need it) you can indeed do totally random reads at the full pin bitrate and the only requirement is that the read size is at least 32bits*8wordburst=256bits=32bytes by using AUTO PRECHARGE to eliminate the explicit PRECHARGE command on the command bus.

In particular, the command:burst ratio is 2:1 for 8word bursts, not 1:1.  Maybe you missed the fact that the address pins are DDR in GDDR5?

I have implemented a system that does exactly this (stealing that NOP slot to ACTIVATE a different bank) on DDR2 and it works.  Granted DDR2 is not GDDR5, but the idea that an ACTIVATE-READ command pair ought to use the same amount of command bus time as the data it procures is something JEDEC has worked to preserve across many generations of memory with a wide variety of burst lengths and timings.  I don't think GDDR5 would depart from this lightly.

Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together.  GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs.  I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.  





I've been focused on non-mining projects over the summer, but now I'm getting ready to fire up a couple rigs and get back into mining.
I looked into auto precharge back in the spring, and couldn't find anything in the timing straps that relates to auto precharge.  I'm guessing it is part of the memory controller firmware.
Also don't forget that with AMD GCN, each channel is 64 bits wide (2 GDDR5 chips), so the minimum random read is effectively 64 bytes.  And since there seems to be no way to control auto precharge, the minimum full-speed read is 128 bytes.  Note that 128 bytes is exactly the size of a DAG read in ethash (Ethereum).

Writes are more complicated, since Marc Bevand found that the GCN memory controller is capable of writing to just one of the two GDDR chips in a channel when one half of a cache line is not dirty.
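
The granularity figures above can be checked with a short calculation. This is only a sketch: it assumes a burst length of 8 words (as in the quoted post) and reads the 128-byte figure as two back-to-back bursts on a 64-bit channel. The function name is made up for illustration.

```python
# Sketch of the access-granularity arithmetic; names are illustrative only.
BURST_LENGTH = 8  # words per GDDR5 burst, as stated in the quoted post


def min_read_bytes(channel_width_bits: int, bursts: int = 1) -> int:
    """Bytes delivered by `bursts` consecutive bursts on one channel."""
    return channel_width_bits * BURST_LENGTH * bursts // 8


if __name__ == "__main__":
    print(min_read_bytes(32))            # one 32-bit GDDR5 chip
    print(min_read_bytes(64))            # one 64-bit GCN channel (2 chips)
    print(min_read_bytes(64, bursts=2))  # two bursts, the size of an ethash DAG read
```
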
jr. member
Activity: 74
Merit: 1
September 13, 2017, 10:19:28 PM
#61
Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together.  GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs.  I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.  



I agree.

Graphics rendering mostly operates as embarrassingly parallel convolutions on texture/framebuffer patches.  Relatively random memory access (in VRAM) is much slower than on a CPU. 

In addition, I find that in my OpenCL projects, random reads can often fall back into serialized fetches, which stall the CU as it waits for a few threads in a wavefront to catch up with the rest.
full member
Activity: 129
Merit: 100
September 13, 2017, 08:07:36 PM
#60
Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.

I don't think this is correct.  See the diagram below from the JEDEC GDDR5 specification showing gapless reads from a single row; half the command slots are NOPs (particularly at time=T1).  These slots can be used to send ACTIVATE commands to other banks, so as long as you distribute your workload across the chip's banks (and observe that annoying tFAW or find a chip that doesn't really need it) you can indeed do totally random reads at the full pin bitrate and the only requirement is that the read size is at least 32bits*8wordburst=256bits=32bytes by using AUTO PRECHARGE to eliminate the explicit PRECHARGE command on the command bus.

In particular, the command:burst ratio is 2:1 for 8word bursts, not 1:1.  Maybe you missed the fact that the address pins are DDR in GDDR5?

I have implemented a system that does exactly this (stealing that NOP slot to ACTIVATE a different bank) on DDR2 and it works.  Granted DDR2 is not GDDR5, but the idea that an ACTIVATE-READ command pair ought to use the same amount of command bus time as the data it procures is something JEDEC has worked to preserve across many generations of memory with a wide variety of burst lengths and timings.  I don't think GDDR5 would depart from this lightly.

Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together.  GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs.  I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.  
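
The command-slot argument can be sketched as a simple counting model. It assumes only what is stated above: two command slots per 8-word burst, and one command-bus slot per ACTIVATE, READ, or explicit PRECHARGE. It ignores tFAW, tRRD and every other timing constraint, so it gives an upper bound rather than a prediction. The function name is hypothetical.

```python
# Counting model of command-bus occupancy; an upper bound, not a timing simulation.
COMMAND_SLOTS_PER_BURST = 2  # the 2:1 command:burst ratio claimed above


def max_bus_utilization(commands_per_read: int, bursts_per_read: int = 1) -> float:
    """Best-case fraction of data-bus time that can carry data."""
    slots = COMMAND_SLOTS_PER_BURST * bursts_per_read
    return min(1.0, slots / commands_per_read)


if __name__ == "__main__":
    # ACTIVATE + READ with auto precharge: two commands per single-burst read.
    print(max_bus_utilization(2))
    # ACTIVATE + READ + explicit PRECHARGE: three commands per single-burst read.
    print(max_bus_utilization(3))
    # Same three commands spread over a two-burst read.
    print(max_bus_utilization(3, bursts_per_read=2))
```

In this model a single-burst random read only reaches full rate when auto precharge removes the third command, while a two-burst read has a spare slot either way.
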



full member
Activity: 129
Merit: 100
July 26, 2017, 11:03:24 AM
#59
A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.

Sorry about the necropost here, but isn't the dataset for ZEC more than 32MB?  You wrote:

Specifically, 2 million pseudo-random numbers are generated using blake2b (see http://blake2.net/).  Each of these numbers is 200 bits (25 bytes), and they are sorted

200bits*2million = 400mbits = 50mbytes, right?

If I got the math wrong, or if there is some trick for halving the storage requirements, I would be happy to be corrected.  Thanks!
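
The arithmetic in the question can be laid out as below. The 25-byte case reproduces the 50 MB figure. The 16-byte case is one possible reading of the 32MB number, based on the "16 bytes per record" working set mentioned in the quoted post; that link is an assumption, not something the thread confirms.

```python
# Working-set size for 2 million records; the record count is the thread's round figure.
RECORDS = 2_000_000


def dataset_bytes(records: int, bytes_per_record: int) -> int:
    """Total bytes needed to hold `records` entries of the given size."""
    return records * bytes_per_record


if __name__ == "__main__":
    print(dataset_bytes(RECORDS, 25) / 1e6, "MB at 25 bytes (200 bits) per record")
    print(dataset_bytes(RECORDS, 16) / 1e6, "MB at 16 bytes per record")
```
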
newbie
Activity: 16
Merit: 0
April 25, 2017, 08:55:12 PM
#58
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

You can't get 720 SOL from a 1080; you must mean the 1080 Ti or Titan X Pascal.
full member
Activity: 224
Merit: 100
CryptoLearner
March 21, 2017, 01:25:59 PM
#57
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.


Nvidia is definitely better than AMD on everything, except for Ethash

I don't have NV cards, but XMR doesn't seem to be their favorite either from hashrates I saw. Am I wrong?

It's more like the XMR miners haven't been updated in quite some time, since XMR hasn't been profitable on Nvidia cards for half a year at least, if ever.
full member
Activity: 192
Merit: 100
March 21, 2017, 11:28:03 AM
#56
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.


Nvidia is definitely better than AMD on everything, except for Ethash

I don't have NV cards, but XMR doesn't seem to be their favorite either from hashrates I saw. Am I wrong?
hero member
Activity: 653
Merit: 500
March 21, 2017, 09:54:57 AM
#55
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.


Nvidia is definitely better than AMD on everything, except for Ethash
newbie
Activity: 54
Merit: 0
March 21, 2017, 09:19:35 AM
#54
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.
sr. member
Activity: 728
Merit: 256
NemosMiner-v3.8.1.3
March 21, 2017, 02:52:10 AM
#53
So it's time for nvidia.  1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!
sr. member
Activity: 588
Merit: 251
March 20, 2017, 07:12:48 PM
#52

Why is it MY problem? :P

I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

Just looking back at this now that I've finally written a strap decoder. Are you sure you're seeing tRFC too high?  I think tRFC too low might cause a bit of a performance hit mining since no commands can be processed during the refresh.

"The REFRESH command is used during normal operation of the GDDR5 SGRAM. The command is non persistent, so it must be issued each time a refresh is required. A minimum time tRFC is required between two REFRESH commands. The same rule applies to any access command after the refresh operation. All banks must be precharged prior to the REFRESH command.
The refresh addressing is generated by the internal refresh controller. This makes the address bits "Don't Care" during a REFRESH command. The GDDR5 SGRAM requires REFRESH cycles at an average periodic interval of tREFI(max). The values of tREFI for different densities are listed in Table 6. To allow for improved efficiency in scheduling and switching between tasks, some flexibility in the absolute refresh interval is provided. A maximum of eight REFRESH commands can be posted to the GDDR5 SGRAM, and the maximum absolute interval between any REFRESH command and the next REFRESH command is 9 * tREFI."
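
As a rough way to judge how much tRFC matters, the spec text above can be turned into two small formulas. This is a sketch that assumes one REFRESH per tREFI on average and that the device is unavailable for tRFC each time. The timing values in the example are placeholders chosen for easy arithmetic, not datasheet numbers.

```python
# Refresh cost model derived from the quoted spec text; example values are placeholders.
MAX_INTERVAL_MULTIPLIER = 9  # "maximum absolute interval ... is 9 * tREFI"


def refresh_overhead(t_rfc_ns: float, t_refi_ns: float) -> float:
    """Fraction of time spent refreshing, assuming one REFRESH per tREFI."""
    return t_rfc_ns / t_refi_ns


def max_refresh_gap_ns(t_refi_ns: float) -> float:
    """Longest allowed gap between two REFRESH commands."""
    return MAX_INTERVAL_MULTIPLIER * t_refi_ns


if __name__ == "__main__":
    # Placeholder timings: tRFC = 100 ns, tREFI = 2000 ns.
    print(refresh_overhead(100.0, 2000.0))
    print(max_refresh_gap_ns(2000.0))
```

In this model the overhead scales linearly with tRFC, so a modest change to tRFC moves throughput by only a small fraction.
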
sr. member
Activity: 588
Merit: 251
December 27, 2016, 09:32:21 AM
#51
After looking at some of zawawa's miner code, I realized using 16-byte data structures instead of 32 for the last few rounds would reduce the average number of core cycles required from 5 million per round to 4.5 million.  Also, since the current parameters result in an average of 1.9 solutions per equihash iteration, with early pruning of duplicates the number of entries per round in the final rounds will be closer to 1.9 million than the 2 million I used in my calculations.  Together these tweaks mean the upper limit for an OpenCL implementation on a Rx 470 (1250/1750 clocks) would be around 220 sol/s, with the core clock being the primary limiting factor.  It is also possible to optimize the final round where the collision search is 40 bits instead of 20, and this could push the performance limit close to 230 sol/s.

As I've previously explained, the memory clock (and timing) would be the primary limiting factor in a GCN assembler implementation.  Based on my recent memory benchmarking, the performance limit would be ~280 sol/s on a Rx 470.

I've also looked for shortcuts to the equihash algorithm and remain convinced that no significant (i.e. doubling of speed) optimizations can be made to the algorithm itself.
sr. member
Activity: 588
Merit: 251
December 09, 2016, 01:59:32 PM
#50
I just finished doing some benchmarking, and have discovered that sequential write performance on AMD is 25-35% slower than sequential read performance.  Read performance is close to expectations for Hawaii and Tonga, achieving ~ 90% of GDDR5 peak bandwidth.  Pitcairn performance is about 10% slower, which suggests its memory controller is less sophisticated than the later GCN parts.
I'll publish my benchmark code after a bit more cleanup.  In the meantime I can recommend another benchmark tool for testing the read bandwidth: clpeak.
https://github.com/krrishnarraj/clpeak

My benchmark reports read bandwidth that agrees with clpeak, so the write bandwidth numbers I'm seeing should be just as reliable.

edit: benchmark code posted to github.  Detailed writeup to follow.
https://github.com/nerdralph/cl-mem
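
For reference, peak GDDR5 bandwidth follows from the per-pin data rate and the bus width. The sketch below reproduces the 224GB/s (Rx 470) and 384GB/s (390x) figures quoted elsewhere in this thread. The bus widths and the 6Gbps rate for the 390x are assumptions chosen to match those figures, and the 90% efficiency is the rough read result reported above.

```python
# Peak and expected bandwidth; bus widths below are assumptions, not thread data.
def peak_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: per-pin rate times pin count, divided by 8 bits."""
    return data_rate_gbps * bus_width_bits / 8


def expected_read_gbs(peak_gbs: float, efficiency: float = 0.9) -> float:
    """Read bandwidth at the ~90% of peak reported for Hawaii and Tonga."""
    return peak_gbs * efficiency


if __name__ == "__main__":
    rx470 = peak_bandwidth_gbs(7, 256)  # assumed 256-bit bus
    r390x = peak_bandwidth_gbs(6, 512)  # assumed 512-bit bus at 6Gbps
    print(rx470, r390x)
    print(expected_read_gbs(r390x))
```
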
sr. member
Activity: 588
Merit: 251
November 26, 2016, 03:20:48 PM
#49
While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus.  On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock.  In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds).  All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache.  A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction.  This means a modified ht_store thread could update a row slot in 2 clocks.  If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block.  This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory ratio of approximately 5:6.  In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth.  There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU.  However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group.  There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler.  Canis lupus?


This is what I was trying to achieve in that snippet I sent you a while ago, the coalesced write thing. I just lacked all the theory behind it :D

I tried a simple mod to ht_store to reduce the number of writes, aiming for flat_store_dwordx3 and flat_store_dwordx4.  I couldn't use a straight uint3 vector type, because a uint3 pointer has to be 16-byte aligned:
http://www.informit.com/articles/article.aspx?p=1732873&seqNum=3

Structs only have to be aligned to the largest type, so using this compiles and executes OK:
Code:
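// Wrap the uint3 in a struct so the store target is not a bare uint3 pointer.
typedef struct
{
    uint3   uints;
} ui3;

    // Inside ht_store: pack "i" and the two 32-bit halves of xi0.
    ui3 buf1;
    buf1.uints.s0 = i;
    buf1.uints.s1 = xi0;        // low 32 bits of xi0
    buf1.uints.s2 = xi0>>32;    // high 32 bits of xi0
    // store "i" (always 4 bytes before Xi) and xi0
    *(__global ui3 *)(p - 4) = buf1;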

However the generated code is one flat_store_dword and one flat_store_dwordx2.  I tried 3 individual uints in the struct, and that resulted in 3 flat_store_dword instructions.
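
To put numbers on what is at stake, here is a small sketch using only figures from the quoted post: at most 16 bytes per store instruction (dwordx4) and 64 bytes per clock into the memory controller. It computes the ideal instruction count for a payload and how much of a 64-byte transfer a small write actually uses. The function names are illustrative.

```python
# Store-width arithmetic from the figures quoted in this thread.
MAX_BYTES_PER_STORE = 16  # one dwordx4 store per thread
CACHE_LINE_BYTES = 64     # one cache line per clock per memory controller


def min_store_instructions(payload_bytes: int) -> int:
    """Fewest store instructions needed if the widest store is always used."""
    return -(-payload_bytes // MAX_BYTES_PER_STORE)  # ceiling division


def line_utilization(bytes_written: int) -> float:
    """Fraction of a 64-byte transfer occupied by one write."""
    return bytes_written / CACHE_LINE_BYTES


if __name__ == "__main__":
    print(min_store_instructions(12))  # the 12-byte "i" + xi0 payload above
    print(line_utilization(4), line_utilization(8))
```

The 12-byte payload would ideally need one store, against the two that the compiler emitted.
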
sr. member
Activity: 449
Merit: 251
November 24, 2016, 02:08:33 PM
#48
CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct.  See my Nov 20 post about data written to half a cache line.  That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250 MHz core clock being the limiting factor rather than the memory (described in my Nov 22 post).
My Sapphires can go up to 1406 and 2250. Will that make any difference?
The 480 8G does close to 200 with 1300 and 2000.  Some report as high as 215-220, but I don't like to push core/memory too much; I want the cards to last a long while, so I just undervolt and change memory straps.  Memory straps seem to give less of a boost the more optimized the miner is. I have mostly 470s though, doing 175-185.
sr. member
Activity: 652
Merit: 266
November 24, 2016, 12:10:21 PM
#47
CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct.  See my Nov 20 post about data written to half a cache line.  That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250 MHz core clock being the limiting factor rather than the memory (described in my Nov 22 post).
My Sapphires can go up to 1406 and 2250. Will that make any difference?
sr. member
Activity: 588
Merit: 251
November 24, 2016, 12:03:30 PM
#46
CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct.  See my Nov 20 post about data written to half a cache line.  That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250 MHz core clock being the limiting factor rather than the memory (described in my Nov 22 post).
sr. member
Activity: 449
Merit: 251
November 24, 2016, 09:56:33 AM
#45
I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.


Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0.  Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM.  He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470.  The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470.  I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.
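
The two ratios can be checked directly. The sketch below also scales the 173 sol/s figure by memory bandwidth, which assumes performance is purely bandwidth-bound and scales linearly; that is the premise of the argument, not a measured result.

```python
# Ratio check for the 390x vs Rx 470 comparison; linear scaling is an assumption.
def scaled_limit(base_sols: float, base_bw_gbs: float, target_bw_gbs: float) -> float:
    """Scale a sol/s limit by the ratio of memory bandwidths."""
    return base_sols * target_bw_gbs / base_bw_gbs


if __name__ == "__main__":
    print(round(300 / 173, 2))  # claimed performance ratio
    print(round(384 / 224, 2))  # memory bandwidth ratio
    print(round(scaled_limit(173, 224, 384), 1))
```
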


Hmm, I was thinking 390x had 320GB/s memory bandwidth, I guess that's the 290x.  It seemed right since 390x isn't much faster than 480 at Ethereum.  It's about 27% faster Ethereum, and 27% faster Ethash (as of CM 7.0, and many revisions before, this is general trend).  Both algorithms are affected by timings, though less so with newer CM versions.  Your estimate seems reasonable, I am just speculating. I don't do side projects programming these days, developing hand issues, so I can just speculate :(

CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.
sr. member
Activity: 588
Merit: 251
November 23, 2016, 12:33:23 PM
#44
I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

That seems peculiar.  A 128-byte DAG read is 2 cache lines, and would take 4 clocks (2 clocks/burst).  If tRRD is 4, then the ACTIVATE command rate can keep up to the DAG reads.  If it is 5, then you'd have a 20% performance hit.  Maybe a bit less since 1 of every 8 ACTIVATE commands should be to the same bank rather than a different one.  What offset is tRRD anyway?

edit: after giving this some more thought, I think I might know what's going on.  If tRRD is 5, with a large queue of read requests, the memory controller could group multiple page open commands to the same bank, so that most of the time tRRD is not a factor.  It still needs to read from all 8 banks, so occasionally it has to wait 5 clocks instead of 4 for an ACTIVATE.
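
The 20% figure comes from a simple rate comparison, sketched below. It models the worst case in which every DAG read needs its own ACTIVATE, spaced tRRD clocks apart, and ignores the same-bank grouping described in the edit above. The function name is hypothetical.

```python
# Worst-case ACTIVATE-rate model: one ACTIVATE per read, spaced tRRD clocks apart.
def activate_limited_rate(read_clocks: int, t_rrd: int) -> float:
    """Fraction of peak read rate sustainable when ACTIVATEs are the bottleneck."""
    return min(1.0, read_clocks / t_rrd)


if __name__ == "__main__":
    # A 128-byte DAG read takes 4 clocks (2 bursts at 2 clocks each).
    print(activate_limited_rate(4, 4))  # tRRD = 4: ACTIVATEs keep up
    print(activate_limited_rate(4, 5))  # tRRD = 5: the 20% hit
```
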
hero member
Activity: 751
Merit: 517
Fail to plan, and you plan to fail.
November 23, 2016, 01:36:48 AM
#43
Writing tools to work with this shit also is taking up time.

Eagerly Awaited, Good Sir.
hero member
Activity: 924
Merit: 1000
November 22, 2016, 07:39:50 PM
#42
While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus.  On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock.  In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds).  All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache.  A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction.  This means a modified ht_store thread could update a row slot in 2 clocks.  If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block.  This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory ratio of approximately 5:6.  In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth.  There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU.  However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group.  There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler.  Canis lupus?


Why is it MY problem? :P

I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

I did this for about 2 months with the radeon 480 cards.   Was a very time consuming process that did yield some interesting results.

I am now looking into miner development.  I think there is a bigger picture here, but that is for private discussion.
sr. member
Activity: 438
Merit: 250
November 22, 2016, 04:46:40 PM
#41
While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus.  On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock.  In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds).  All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache.  A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction.  This means a modified ht_store thread could update a row slot in 2 clocks.  If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block.  This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory ratio of approximately 5:6.  In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth.  There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU.  However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group.  There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler.  Canis lupus?


This is what I was trying to achieve in that snippet I sent you a while ago, the coalesced write thing. I just lacked all the theory behind it :D
sr. member
Activity: 728
Merit: 304
Miner Developer
November 22, 2016, 03:53:07 PM
#40
This is juicy stuff! Thanks, nerdralph!
sr. member
Activity: 588
Merit: 251
November 22, 2016, 03:15:08 PM
#39
While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus.  On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock.  In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds).  All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache.  A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction.  This means a modified ht_store thread could update a row slot in 2 clocks.  If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block.  This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory ratio of approximately 5:6.  In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth.  There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU.  However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group.  There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.
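
The two core:memory ratios above translate into required core clocks as follows. This is plain arithmetic on the figures in the post; the function name is made up for illustration.

```python
# Required core clock for a given core:memory clock ratio.
def required_core_mhz(memory_mhz: float, core_parts: int, memory_parts: int) -> float:
    """Core clock needed to keep up with memory at ratio core_parts:memory_parts."""
    return memory_mhz * core_parts / memory_parts


if __name__ == "__main__":
    print(required_core_mhz(1750, 5, 6))  # row counters in global memory
    print(required_core_mhz(1750, 1, 2))  # row counters kept in LDS or GDS
```

At the 1:2 ratio, a core clock of 875 MHz would be enough for 1750 MHz memory, which is why keeping the counters on-chip would let the external memory bandwidth become the limit again.
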

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler.  Canis lupus?
sr. member
Activity: 588
Merit: 251
November 20, 2016, 11:18:41 AM
#38
You are incorrect.  Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips).  The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. Ok. Misconception clarified :)

Marc did some testing following this post, and determined that while reads result in 32 bytes being read from each of the two GDDR5 memory channels to fill a 64-byte cache line, writes are different.  When data is only written to half of a cache line (32 bytes), due to the dirty byte mask the controller knows only one of the 2 GDDR5 memory channels is affected, and so will only write to one of them.  However this does not mean the write bandwidth is double what I originally calculated, as writing 2 32-byte chunks of memory to the memory controller requires 2 core clocks.   This would require the GPU core to be clocked the same as the memory, i.e. 2 GHz for a Rx 480 with 8Gbps RAM.  This is due to the core:memory clock ratio limit I described here:
https://bitcointalksearch.org/topic/amd-corememory-clock-ratio-fundamental-limit-05-1682003
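
The clock requirement can be sketched numerically. The sketch assumes the GDDR5 memory clock is one quarter of the per-pin data rate, which matches the 7Gbps/1750 MHz and 8Gbps/2 GHz pairs used in this thread, and takes the 0.5 ratio for full cache line transfers from the title of the linked topic. The function names are illustrative.

```python
# Core clock needed to saturate memory with full-line vs half-line writes.
def memory_clock_mhz(data_rate_gbps: float) -> float:
    """GDDR5 memory clock, assumed to be one quarter of the per-pin data rate."""
    return data_rate_gbps * 1000 / 4


def required_core_mhz(data_rate_gbps: float, half_line_writes: bool) -> float:
    """Half-line (32-byte) writes need core = memory clock; full lines need half."""
    ratio = 1.0 if half_line_writes else 0.5
    return memory_clock_mhz(data_rate_gbps) * ratio


if __name__ == "__main__":
    print(required_core_mhz(8, half_line_writes=True))   # Rx 480 with 8Gbps RAM
    print(required_core_mhz(8, half_line_writes=False))
```
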
sr. member
Activity: 449
Merit: 251
November 19, 2016, 10:35:18 PM
#37
I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.


Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0.  Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM.  He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470.  The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470.  I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.


Hmm, I was thinking the 390x had 320GB/s memory bandwidth; I guess that's the 290x.  It seemed right since the 390x isn't much faster than the 480 at Ethereum.  It's about 27% faster at Ethereum, and 27% faster at Ethash (as of CM 7.0, and many revisions before, this is the general trend).  Both algorithms are affected by timings, though less so with newer CM versions.  Your estimate seems reasonable, I am just speculating. I don't do side-project programming these days, as I'm developing hand issues, so I can only speculate :(
sr. member
Activity: 588
Merit: 251
November 19, 2016, 09:01:14 PM
#36
I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.


Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0.  Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM.  He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470.  The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470.  I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.
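The ratio comparison above can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the figures quoted in the post (300 and 173 sols/s, 384 and 224 GB/s); the variable names are made up for illustration.

```python
# Figures taken from the post; names are illustrative only.
claimed_390x_sols = 300.0   # Claymore's estimate for the 390X
limit_470_sols = 173.0      # calculated limit for the Rx 470
bw_390x_gbs = 384.0         # 390X memory bandwidth, GB/s
bw_470_gbs = 224.0          # Rx 470 memory bandwidth, GB/s

sols_ratio = claimed_390x_sols / limit_470_sols
bw_ratio = bw_390x_gbs / bw_470_gbs

# Scaling the Rx 470 limit purely by memory bandwidth.
scaled_limit_390x = limit_470_sols * bw_ratio

print(f"sols ratio      : {sols_ratio:.2f}")
print(f"bandwidth ratio : {bw_ratio:.2f}")
print(f"bandwidth-scaled 390X limit: {scaled_limit_390x:.0f} sols/s")
```

If the limit really is bandwidth-bound, the bandwidth-scaled figure should land close to the 300 estimate, which it does.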
sr. member
Activity: 449
Merit: 251
November 19, 2016, 04:59:17 PM
#35
Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
Claymore 7.0 I get around 155 Sol/s on 470 4GB, and 164 Sol/s on 480 4GB, default clocks.  So either V7 is at/above theoretical limit, there are additional tricks to get speed, or theoretical limit is actually higher.

I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.


Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0.  Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM.  He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145
sr. member
Activity: 588
Merit: 251
November 19, 2016, 11:34:56 AM
#34
Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
Claymore 7.0 I get around 155 Sol/s on 470 4GB, and 164 Sol/s on 480 4GB, default clocks.  So either V7 is at/above theoretical limit, there are additional tricks to get speed, or theoretical limit is actually higher.

I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated.  If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%.  It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it.  I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements.  I doubt anyone will find a serious mistake now.  Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga.  That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes.  With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record.  A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.
sr. member
Activity: 449
Merit: 251
November 19, 2016, 09:16:52 AM
#33
The RX 480 has 8Gbps memory, which is faster than the RX 470's, so the speed is not over the theoretical limit; the 480 should be faster than the 470 at default clocks.
Both the 470 and 480 have 4GB and 8GB variants.  8GB 480 gets 180 Sol/s.  4GB cards have 6.6-7 GHz effective speed, 8GB typically 8GHz, but some manufacturers (MSI 470 ...) cheaped out.  On Ethereum, the 470 and 480 are nearly identical, given the same memory.

Sapphire Nitro cards (7 or 8GHz), this is what I get

470 4GB 155
470 8GB 160

480 4GB 165
480 8GB 178

All have memory strap mod, but default memory clock, so since the theory discussed is only based on bandwidth, it shouldn't affect the limit.
legendary
Activity: 1901
Merit: 1024
November 18, 2016, 08:25:21 PM
#32
The RX 480 has 8Gbps memory, which is faster than the RX 470's, so the speed is not over the theoretical limit; the 480 should be faster than the 470 at default clocks.
sr. member
Activity: 449
Merit: 251
November 18, 2016, 08:17:17 PM
#31
p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.

Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
Claymore 7.0 I get around 155 Sol/s on 470 4GB, and 164 Sol/s on 480 4GB, default clocks.  So either V7 is at/above theoretical limit, there are additional tricks to get speed, or theoretical limit is actually higher.
sr. member
Activity: 588
Merit: 251
November 18, 2016, 09:59:31 AM
#30
p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.

Yesterday I realized I forgot to account for the 1:1 command:burst ratio.  Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.  It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes.  About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration.  That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio.  Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
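As a rough check of the revised estimate, the sketch below redoes the arithmetic from the 4.2 transfers-per-row figure. It assumes the same constants used elsewhere in the thread (64-byte cache lines, about 1 million rows, 9 rounds, 224GB/s, 1.88 solutions per iteration, 93% refresh ratio) and the same loose MB/GB rounding; the names are illustrative.

```python
# Constants from the thread; variable names are illustrative.
TRANSFERS_PER_ROW = 4.2   # average cache-line transfers per row
CACHE_LINE_BYTES = 64
ROUNDS = 9
BANDWIDTH_GBS = 224.0     # Rx 470 with 7Gbps memory
SOLS_PER_ITER = 1.88
REFRESH_RATIO = 0.93      # fraction of the limit reached in practice

# With ~1 million rows, bytes per row equals MB per round.
mb_per_round = TRANSFERS_PER_ROW * CACHE_LINE_BYTES
gb_per_iter = mb_per_round * ROUNDS / 1000.0

iters_per_sec = BANDWIDTH_GBS / gb_per_iter
sols_limit = iters_per_sec * SOLS_PER_ITER
sols_with_refresh = sols_limit * REFRESH_RATIO

print(f"{mb_per_round:.0f} MB/round, {gb_per_iter:.2f} GB/iteration")
print(f"limit ~{sols_limit:.0f} sols/s, ~{sols_with_refresh:.0f} sols/s after refresh")
```

The result comes out within about 1% of the 173 and 160 sols/s quoted in the post; the small gap presumably comes from rounding in the original calculation.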
sr. member
Activity: 588
Merit: 251
November 15, 2016, 01:07:21 PM
#29
So, looks like Claymore v6 reached the limit?

I was just about to post that, as I'm getting ~145 sol on a Rx 470 clocked at 1250/1750.  I expect it will take much longer now to reach the 225 sols limit for 64-byte IO optimization.

newbie
Activity: 35
Merit: 0
November 15, 2016, 01:02:27 PM
#28
So, looks like Claymore v6 reached the limit?
mrb
legendary
Activity: 1512
Merit: 1028
November 15, 2016, 12:57:50 PM
#27
You are incorrect.  Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips).  The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. OK, misconception clarified :)
sr. member
Activity: 588
Merit: 251
November 15, 2016, 12:18:49 PM
#26
The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum of an 8-bit burst, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.

You are incorrect.  Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips).  The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

edit: see also slide #34
http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah
mrb
legendary
Activity: 1512
Merit: 1028
November 15, 2016, 11:28:34 AM
#25
The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum of an 8-bit burst, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.
sr. member
Activity: 588
Merit: 251
November 15, 2016, 10:51:00 AM
#24
On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.


Hi, nerdralph! Would you be interested in going after my Cuckoo Cycle bounties?

https://github.com/tromp/cuckoo

After a quick scan of your readme, it doesn't look appealing.  While it might be fun, it doesn't seem to have any practical application in a popular cryptocurrency like BTC, ETH, or ZEC.
sr. member
Activity: 588
Merit: 251
November 15, 2016, 10:40:02 AM
#23
Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


even if the potential might be higher? or maybe you know already that this will be never the case?

Based on my limited knowledge of Nvidia GPUs, I think they have 32-byte cache lines, which should give them an advantage for equihash.
legendary
Activity: 1000
Merit: 1120
November 15, 2016, 09:21:20 AM
#22
On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.


Hi, nerdralph! Would you be interested in going after my Cuckoo Cycle bounties?

https://github.com/tromp/cuckoo
legendary
Activity: 1498
Merit: 1030
November 15, 2016, 05:56:21 AM
#21
And the power usage on ETH for the 1070 vs the RX 480 is also very similar - pretty much a dead heat on a hash/watt basis.

 Unfortunately for ETH or ZEC miners the 1070 is almost twice the cost of the RX480 while not offering comparable hash/$.



full member
Activity: 243
Merit: 105
November 15, 2016, 02:54:28 AM
#20
Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


There is no difference in speed between the CUDA and OpenCL implementations of silentarmy. CUDA has an advantage in compute-bound algorithms, where its inline assembly (LOP3 and others) can be used. In memory-hard algorithms it does not matter whether you use CUDA or OpenCL.

On heavily overclocked 1070s (Samsung memory) I'm getting ~590 s/s from 6 cards, which is about 97-98 s/s per card. eXtremal got 90 s/s from a BIOS-tuned RX 480.
I don't have a 480, but I have a 470; on Ethereum it gets 27 MH/s, while an overclocked 1070 with Samsung memory gets 31-32 MH/s. That is about 15-18% more than AMD. So 92-98 s/s on a 1070 vs ~80-85 on a 470 is proportional to the Ethereum hashrate difference.
legendary
Activity: 3248
Merit: 1072
November 15, 2016, 02:41:03 AM
#19
Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.


even if the potential might be higher? or maybe you know already that this will be never the case?
sr. member
Activity: 588
Merit: 251
November 14, 2016, 08:28:15 PM
#18
p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.


so the 225  would be max for the 4gb and the 8gb cards?

Yes.  I'm pretty sure with 3.5GB for the table data that the remaining 0.5GB on a 4GB card would be enough for the row counters and any other small data structures required.
sr. member
Activity: 588
Merit: 251
November 14, 2016, 08:23:59 PM
#17
Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards.  Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.
legendary
Activity: 4354
Merit: 9201
'The right to privacy matters'
November 14, 2016, 08:15:30 PM
#16
On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.

Eth miners max out at around 93% of the theoretical maximum.  24Mh/s is the theoretical max for a R9 380 with 6Gbps memory, and I've been able to get 22.3Mh/s out of a couple of cards.  You'll never reach 100% because refresh consumes some of the bandwidth, perhaps as much as 5%.

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.


so the 225  would be max for the 4gb and the 8gb cards?
legendary
Activity: 1108
Merit: 1005
November 14, 2016, 06:23:42 PM
#15
Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.
hero member
Activity: 751
Merit: 517
Fail to plan, and you plan to fail.
November 14, 2016, 06:06:00 PM
#14
Fascinating stuff really. Thank you for trying to explain to us laymen how this stuff works. I'm not much of a programmer myself, but I've always wanted to try and understand better how miners work, what kind of data is processed and how etc. I'll be following this thread closely :)
Also I have huge respect for people like you, genoil, mrvb etc who work hard on these complex problems and still release stuff for free.
sr. member
Activity: 588
Merit: 251
November 14, 2016, 03:30:40 PM
#13
On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch.  I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.

Eth miners max out at around 93% of the theoretical maximum.  24Mh/s is the theoretical max for a R9 380 with 6Gbps memory, and I've been able to get 22.3Mh/s out of a couple of cards.  You'll never reach 100% because refresh consumes some of the bandwidth, perhaps as much as 5%.

p.s. I also have another idea that should work on 4GB cards.  The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion.  This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record.  This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones.  This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration.  That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second.  Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.
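The numbers in this post can be reproduced with the sketch below. It assumes 2^20 rows, 9 tables (one per round) and the 224GB/s and 1.88 solutions-per-iteration figures used elsewhere in the thread, and it follows the post's loose MB/GB rounding; the names are illustrative.

```python
ROWS = 2 ** 20            # bins (rows)
SLOTS_PER_ROW = 12
SLOT_BYTES = 32
ROUNDS = 9                # one fresh table per round
BANDWIDTH_GBS = 224.0     # Rx 470 with 7Gbps memory
SOLS_PER_ITER = 1.88

# The 93% figure comes from the eth result in the post: 22.3 of 24 Mh/s.
eth_ratio = 22.3 / 24.0

# Memory footprint: one table per round.
table_mib = SLOTS_PER_ROW * SLOT_BYTES * ROWS / 2 ** 20
all_tables_gib = table_mib * ROUNDS / 1024

# IO: 3 cache-line transfers of 64 bytes per row, per round.
io_mb_per_round = 3 * 64 * ROWS / 2 ** 20
io_gb_per_iter = io_mb_per_round * ROUNDS / 1000.0

iters_per_sec = BANDWIDTH_GBS / io_gb_per_iter
sols_limit = iters_per_sec * SOLS_PER_ITER
sols_real = sols_limit * 0.93

print(f"eth ratio {eth_ratio:.3f}")
print(f"table {table_mib:.0f} MiB, all tables {all_tables_gib:.2f} GiB")
print(f"{io_mb_per_round:.0f} MB/round, {io_gb_per_iter:.3f} GB/iteration")
print(f"~{iters_per_sec:.0f} iter/s, ~{sols_limit:.0f} sols/s, ~{sols_real:.0f} real-world")
```

The outputs land close to the rounded figures in the post (130 iterations/s, around 240 sols/s, about 225 sols/s real-world).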
sr. member
Activity: 588
Merit: 251
November 14, 2016, 03:21:30 PM
#12
What role does the memory bus width play into regarding the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.

That makes sense, since a 384-bit wide memory bus at 1.5Ghz (6Gbps) has a bit more bandwidth than a 256-bit wide bus at 8Gbps.
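The comparison follows from peak bandwidth = bus width x per-pin data rate / 8. A minimal sketch, using the bus widths and data rates mentioned in this thread (the helper name is made up for illustration):

```python
def peak_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Peak memory bandwidth in GB/s: bits per transfer times rate, over 8."""
    return bus_width_bits * gbps_per_pin / 8


# (bus width in bits, data rate in Gbps per pin), as quoted in the thread.
cards = {
    "7950 at 1.5GHz (6Gbps)": (384, 6.0),
    "256-bit card at 8Gbps": (256, 8.0),
    "Rx 470 at 7Gbps": (256, 7.0),
    "R9 390 at 1500MHz (6Gbps)": (512, 6.0),
}

for name, (width, rate) in cards.items():
    print(f"{name}: {peak_bandwidth_gbs(width, rate):.0f} GB/s")
```

This gives 288GB/s for the 384-bit bus against 256GB/s for the 256-bit bus at 8Gbps, and reproduces the 224GB/s and 384GB/s figures used elsewhere in the thread.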
sr. member
Activity: 438
Merit: 250
November 14, 2016, 01:59:23 PM
#11
Dude, where is your own miner? :D

Next coin I expect you to be one of the top dogs in the pit :*

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?
newbie
Activity: 58
Merit: 0
November 14, 2016, 12:27:48 PM
#10
What role does the memory bus width play into regarding the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.

I just bought an older 3gb version of 7950 to see how they perform with your optimized memory straps.
Also I have the 470 4gb Nitro, which makes 110-120 sols

I was wondering about the memory bus as well.
legendary
Activity: 3808
Merit: 1723
November 14, 2016, 12:22:14 PM
#9
What role does the memory bus width play into regarding the speeds? Because many of these old 7950s are getting almost the same speeds as the 470/390.

sr. member
Activity: 588
Merit: 251
November 14, 2016, 12:17:32 PM
#8
Does it mean the R9 390 which has 512 bit memory bus and 1500 Mhz, should be faster than the 470?

An optimal implementation should be faster.
newbie
Activity: 18
Merit: 0
November 14, 2016, 12:05:02 PM
#7
Does it mean the R9 390 which has 512 bit memory bus and 1500 Mhz, should be faster than the 470?
sr. member
Activity: 588
Merit: 251
November 14, 2016, 08:55:19 AM
#6
...

Therefore a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions, but it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.

A few noob questions if you don't mind.
What's the theoretical limit on the RX 470 8G Nitro cards with RAM clocked at 8Gbps (256GB/s)? Also, does overclocking the memory result in a linear increase in performance?
Does this all mean that equihash solving isn't GPU compute limited, but rather memory limited? If so, I wonder why GPU-Z shows 100% GPU load vs sub-40% memory controller load (whereas mining Eth fully loads both core and mem controller...)

Fascinating stuff. Thanks in advance.

A Rx 470 at 8Gbps would have a theoretical limit 8/7 times faster than one at 7Gbps.
The only part of equihash that is compute limited is the blake2b initialization.  The intention of the authors was for the algorithm to be limited by memory bandwidth.
https://www.internetsociety.org/sites/default/files/blogs-media/equihash-asymmetric-proof-of-work-based-generalized-birthday-problem.pdf

As for what GPU-z shows, you'll have to figure out how to correctly interpret what it reports on your own.  I do my OpenCL development on Linux, and even if there was a Linux version, I don't consider GPU-z a useful tool for kernel developers.
full member
Activity: 157
Merit: 100
November 14, 2016, 12:21:51 AM
#5
...

Therefore a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on a Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.

The theoretical equihash performance limit on a Rx 470 is likely about 25% faster than 146 solutions, but it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.

A few noob questions if you don't mind.
What's the theoretical limit on the RX 470 8G Nitro cards with RAM clocked at 8Gbps (256GB/s)? Also, does overclocking the memory result in a linear increase in performance?
Does this all mean that equihash solving isn't GPU compute limited, but rather memory limited? If so, I wonder why GPU-Z shows 100% GPU load vs sub-40% memory controller load (whereas mining Eth fully loads both core and mem controller...)

Fascinating stuff. Thanks in advance.
sr. member
Activity: 588
Merit: 251
November 13, 2016, 10:22:34 PM
#4
"it involves using 64-byte data structures"

How much change/coding would the transition to 64-byte data structures require?

Someone like eXtremal could probably do it in a week, re-using parts of silentarmy.  It would take me 2-3 times longer.  I can write top-quality code, but I don't pump it out as fast as some other coders.
hero member
Activity: 1008
Merit: 1000
November 13, 2016, 08:59:35 PM
#3
Interesting to see that after 2 weeks we are fairly close to the limits.
hero member
Activity: 672
Merit: 500
November 13, 2016, 08:52:52 PM
#2
"it involves using 64-byte data structures"

How much change/coding would the transition to 64-byte data structures require?
sr. member
Activity: 588
Merit: 251
November 13, 2016, 08:28:53 PM
#1
As I write this the fastest ZEC miners are Claymore v5 and Optiminer v 0.3.1.  Just yesterday silentarmy v5 was faster than Claymore v4, but then Claymore v5 leapfrogged silentarmy.  But the days of doubling the performance of ZEC miners are over, as the software is approaching the hardware performance limits of the GPUs (at least AMD GPUs).  In order to understand why, it helps to understand a bit about the zcash equihash algorithm.  For the math nerds, it's based on Wagner's algorithm for solving the generalized birthday problem.  Specifically, 2 million pseudo-random numbers are generated using blake2b (see http://blake2.net/).  Each of these numbers is 200 bits (25 bytes), and they are sorted to find pairs of numbers that result in collisions on the first 20 bits.  On average, there are about 2 million pairs that collide on the first 20 bits.  Those pairs are XORed, and the resulting numbers are sorted on the next 20 bits.  This continues for 8 rounds, until 40 bits are left, where there will be 2 (actually 1.88) collisions on the last 40 bits.  These last 2 collisions are the solutions to the equihash proof of work.
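To make the collision step concrete, here is a toy sketch of a single round in Python. It is not the real Zcash hashing or any miner's code: the seed, the 40-bit width and the 8-bit chunk are scaled-down stand-ins for 200 and 20 bits, the function names are made up, and real implementations also track which inputs produced each XOR result.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def toy_hashes(seed: bytes, count: int, nbits: int) -> list:
    """Derive `count` pseudo-random nbits-wide integers from BLAKE2b."""
    values = []
    for i in range(count):
        digest = hashlib.blake2b(seed + i.to_bytes(4, "little")).digest()  # 64 bytes
        values.append(int.from_bytes(digest, "big") >> (512 - nbits))
    return values


def collision_round(values, width, chunk):
    """Bin width-bit values on their top `chunk` bits and XOR each colliding pair.

    The colliding bits cancel, so every result fits in width - chunk bits.
    """
    bins = defaultdict(list)
    for v in values:
        bins[v >> (width - chunk)].append(v)
    xored = []
    for members in bins.values():
        for a, b in combinations(members, 2):
            xored.append(a ^ b)
    return xored


NBITS, CHUNK = 40, 8   # toy stand-ins for 200 and 20
start = toy_hashes(b"toy-seed", 2 ** (CHUNK + 1), NBITS)
after_one = collision_round(start, NBITS, CHUNK)
print(len(start), "inputs ->", len(after_one), "XORed pairs of", NBITS - CHUNK, "bits")
```

Using 2^(CHUNK+1) inputs mirrors the post's ratio of 2 million numbers to 20 collision bits, which is why the number of pairs coming out of a round stays roughly equal to the number going in.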

Starting with 25 bytes of data, the natural choice for a data structure would be records of 32 bytes each.  In the silentarmy implementation (https://github.com/mbevand/silentarmy) these records are called slots.  Although the original (CPU-based) equihash algorithm uses a radix sort, the fastest sorting algorithm for equihash is a bin sort, with 2^20 (1 million) bins (silentarmy calls them rows).  At each round, the next 20 bits determine the bin to save the XOR data to, followed by a scan of all bins to find those with at least 2 records (slots) filled.  With an average of 2 million records of 32 bytes each, that's 64MB of data to scan each round.  You might think that there's also 64MB of data to write (into the bins) each round, but on an AMD GPU, there will be 128MB of writes to RAM for storing data in the bins.  The reason is that AMD memory channels are 64 bits wide, and GDDR5 transfers a minimum of an 8-word burst, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time.  In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 bytes back to RAM.  Therefore writing two 32-byte slots to a bin involves reading 64 bytes, writing 64 bytes, and repeating once more.  Counting the 64MB scan plus two reads and two writes of 64MB each, a reasonably efficient equihash implementation will do 5 * 64 * 1 million bytes (320MB) of IO per round.  With 9 rounds that means 2.88GB per iteration, or 77.8 iterations per second on an Rx 470 with RAM clocked at 7Gbps (224GB/s memory bandwidth).  At 1.88 solutions per iteration, that's an average of 146 solutions/second, or about 25% faster than Claymore v5.
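
For anyone who wants to check the arithmetic or plug in their own card, the estimate above can be written out as a few lines of Python.  The variable names are mine, the 256-bit bus width is an assumption I added to show where 224GB/s comes from, and the model uses the same rounded "1 million" figures as the post, so treat it as a back-of-the-envelope sketch and not a benchmark.

```python
# Rounded figures from the post: ~2 million 32-byte slots per round,
# i.e. ~1 million 64-byte memory transactions' worth of data.
LINE_BYTES = 64            # minimum GDDR5 transfer on a 64-bit channel (8-word burst)
LINES = 1_000_000          # 2 million slots * 32 bytes / 64 bytes per line
SCAN_PASSES = 1            # reading all bins to find collisions
RMW_PASSES = 4             # (read 64 + write 64) for each of the two 32-byte slots
ROUNDS = 9
SOLUTIONS_PER_ITERATION = 1.88

# Assumption (not stated in the post): 256-bit memory bus at 7 Gbps per pin.
PIN_RATE_BPS = 7_000_000_000
BUS_WIDTH_BITS = 256
bandwidth_bytes_per_s = PIN_RATE_BPS * BUS_WIDTH_BITS // 8       # 224 GB/s

bytes_per_round = (SCAN_PASSES + RMW_PASSES) * LINE_BYTES * LINES  # 320 MB
bytes_per_iteration = bytes_per_round * ROUNDS                     # 2.88 GB
iterations_per_s = bandwidth_bytes_per_s / bytes_per_iteration
solutions_per_s = iterations_per_s * SOLUTIONS_PER_ITERATION

print(f"IO per round:      {bytes_per_round / 1e6:.0f} MB")
print(f"IO per iteration:  {bytes_per_iteration / 1e9:.2f} GB")
print(f"Iterations/second: {iterations_per_s:.1f}")
print(f"Solutions/second:  {solutions_per_s:.0f}")
```

Changing PIN_RATE_BPS or BUS_WIDTH_BITS gives the same bandwidth-bound estimate for other memory clocks or cards, under the same assumptions about access pattern.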

The theoretical equihash performance limit on an Rx 470 is likely about 25% faster than 146 solutions/second, but reaching it involves using 64-byte data structures that require a lot more memory.  So much memory that I think it will not be possible with 4GB cards.  At least it will be something for owners of 8GB Rx 480 cards to be happy about.