limits of ZEC mining | Bitcointalksearch.org

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: mjosephs on September 13, 2017, 08:07:36 PM

Quote from: nerdralph on November 18, 2016, 09:59:31 AM

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.

I don't think this is correct. See the diagram below from the JEDEC GDDR5 specification showing gapless reads from a single row; half the command slots are NOPs (particularly at time=T1). These slots can be used to send ACTIVATE commands to other banks, so as long as you distribute your workload across the chip's banks (and observe that annoying tFAW or find a chip that doesn't really need it) you can indeed do totally random reads at the full pin bitrate and the only requirement is that the read size is at least 32bits*8wordburst=256bits=32bytes by using AUTO PRECHARGE to eliminate the explicit PRECHARGE command on the command bus.

In particular, the command:burst ratio is 2:1 for 8word bursts, not 1:1. Maybe you missed the fact that the address pins are DDR in GDDR5?

I have implemented a system that does exactly this (stealing that NOP slot to ACTIVATE a different bank) on DDR2 and it works. Granted DDR2 is not GDDR5, but the idea that an ACTIVATE-READ command pair ought to use the same amount of command bus time as the data it procures is something JEDEC has worked to preserve across many generations of memory with a wide variety of burst lengths and timings. I don't think GDDR5 would depart from this lightly.

Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together. GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs. I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.

I've been focused on non-mining projects over the summer, but now I'm getting ready to fire up a couple rigs and get back into mining.
I looked into auto precharge back in the spring, and couldn't find anything in the timing straps that relates to it. I'm guessing it is part of the memory controller firmware.
Also don't forget that with AMD GCN, each channel is 64 bits wide (2 GDDR5 chips), so the minimum random read is effectively 64 bytes. And since there seems to be no way to control auto precharge, the minimum full-speed read is 128 bytes. Note that 128 bytes is exactly the size of a DAG read in ethash (Ethereum).

Writes are more complicated, since Marc Bevand found that the GCN memory controller is capable of writing to just one of the two GDDR chips in a channel when one half of a cache line is not dirty.
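The access sizes discussed in this post can be worked out from the bus width and burst length. The snippet below is only a sketch of that arithmetic: the constant and function names are made up for illustration, and the factor of two for the "no auto precharge" case is simply the 128-byte figure from the post modelled as two back-to-back bursts, not something taken from the GDDR5 spec.

```python
BURST_LENGTH = 8        # words per burst (from the quoted post)
CHIP_BUS_BITS = 32      # data width of one GDDR5 chip (from the quoted post)
CHIPS_PER_CHANNEL = 2   # AMD GCN: one 64-bit channel spans 2 chips (from the reply)


def burst_bytes(bus_bits, burst_length=BURST_LENGTH):
    """Bytes delivered by a single burst on a bus of the given width."""
    return bus_bits * burst_length // 8


chip_min = burst_bytes(CHIP_BUS_BITS)                         # one chip
channel_min = burst_bytes(CHIP_BUS_BITS * CHIPS_PER_CHANNEL)  # one GCN channel

# Assumption: without controllable auto precharge, full speed needs
# two bursts per access, which is how the post's 128-byte figure is modelled.
no_auto_precharge_min = 2 * channel_min

print(chip_min, channel_min, no_auto_precharge_min)
```

This gives 32 bytes per chip, 64 bytes per channel, and 128 bytes for the full-speed case, matching the numbers in the posts.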

NobodyIsHome

jr. member

Activity: 74

Merit: 1

Quote from: mjosephs on September 13, 2017, 08:07:36 PM

Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together. GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs. I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.

I agree.

Graphics rendering mostly operates as embarrassingly parallel convolutions on texture/framebuffer patches. Relatively random memory access (in VRAM) is much slower than on a CPU.

In addition, I find that in my OpenCL projects, random reads can often fall back into serialized fetches, which will stall the CU as it waits for a few threads in the wavefront to catch up with the rest.

mjosephs

full member

Activity: 129

Merit: 100

Quote from: nerdralph on November 18, 2016, 09:59:31 AM

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes.

I don't think this is correct. See the diagram below from the JEDEC GDDR5 specification showing gapless reads from a single row; half the command slots are NOPs (particularly at time=T1). These slots can be used to send ACTIVATE commands to other banks, so as long as you distribute your workload across the chip's banks (and observe that annoying tFAW or find a chip that doesn't really need it) you can indeed do totally random reads at the full pin bitrate and the only requirement is that the read size is at least 32bits*8wordburst=256bits=32bytes by using AUTO PRECHARGE to eliminate the explicit PRECHARGE command on the command bus.

In particular, the command:burst ratio is 2:1 for 8word bursts, not 1:1. Maybe you missed the fact that the address pins are DDR in GDDR5?

I have implemented a system that does exactly this (stealing that NOP slot to ACTIVATE a different bank) on DDR2 and it works. Granted DDR2 is not GDDR5, but the idea that an ACTIVATE-READ command pair ought to use the same amount of command bus time as the data it procures is something JEDEC has worked to preserve across many generations of memory with a wide variety of burst lengths and timings. I don't think GDDR5 would depart from this lightly.

Everything I wrote in this post is about GDDR5 independently of any particular GPU or even all of the GPUs on the market taken together. GPU memory controllers are not optimized for random/scattered reads like you find in most cryptocurrency mining PoWs. I would not be surprised if no GPU is actually able to do scattered full-bandwidth reads at the minimum 256-bit granularity allowed by the GDDR5 spec; that's just not something that's a top priority for rendering video games.

mjosephs

full member

Activity: 129

Merit: 100

Quote from: nerdralph on November 19, 2016, 11:34:56 AM

A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.

Sorry about the necropost here, but isn't the dataset for ZEC more than 32MB? You wrote:

Quote from: nerdralph on November 13, 2016, 08:28:53 PM

Specifically, 2 million pseudo-random numbers are generated using blake2b (see http://blake2.net/). Each of these numbers is 200 bits (25 bytes), and they are sorted

200bits*2million = 400mbits = 50mbytes, right?

If I got the math wrong, or if there is some trick for halving the storage requirements, I would be happy to be corrected. Thanks!
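The arithmetic in this question can be checked directly. The sketch below uses decimal megabytes and a made-up helper name; the 16-bytes-per-record figure comes from the quoted Nov 19 post ("the average amount of data manipulated per round is 16 bytes per record"), and treating it as the source of the 32MB figure is an interpretation, not something the thread states outright.

```python
ENTRIES = 2_000_000   # pseudo-random numbers per run (from the quoted post)


def dataset_mb(entries, bytes_per_entry):
    """Total size in decimal megabytes."""
    return entries * bytes_per_entry / 1e6


full_mb = dataset_mb(ENTRIES, 200 / 8)   # 200-bit (25-byte) entries
compact_mb = dataset_mb(ENTRIES, 16)     # 16 bytes per record (assumed source of "32MB")

print(full_mb, compact_mb)
```

So the 50 MB figure is right for full 200-bit entries, while 16 bytes per record would come to 32 MB.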

datBTC

newbie

Activity: 16

Merit: 0

Quote from: minerx117 on March 21, 2017, 02:52:10 AM

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

You can't get 720 SOL from a 1080, you must mean the 1080 Ti or Titan X Pascal.

m1n1ngP4d4w4n

full member

Activity: 224

Merit: 100

CryptoLearner

Quote from: lexele on March 21, 2017, 11:28:03 AM

Quote from: Ambros on March 21, 2017, 09:54:57 AM

Quote from: groovy1962 on March 21, 2017, 09:19:35 AM

Quote from: minerx117 on March 21, 2017, 02:52:10 AM

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.

Nvidia is definitely better than AMD on everything, except for Ethash

I don't have NV cards, but XMR doesn't seem to be their favorite either from hashrates I saw. Am I wrong?

It's more like the XMR miners haven't been updated in quite some time, since XMR hasn't been profitable for Nvidia cards for half a year at least, if ever.

lexele

full member

Activity: 192

Merit: 100

Quote from: Ambros on March 21, 2017, 09:54:57 AM

Quote from: groovy1962 on March 21, 2017, 09:19:35 AM

Quote from: minerx117 on March 21, 2017, 02:52:10 AM

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.

Nvidia is definitely better than AMD on everything, except for Ethash

I don't have NV cards, but XMR doesn't seem to be their favorite either from hashrates I saw. Am I wrong?

Ambros

hero member

Activity: 653

Merit: 500

Quote from: groovy1962 on March 21, 2017, 09:19:35 AM

Quote from: minerx117 on March 21, 2017, 02:52:10 AM

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.

Nvidia is definitely better than AMD on everything, except for Ethash

groovy1962

newbie

Activity: 54

Merit: 0

Quote from: minerx117 on March 21, 2017, 02:52:10 AM

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

I'm consistently getting 420 sols at 150watts on 1070s with EWBF (as reported by nvidia-smi)

When I did my spreadsheet it looked like nvidia was the better deal for ROI on ZEC. Don't have any RADEONs to compare to, though, right now.

minerx117

sr. member

Activity: 728

Merit: 256

NemosMiner-v3.8.1.3

So it's time for nvidia. 1080 at 290watt of 760sols, 1070 at 130watt of 420sols. Interesting!

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: ?? on ??

Why is it MY problem? :P

I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

Just looking back at this now that I've finally written a strap decoder. Are you sure you're seeing tRFC too high? I think tRFC too low might cause a bit of a performance hit mining since no commands can be processed during the refresh.

"The REFRESH command is used during normal operation of the GDDR5 SGRAM. The command is non-persistent, so it must be issued each time a refresh is required. A minimum time tRFC is required between two REFRESH commands. The same rule applies to any access command after the refresh operation. All banks must be precharged prior to the REFRESH command.
The refresh addressing is generated by the internal refresh controller. This makes the address bits "Don't Care" during a REFRESH command. The GDDR5 SGRAM requires REFRESH cycles at an average periodic interval of tREFI(max). The values of tREFI for different densities are listed in Table 6. To allow for improved efficiency in scheduling and switching between tasks, some flexibility in the absolute refresh interval is provided. A maximum of eight REFRESH commands can be posted to the GDDR5 SGRAM, and the maximum absolute interval between any REFRESH command and the next REFRESH command is 9 * tREFI."
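As a rough way to reason about how much tRFC can matter: if no commands can be processed for tRFC after each REFRESH, and a REFRESH is issued on average every tREFI, the fraction of time lost is about tRFC / tREFI. The sketch below only illustrates that ratio. The timing values are hypothetical placeholders, not taken from this thread, a datasheet, or any VBIOS strap, and the function names are invented.

```python
def refresh_overhead(trfc_ns, trefi_ns):
    """Approximate fraction of time the DRAM is busy refreshing."""
    return trfc_ns / trefi_ns


def max_refresh_gap_ns(trefi_ns):
    """Longest allowed gap between two REFRESH commands (9 * tREFI per the quote)."""
    return 9 * trefi_ns


# Hypothetical placeholder timings, for illustration only.
TREFI_NS = 3900.0
for trfc_ns in (78.0, 156.0):
    lost = refresh_overhead(trfc_ns, TREFI_NS)
    print(f"tRFC={trfc_ns} ns -> about {lost:.1%} of command time lost")

print(max_refresh_gap_ns(TREFI_NS))
```

With these placeholder numbers, doubling tRFC doubles the lost fraction, but both stay at a few percent, which is in line with the remark that tRFC has "somewhat less of an effect".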

nerdralph

sr. member

Activity: 588

Merit: 251

After looking at some of zawawa's miner code, I realized using 16-byte data structures instead of 32 for the last few rounds would reduce the average number of core cycles required from 5 million per round to 4.5 million. Also, since the current parameters result in an average of 1.9 solutions per equihash iteration, with early pruning of duplicates the number of entries per round in the final rounds will be closer to 1.9 million than the 2 million I used in my calculations. Together these tweaks mean the upper limit for an OpenCL implementation on a Rx 470 (1250/1750 clocks) would be around 220 sol/s, with the core clock being the primary limiting factor. It is also possible to optimize the final round where the collision search is 40 bits instead of 20, and this could push the performance limit close to 230 sol/s.

As I've previously explained, the memory clock (and timing) would be the primary limiting factor in a GCN assembler implementation. Based on my recent memory benchmarking, the performance limit would be ~280 sol/s on a Rx 470.

I've also looked for shortcuts to the equihash algorithm and remain convinced that no significant (i.e. doubling of speed) optimizations can be made to the algorithm itself.

nerdralph

sr. member

Activity: 588

Merit: 251

I just finished doing some benchmarking, and have discovered that sequential write performance on AMD is 25-35% slower than sequential read performance. Read performance is close to expectations for Hawaii and Tonga, achieving ~ 90% of GDDR5 peak bandwidth. Pitcairn performance is about 10% slower, which suggests its memory controller is less sophisticated than the later GCN parts.
I'll publish my benchmark code after a bit more cleanup. In the meantime I can recommend another benchmark tool for testing the read bandwidth: clpeak.
https://github.com/krrishnarraj/clpeak

My benchmark reports read bandwidth that agrees with clpeak, so the write bandwidth numbers I'm seeing should be just as reliable.

edit: benchmark code posted to github. Detailed writeup to follow.
https://github.com/nerdralph/cl-mem
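To put the percentages above into absolute numbers, here is a small sketch. It is not the cl-mem benchmark itself; the function names are invented, and the example card (7 Gbps on a 256-bit bus, the 224GB/s Rx 470 figure that appears elsewhere in this thread) is chosen only for illustration. "25-35% slower" is interpreted here as write bandwidth being 65-75% of read bandwidth.

```python
def peak_bandwidth_gbs(data_rate_gbps, bus_width_bits):
    """Peak bandwidth in GB/s from per-pin data rate and bus width."""
    return data_rate_gbps * bus_width_bits / 8


def expected_read_gbs(peak_gbs, efficiency=0.90):
    """Read bandwidth at ~90% of peak, as reported for Hawaii and Tonga."""
    return peak_gbs * efficiency


def expected_write_range_gbs(read_gbs, slow_min=0.25, slow_max=0.35):
    """(low, high) write bandwidth if writes are 25-35% slower than reads."""
    return read_gbs * (1 - slow_max), read_gbs * (1 - slow_min)


peak = peak_bandwidth_gbs(7, 256)
read = expected_read_gbs(peak)
write_low, write_high = expected_write_range_gbs(read)
print(peak, round(read, 1), round(write_low, 1), round(write_high, 1))
```

For this example card that works out to roughly 200 GB/s for sequential reads and somewhere around 130-150 GB/s for sequential writes.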

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: Genoil on November 22, 2016, 04:46:40 PM

Quote from: nerdralph on November 22, 2016, 03:15:08 PM

While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus. On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock. In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds). All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache. A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction. This means a modified ht_store thread could update a row slot in 2 clocks. If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block. This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a ratio of approximately 5:6. In other words, when a Rx 470 has a memory clock of 1750Mhz, the core would need to be clocked at 1750 * 5/6 = 1458Mhz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?
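The clock ratios in the quoted post can be reproduced with a few lines of arithmetic. This is only a sketch with invented function names; the 5:6 and 1:2 core:memory ratios and the 64-byte cache line are taken from the post as given, not derived here.

```python
CACHE_LINE_BYTES = 64   # bytes per clock per memory controller (from the post)


def required_core_mhz(mem_mhz, ratio_core, ratio_mem):
    """Core clock needed to keep up with memory at a given core:memory ratio."""
    return mem_mhz * ratio_core / ratio_mem


def line_utilization(write_bytes, line_bytes=CACHE_LINE_BYTES):
    """Fraction of a cache-line transfer actually used by one write."""
    return write_bytes / line_bytes


print(round(required_core_mhz(1750, 5, 6)))   # counters in global memory (5:6)
print(required_core_mhz(1750, 1, 2))          # counters in LDS/GDS (1:2)
for size in (4, 8, 16):
    print(size, line_utilization(size))
```

At a 1750MHz memory clock the 5:6 ratio calls for about 1458MHz on the core, while the 1:2 ratio would need only 875MHz. The utilization figures show why 4- or 8-byte writes leave most of each 64-byte transfer unused.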

This is what I was trying to achieve in that snippet I sent you a while ago, the coalesced write thing. I just lacked all the theory behind it :D

I tried a simple mod to ht_store to reduce the number of writes by getting the compiler to emit flat_store_dwordx3 and flat_store_dwordx4. I couldn't use a straight uint3 vector type, because a uint3 pointer has to be 16-byte aligned:
http://www.informit.com/articles/article.aspx?p=1732873&seqNum=3

Structs only have to be aligned to the largest type, so using this compiles and executes OK:

Code:

typedef struct
{
    uint3 uints;
} ui3;

ui3 buf1;
buf1.uints.s0 = i;
buf1.uints.s1 = xi0;        // low 32 bits of xi0
buf1.uints.s2 = xi0 >> 32;  // high 32 bits of xi0
// store "i" (always 4 bytes before Xi) and xi0
*(__global ui3 *)(p - 4) = buf1;

However the generated code is one flat_store_dword and one flat_store_dwordx2. I tried 3 individual uints in the struct, and that resulted in 3 flat_store_dword instructions.

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: laik2 on November 24, 2016, 12:10:21 PM

Quote from: nerdralph on November 24, 2016, 12:03:30 PM

Quote from: xeridea on November 24, 2016, 09:56:33 AM

CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct. See my Nov 20 post about data written to half a cache line. That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250Mhz core clock being the limiting factor rather than the memory clock (described in my Nov 22 post).

My Sapphires can go up to 1406 and 2250. Will that make any difference?

The 480 8G does close to 200 with 1300 and 2000. Some report as high as 215-220, but I don't like to push core/memory too much since I want the cards to last a long while, so I just undervolt and change memory straps. Memory straps seem to give less of a boost the more optimized the miner is. I have mostly 470s though, doing 175-185.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: nerdralph on November 24, 2016, 12:03:30 PM

Quote from: xeridea on November 24, 2016, 09:56:33 AM

CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct. See my Nov 20 post about data written to half a cache line. That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250Mhz core clock being the limiting factor rather than the memory clock (described in my Nov 22 post).

My Sapphires can go up to 1406 and 2250. Will that make any difference?

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: xeridea on November 24, 2016, 09:56:33 AM

CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

Correct. See my Nov 20 post about data written to half a cache line. That puts the theoretical limit up around 200 for an OpenCL implementation, with the 1250Mhz core clock being the limiting factor rather than the memory clock (described in my Nov 22 post).

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: xeridea on November 19, 2016, 10:35:18 PM

Quote from: nerdralph on November 19, 2016, 09:01:14 PM

Quote from: xeridea on November 19, 2016, 04:59:17 PM

Quote from: nerdralph on November 19, 2016, 11:34:56 AM

I'm getting 150 sol/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 Sols/s you are seeing on the Rx 480 is with a 1750Mhz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it. I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements. I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited to its cache bandwidth instead of the external GDDR5 bandwidth.

Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0. Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM. He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470. The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470. I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.
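The two ratios compared in this post are easy to verify. The sketch below just redoes the division with the numbers quoted in the thread; the variable names are invented for illustration.

```python
# Figures quoted in the thread.
SOLS_LIMIT_470 = 173      # calculated limit for the Rx 470, sol/s
SOLS_TARGET_390X = 300    # Claymore's estimate for the 390x, sol/s
BW_470_GBS = 224          # Rx 470 memory bandwidth, GB/s
BW_390X_GBS = 384         # 390x memory bandwidth, GB/s

sols_ratio = SOLS_TARGET_390X / SOLS_LIMIT_470
bandwidth_ratio = BW_390X_GBS / BW_470_GBS

# Implied sol/s per GB/s of memory bandwidth on each card.
per_gbs_470 = SOLS_LIMIT_470 / BW_470_GBS
per_gbs_390x = SOLS_TARGET_390X / BW_390X_GBS

print(round(sols_ratio, 2), round(bandwidth_ratio, 2))
print(round(per_gbs_470, 3), round(per_gbs_390x, 3))
```

Both estimates come out at a little under 0.8 sol/s per GB/s, which is the point being made: the two limits scale with memory bandwidth.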

Hmm, I was thinking 390x had 320GB/s memory bandwidth, I guess that's the 290x. It seemed right since 390x isn't much faster than 480 at Ethereum. It's about 27% faster Ethereum, and 27% faster Ethash (as of CM 7.0, and many revisions before, this is the general trend). Both algorithms are affected by timings, though less so with newer CM versions. Your estimate seems reasonable, I am just speculating. I don't do side projects programming these days, developing hand issues, so I can just speculate :(

CM 8.0 getting 187Sol/s on 480 4GB (7Gbps), and 176 on 470 4GB (7Gbps), Stock clocks, so it seems 173 is not the theoretical limit.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: ?? on ??

I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

That seems peculiar. A 128-byte DAG read is 2 cache lines, and would take 4 clocks (2 clocks/burst). If tRRD is 4, then the ACTIVATE command rate can keep up with the DAG reads. If it is 5, then you'd have a 20% performance hit. Maybe a bit less since 1 of every 8 ACTIVATE commands should be to the same bank rather than a different one. What offset is tRRD anyway?

edit: after giving this some more thought, I think I might know what's going on. If tRRD is 5, with a large queue of read requests, the memory controller could group multiple page open commands to the same bank, so that most of the time tRRD is not a factor. It still needs to read from all 8 banks, so occasionally it has to wait 5 clocks instead of 4 for an ACTIVATE.
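The 20% figure follows from comparing the clocks a DAG read occupies on the data bus with the minimum spacing between ACTIVATE commands. Here is a minimal sketch of that worst-case model. The function name is invented, and it assumes every read needs its own ACTIVATE to a different bank, which the edit above suggests is not what the memory controller actually does most of the time.

```python
CLOCKS_PER_BURST = 2    # from the post
BURSTS_PER_DAG_READ = 2 # a 128-byte DAG read is 2 cache lines


def activate_limited_rate(clocks_per_read, trrd):
    """Fraction of peak read rate when each read needs one new-bank ACTIVATE."""
    return min(1.0, clocks_per_read / trrd)


clocks_per_read = CLOCKS_PER_BURST * BURSTS_PER_DAG_READ   # 4 clocks
for trrd in (4, 5):
    rate = activate_limited_rate(clocks_per_read, trrd)
    print(f"tRRD={trrd}: {rate:.0%} of peak, {1 - rate:.0%} hit")
```

With tRRD at 4 the ACTIVATE rate keeps pace with the reads; at 5 this simple model gives the 20% hit, which is an upper bound if the controller can group page opens as described.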

deadsix

hero member

Activity: 751

Merit: 517

Fail to plan, and you plan to fail.

Quote from: ?? on ??

Writing tools to work with this shit also is taking up time.

Eagerly Awaited, Good Sir.
