limits of ZEC mining - page 2. | Bitcointalksearch.org

greaterninja

hero member

Activity: 924

Merit: 1000

Quote from: ?? on ??

Quote from: nerdralph on November 22, 2016, 03:15:08 PM

While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus. On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock. In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds). All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache. A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction. This means a modified ht_store thread could update a row slot in 2 clocks. If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block. This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory clock ratio of approximately 5:6. In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?

Why is it MY problem? :P

I'm a bit busy with VBIOS mods at this point. Figuring out what timings are causing slowdowns in algos like Eth is really fun and interesting. For example, once tRRD was lowered on the strap I was using, I gained half an MH/s (scales with clock.) Taking a lower strap entirely, however, will crash it. I've found tRFC to be WAAAAY too damn high, as well - this has somewhat less of an effect since DRAM is only refreshed so often, but it also helps. Writing tools to work with this shit also is taking up time.

I did this for about 2 months with the Radeon 480 cards. It was a very time-consuming process, but it did yield some interesting results.

I am now looking into miner development. I think there is a bigger picture here, but that is for private discussion.

Genoil

sr. member

Activity: 438

Merit: 250

Quote from: nerdralph on November 22, 2016, 03:15:08 PM

While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus. On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock. In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds). All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache. A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction. This means a modified ht_store thread could update a row slot in 2 clocks. If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block. This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory clock ratio of approximately 5:6. In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?

This is what I was trying to achieve in that snippet I sent you a while ago, the coalesced write thing. I just lacked all the theory behind it. :D

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

This is juicy stuff! Thanks, nerdralph!

nerdralph

sr. member

Activity: 588

Merit: 251

While my initial analysis was focused on the external GDDR5 bandwidth limits, current ZEC GPU mining software seems to be limited by the memory controller/core bus. On AMD GCN, each memory controller can xfer 64 bytes (1 cache line) per clock. In SA5, the ht_store function, in addition to adding to row counters, does 4 separate memory writes for most rounds (3 writes for the last couple rounds). All of these writes are either 4 or 8 bytes, so much less than 64 bytes per clock are being transferred to the L2 cache. A single thread (1 SIMD element) can transfer at most 16 bytes (dwordX4) in a single instruction. This means a modified ht_store thread could update a row slot in 2 clocks. If the update operation is split between 2 (or 4 or more) threads, one slot can be updated in one clock, since 2 threads can simultaneously write to different parts of the same 64-byte block. This would mean each row update operation could be done in 2 GPU core clock cycles; one for the counter update, and one for updating the row slot.
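The clock counting in the paragraph above can be sketched roughly as follows. This is only an illustration: the 32-byte slot size is borrowed from the silentarmy bin layout mentioned elsewhere in this thread, and the function name is made up.

```python
import math

MAX_BYTES_PER_THREAD = 16  # one dwordx4 store per SIMD element, per the post
CACHE_LINE_BYTES = 64      # one memory-controller transfer per clock, per the post
SLOT_BYTES = 32            # assumption: silentarmy-style 32-byte slots


def slot_update_clocks(slot_bytes: int, threads: int) -> int:
    """Clocks to write one slot when `threads` lanes each store up to 16 bytes per clock."""
    bytes_per_clock = min(threads * MAX_BYTES_PER_THREAD, CACHE_LINE_BYTES)
    return math.ceil(slot_bytes / bytes_per_clock)


for threads in (1, 2, 4):
    clocks = slot_update_clocks(SLOT_BYTES, threads)
    print(f"{threads} thread(s): {clocks} clock(s) per slot")
```

Under these assumptions a single thread needs 2 clocks per slot and 2 or more cooperating threads need 1, which is the best case described above; the counter update adds its own clock on top.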

Even with those changes, my calculations indicate that a ZEC miner would be limited by the core clock, according to a core:memory clock ratio of approximately 5:6. In other words, when a Rx 470 has a memory clock of 1750 MHz, the core would need to be clocked at 1750 * 5/6 = 1458 MHz in order to achieve maximum performance.
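As a quick check of that arithmetic, here is a minimal sketch. The helper name is hypothetical; the 5:6 ratio is the one above and the 1:2 ratio is the LDS/GDS case discussed in this same post.

```python
def required_core_mhz(memory_mhz: float, core_to_memory_ratio: float) -> float:
    """Core clock needed so the core side does not become the bottleneck."""
    return memory_mhz * core_to_memory_ratio


MEMORY_MHZ = 1750  # Rx 470 memory clock used in the post

print(round(required_core_mhz(MEMORY_MHZ, 5 / 6)))  # ratio of about 5:6
print(round(required_core_mhz(MEMORY_MHZ, 1 / 2)))  # 1:2 ratio, row counters in LDS/GDS
```

The second figure shows why the 1:2 case matters: at that ratio the required core clock drops well below typical Rx 470 core clocks.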

If the row counters can be kept in LDS or GDS, the core:memory ratio required would be 1:2, thereby allowing full use of the external memory bandwidth. There is 64KB of LDS per CU, and the AMD GCN architecture docs indicate the LDS can be globally addressed; i.e. one CU can access the LDS of another CU. However the syntax of OpenCL does not permit the local memory of one work-group to be accessed by a different work-group. There is only 64KB of GDS shared by all CUs, and even if the row counters could be stored in such a small amount of memory, OpenCL does not have any concept of GDS.

This likely means writing a top performance ZEC miner for AMD is the domain of someone who codes in GCN assembler. Canis lupus?

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: mrb on November 15, 2016, 12:57:50 PM

Quote from: nerdralph on November 15, 2016, 12:18:49 PM

You are incorrect. Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips). The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. Ok. Misconception clarified

Marc did some testing following this post, and determined that while reads result in 32 bytes being read from each of the two GDDR5 memory channels to fill a 64-byte cache line, writes are different. When data is only written to half of a cache line (32 bytes), due to the dirty byte mask the controller knows only one of the 2 GDDR5 memory channels is affected, and so will only write to one of them. However this does not mean the write bandwidth is double what I originally calculated, as writing 2 32-byte chunks of memory to the memory controller requires 2 core clocks. This would require the GPU core to be clocked the same as the memory, i.e. 2 GHz for a Rx 480 with 8Gbps RAM. This is due to the core:memory clock ratio limit I described here:
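A rough sketch of why half-line writes push the required core clock up to the memory clock. It assumes that one 64-byte transfer per core clock corresponds to the 0.5 core:memory ratio from the linked topic, and that the memory clock is the per-pin data rate divided by 4 (7 Gbps at 1750 MHz, as used in this thread). The function names are hypothetical.

```python
def memory_clock_mhz(gbps_per_pin: float) -> float:
    # GDDR5 as described in this thread: 7 Gbps per pin at a 1750 MHz memory clock
    return gbps_per_pin * 1000 / 4


def core_clock_needed_mhz(gbps_per_pin: float, bytes_per_core_clock: int) -> float:
    # Assumption: moving 64 bytes per core clock matches the 0.5 core:memory ratio,
    # so moving only 32 bytes per core clock doubles the required ratio.
    ratio = 0.5 * 64 / bytes_per_core_clock
    return memory_clock_mhz(gbps_per_pin) * ratio


print(core_clock_needed_mhz(8, 64))  # full 64-byte cache-line writes
print(core_clock_needed_mhz(8, 32))  # 32-byte half-line writes
```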
https://bitcointalksearch.org/topic/amd-corememory-clock-ratio-fundamental-limit-05-1682003

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: nerdralph on November 19, 2016, 09:01:14 PM

Quote from: xeridea on November 19, 2016, 04:59:17 PM

Quote from: nerdralph on November 19, 2016, 11:34:56 AM

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 sols/s you are seeing on the Rx 480 is with a 1750 MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it. I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements. I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an Equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited by its cache bandwidth instead of the external GDDR5 bandwidth.

Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0. Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM. He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470. The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470. I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.

Hmm, I was thinking the 390x had 320GB/s memory bandwidth; I guess that's the 290x. It seemed right since the 390x isn't much faster than the 480 at Ethereum. It's about 27% faster at Ethereum, and 27% faster at Equihash (as of CM 7.0 and many revisions before, this is the general trend). Both algorithms are affected by timings, though less so with newer CM versions. Your estimate seems reasonable; I am just speculating. I don't do side-project programming these days because I'm developing hand issues, so I can only speculate. :(

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: xeridea on November 19, 2016, 04:59:17 PM

Quote from: nerdralph on November 19, 2016, 11:34:56 AM

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 sols/s you are seeing on the Rx 480 is with a 1750 MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it. I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements. I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an Equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited by its cache bandwidth instead of the external GDDR5 bandwidth.

Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0. Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM. He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

300 for a 390x is 1.73x the 173 limit I calculated for a Rx 470. The 384GB/s memory bandwidth of the 390x is 1.71x the 224GB/s memory bandwidth of the Rx 470. I'd say that's not a coincidence, and Claymore has come to the same conclusions that I have about the limits of ZEC mining performance.
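The two ratios can be checked directly. This small sketch only uses the numbers quoted in this post; the constant and function names are made up for illustration.

```python
RX470_LIMIT_SOLS = 173  # calculated Rx 470 limit from this thread
RX470_BW_GB_S = 224     # Rx 470 memory bandwidth quoted above
R390X_BW_GB_S = 384     # 390x memory bandwidth quoted above
R390X_CLAIM_SOLS = 300  # figure attributed to Claymore


def bandwidth_scaled_limit(base_limit: float, base_bw: float, target_bw: float) -> float:
    """Scale a bandwidth-bound limit linearly with memory bandwidth."""
    return base_limit * target_bw / base_bw


print(round(R390X_CLAIM_SOLS / RX470_LIMIT_SOLS, 2))  # ratio of hash rates
print(round(R390X_BW_GB_S / RX470_BW_GB_S, 2))        # ratio of bandwidths
print(round(bandwidth_scaled_limit(RX470_LIMIT_SOLS, RX470_BW_GB_S, R390X_BW_GB_S), 1))
```

Scaling the 173 sols/s limit purely by bandwidth gives a 390x figure just under 300, which is the coincidence being pointed out.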

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: nerdralph on November 19, 2016, 11:34:56 AM

Quote from: xeridea on November 18, 2016, 08:17:17 PM

Quote from: nerdralph on November 18, 2016, 09:59:31 AM

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes. It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes. About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration. That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio. Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.

With Claymore 7.0 I get around 155 Sol/s on a 470 4GB, and 164 Sol/s on a 480 4GB, at default clocks. So either V7 is at/above the theoretical limit, there are additional tricks to get speed, or the theoretical limit is actually higher.

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 sols/s you are seeing on the Rx 480 is with a 1750 MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it. I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements. I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an Equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited by its cache bandwidth instead of the external GDDR5 bandwidth.

Yeah I was basing on 480 4GB, since 470 seems somewhat compute limited on CM 7.0. Since it is around the theoretical max, I am thinking there may be ways to get a bit more, some tricks to save some bandwidth, but of course we can't know since we are not CM. He says he figures 300H is possible on 390X, which currently gets 230, but it will take a lot more work.

https://bitcointalksearch.org/topic/m.16924145

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: xeridea on November 18, 2016, 08:17:17 PM

Quote from: nerdralph on November 18, 2016, 09:59:31 AM

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes. It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes. About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration. That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio. Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.

With Claymore 7.0 I get around 155 Sol/s on a 470 4GB, and 164 Sol/s on a 480 4GB, at default clocks. So either V7 is at/above the theoretical limit, there are additional tricks to get speed, or the theoretical limit is actually higher.

I'm getting 150 sols/s on my Rx 470, and even your 155 is still below the 160 sols/s limit I calculated. If the 164 sols/s you are seeing on the Rx 480 is with a 1750 MHz memory clock, then that is ~95% of my 173 sols/s instead of 93%. It's possible the refresh may only impact the burst transfers and not the command rate, which could mean no impact on the 173 sols/s since less than 75% of the burst transfer slots are being used.

If someone writes a miner that gets substantially more than 173 sols/s (i.e. ~200 sols/s) on a Rx 470 with 7Gbps memory, that would be conclusive proof that there is a way to avoid the limit as I calculated it. I've discussed my ideas with other miner developers, and I have considered various ways of reducing the external GDDR bandwidth requirements. I doubt anyone will find a serious mistake now. Something I just noticed is that the L2 cache on Polaris/Ellesmere is 2MB (512KB/controller), compared to 128KB/controller on previous chips like Tonga. That is still too small, IMO, to be of much (more than a few %) help.

Computer scientists and mathematicians have studied sorting problems for decades, and I am convinced that sorting n random records requires at least n reads plus n writes. With some work it should be possible to implement an Equihash algorithm where the average amount of data manipulated per round is 16 bytes per record. A GPU with a 32MB cache would then be limited by its cache bandwidth instead of the external GDDR5 bandwidth.
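Two of the figures in this post can be reproduced with a short sketch. The 2^21 record count is an assumption that does not appear in the thread: it is the initial list size for Equihash with n=200, k=9, and it is one plausible reading of where the 32MB cache size comes from.

```python
MEASURED_RX480_SOLS = 164  # reported by xeridea
PEAK_LIMIT_SOLS = 173      # calculated limit before refresh overhead

fraction_of_peak = MEASURED_RX480_SOLS / PEAK_LIMIT_SOLS
print(f"{fraction_of_peak * 100:.1f}% of the calculated peak")

RECORDS = 2 ** 21          # assumption: Equihash (200,9) initial list size
BYTES_PER_RECORD = 16      # average data per record per round, from the post

working_set_mib = RECORDS * BYTES_PER_RECORD / 2 ** 20
print(f"{working_set_mib:.0f} MiB working set per round")
```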

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: reb0rn21 on November 18, 2016, 08:25:21 PM

The RX 480 has 8 Gbps memory, which is faster than the RX 470's, so the speed is not over the theoretical limit; the 480 should be faster than the 470 at default clocks.

Both the 470 and 480 have 4GB and 8GB variants. The 8GB 480 gets 180 Sol/s. 4GB cards have 6.6-7 GHz effective memory speed and 8GB cards typically 8 GHz, but some manufacturers (MSI 470 ...) cheaped out. On Ethereum, the 470 and 480 are nearly identical, given the same memory.

With Sapphire Nitro cards (7 or 8 GHz), this is what I get (Sol/s):

470 4GB 155
470 8GB 160

480 4GB 165
480 8GB 178

All have memory strap mod, but default memory clock, so since the theory discussed is only based on bandwidth, it shouldn't affect the limit.

reb0rn21

legendary

Activity: 1901

Merit: 1024

The RX 480 has 8 Gbps memory, which is faster than the RX 470's, so the speed is not over the theoretical limit; the 480 should be faster than the 470 at default clocks.

xeridea

sr. member

Activity: 449

Merit: 251

Quote from: nerdralph on November 18, 2016, 09:59:31 AM

Quote from: nerdralph on November 14, 2016, 03:30:40 PM

p.s. I also have another idea that should work on 4GB cards. The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion. This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record. This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones. This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration. That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second. Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes. It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes. About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration. That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio. Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.

With Claymore 7.0 I get around 155 Sol/s on a 470 4GB, and 164 Sol/s on a 480 4GB, at default clocks. So either V7 is at/above the theoretical limit, there are additional tricks to get speed, or the theoretical limit is actually higher.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: nerdralph on November 14, 2016, 03:30:40 PM

p.s. I also have another idea that should work on 4GB cards. The miner could use 12-slot bins of 32 bytes, just like silentarmy, but use a new table every round instead of using 2 tables in a double-buffered fashion. This would use 384MB * 9 =~ 3.5GB, but then your first write to any row could write 32-bytes of dummy data along with the 32-byte collision record. This would avoid the read-before-write. You could do this with the 2nd through 6th write by filling the even slots before the odd ones. This would reduce the average IO per round to 2^20 * 3 * 64-bytes, or 192MB per round and 1.728GB per iteration. That would be a theoretical max of 130 iterations per second on a Rx 470 with a 7Gbps memory clock, which would be around 240 solutions per second. Using 93% of the theoretical limit taken from eth mining, that would give real-world performance of 225 sols/s.
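The memory footprint and IO volume in the quoted idea work out as follows. This is a sketch only; the 2^20 row count is taken from the "2^20 * 3 * 64-bytes" expression above, and MB here means 2^20 bytes, which is how the 384MB and 192MB figures come out.

```python
ROWS = 2 ** 20       # rows per table
SLOTS_PER_ROW = 12   # 12-slot bins, as in silentarmy
SLOT_BYTES = 32      # bytes per slot
ROUNDS = 9           # one fresh table per round
MB = 2 ** 20         # the post's "MB" matches 2^20 bytes

table_mb = ROWS * SLOTS_PER_ROW * SLOT_BYTES // MB  # one table
all_tables_mb = table_mb * ROUNDS                   # a new table every round
io_per_round_mb = ROWS * 3 * 64 // MB               # 3 cache-line transfers per row
io_per_iteration_mb = io_per_round_mb * ROUNDS

print(table_mb, all_tables_mb, io_per_round_mb, io_per_iteration_mb)
```

The 3456MB total is what the post rounds to roughly 3.5GB, which is why the scheme is described as fitting on 4GB cards.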

Yesterday I realized I forgot to account for the 1:1 command:burst ratio. Because of that, doing 64-byte random writes will be no faster than 128-byte random writes. It should still be faster than the 3 read + 2 write algorithm, as there will be no write-after-read delays, and the high page hit rate of reading the tables will free up some command slots for the random writes. About 4.2 cache line transfers per row is what I figure, or 269MB per round and 2.42GB per iteration. That's 173 sols/s for the Rx 470 before refresh overhead, or 160 sols/s using the 93% ratio. Claymore V6 does about 140 sols/s on the Rx 470, so there's only about 10-15% more room for improvement.
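The step from cache-line transfers per row to sols/s can be sketched like this. Two things are assumptions rather than statements from the thread: the average of about 1.88 solutions per Equihash iteration, and the unit handling, which follows the posts in treating 1000 of the 2^20-byte MB as 1 GB. The results therefore land near, not exactly on, the quoted figures.

```python
ROWS = 2 ** 20          # rows per table, from the 2^20 figure in the quoted post
ROUNDS = 9              # rounds per iteration (269MB * 9 is about 2.42GB)
CACHE_LINE_BYTES = 64
BANDWIDTH_GB_S = 224    # Rx 470 at 7 Gbps, figure used elsewhere in this thread
SOLS_PER_ITER = 1.88    # assumption: average solutions per Equihash iteration
REFRESH_FACTOR = 0.93   # the 93% ratio from the post


def peak_sols_per_second(lines_per_row: float) -> float:
    """Bandwidth-bound sols/s for a given number of cache-line transfers per row."""
    mb_per_iter = ROWS * lines_per_row * CACHE_LINE_BYTES * ROUNDS / 2 ** 20
    # Like the posts, treat 1000 of these MB as 1 GB when dividing the bandwidth.
    iters_per_second = BANDWIDTH_GB_S / (mb_per_iter / 1000)
    return iters_per_second * SOLS_PER_ITER


for lines in (3, 4.2):
    peak = peak_sols_per_second(lines)
    print(f"{lines} lines/row: {peak:.0f} sols/s peak, "
          f"{peak * REFRESH_FACTOR:.0f} sols/s after refresh overhead")
```

With 3 lines per row this comes out close to the earlier 240 and 225 sols/s estimates, and with 4.2 lines per row it comes out close to the revised 173 and 160 sols/s.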

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: Kyubey on November 15, 2016, 01:02:27 PM

So, looks like Claymore v6 reached the limit?

I was just about to post that, as I'm getting ~145 sols/s on a Rx 470 clocked at 1250/1750. I expect it will take much longer now to reach the 225 sols/s limit for 64-byte IO optimization.

Kyubey

newbie

Activity: 35

Merit: 0

So, looks like Claymore v6 reached the limit?

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: nerdralph on November 15, 2016, 12:18:49 PM

You are incorrect. Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips). The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

I read tons and tons of docs (including this whitepaper), but somehow missed that one line. Ok. Misconception clarified

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: mrb on November 15, 2016, 11:28:34 AM

Quote from: nerdralph on November 13, 2016, 08:28:53 PM

The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum burst of 8 transfers, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time. In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.

You are incorrect. Cache lines are 64-bytes long because AMD memory channels are 64-bits wide (i.e. 2 GDDR5 chips). The GCN memory controller is not 32-bits wide, with 2 consecutive bursts to fill a cache line.
"Each memory controller is 64-bits wide and composed of two independent 32-bit GDDR5 memory channels." - pg 10:
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

edit: see also slide #34
http://www.slideshare.net/DevCentralAMD/gs4106-the-amd-gcn-architecture-a-crash-course-by-layla-mah

mrb

legendary

Activity: 1512

Merit: 1028

Quote from: nerdralph on November 13, 2016, 08:28:53 PM

The reason is that AMD memory channels are 64-bits wide, and GDDR5 transfers a minimum burst of 8 transfers, so AMD GPUs transfer a minimum of 64 bytes of data to RAM at a time. In addition, if the GPU kernel writes less than 64 contiguous bytes at a time, the memory controller will read 64 bytes, modify some of the bytes, and then write 64 back to RAM.

This is actually incorrect. The GDDR4-5 channels are 32 bits wide, so the data granularity is 32 bytes. However, cache lines are 64 bytes long, so in practice you often have 2 consecutive bursts, hence 64 bytes read or written.

And for HBM memory (R9 Nano & R9 Fury) the channels are 128 bits wide, however the burst length is reduced to 2, maintaining the same granularity of 32 bytes.
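The granularity figures in this exchange all come from the same product of bus width and burst length. A minimal sketch, with a hypothetical helper name:

```python
def burst_bytes(channel_width_bits: int, burst_length: int) -> int:
    """Bytes moved by one minimum-length burst on a channel of the given width."""
    return channel_width_bits * burst_length // 8


print(burst_bytes(32, 8))   # one 32-bit GDDR5 channel, burst length 8
print(burst_bytes(64, 8))   # a 64-bit controller (two 32-bit channels), burst length 8
print(burst_bytes(128, 2))  # HBM channel as described above, burst length 2
```

The 32-bit and 64-bit cases are the two sides of the disagreement: the per-channel granularity versus the full cache line handled by one controller.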

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: tromp on November 15, 2016, 09:21:20 AM

Quote from: nerdralph on November 14, 2016, 03:30:40 PM

Quote from: Genoil on November 14, 2016, 01:59:23 PM

On a more serious note: if you state the performance is now at 80% of theoretical maximum, we're basically there, right? ETH miners also peak at about 80-85% of the theoretical maximum. Does the same rule apply here?

I still have some work to do before I write my own miner from scratch. I like to *really* understand the problem before I start writing code, and there's still some parts of the GCN architecture that I'm figuring out.

Hi, nerdralph! Would you be interested in going after my Cuckoo Cycle bounties?

https://github.com/tromp/cuckoo

After a quick scan of your readme, it doesn't look appealing. While it might be fun, it doesn't seem to have any practical application in a popular cryptocurrency like BTC, ETH, or ZEC.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: Amph on November 15, 2016, 02:41:03 AM

Quote from: nerdralph on November 14, 2016, 08:23:59 PM

Quote from: mirny on November 14, 2016, 06:23:42 PM

Do you think, you can start developing miners for Nvidia?
There is a lack of skilled and honest developers.

Kind of hard when I don't have any Nvidia cards. Even if I did, I'd stick to OpenCL, as I have no interest in learning CUDA.

Even if the potential might be higher? Or maybe you already know that this will never be the case?

Based on my limited knowledge of Nvidia GPUs, I think they have 32-byte cache lines, which should give them an advantage for equihash.
