
Topic: AMD core/memory clock ratio fundamental limit 0.5 (Read 2159 times)

legendary
Activity: 1120
Merit: 1000
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every manufacturer supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300 MHz, faster page activation times than stock make no difference in the data xfer rate.

Only when the time for 16 burst xfers is less than the time to activate a page will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750 MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

One of the few, except for people like me who have mapped out the ENTIRE BIOS with little tools like atomworks :P

sr. member
Activity: 588
Merit: 251
The point however is that your "theoretical optimal" ratio has ZERO basis in the real world for some reason, as dropping my R9 290s to 625/1250 or my R9 280x cards to 750/1500 would result in a HUGE hashrate drop from their currently optimal 1100/1250 and 1100/1500 respective settings, which indicates your theory is incomplete or incorrect.

Congrats on knocking down the straw man.  I said the 0.5 clock ratio limit is a fundamental limit below which it would be impossible to max out the memory bandwidth.  In other words, even if you wrote the most efficient OpenCL (or GCN assembler) kernel possible, you couldn't drop the core clock lower than 1/2 the memory clock without impacting the memory bandwidth.
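To make the claim concrete, here is a minimal sketch that treats the 0.5 ratio purely as a lower bound on the core clock. The function names are made up for illustration, and the 0.5 figure is the poster's claim, not something verified here. It says nothing about where the best hashrate sits, only where memory bandwidth would start to be capped.

```python
# The 0.5 ratio is the limit claimed in the post above, not a measured value.
MIN_CORE_TO_MEM_RATIO = 0.5


def min_core_clock_mhz(mem_clock_mhz, ratio=MIN_CORE_TO_MEM_RATIO):
    """Lowest core clock (MHz) that stays at or above the claimed ratio."""
    return mem_clock_mhz * ratio


def is_below_limit(core_mhz, mem_mhz, ratio=MIN_CORE_TO_MEM_RATIO):
    """True if the core clock is under the claimed bound for this memory clock."""
    return core_mhz < min_core_clock_mhz(mem_mhz, ratio)


# Memory clocks mentioned in the thread (R9 290, R9 280x, RX 480).
for mem in (1250, 1500, 2000):
    print(f"mem {mem} MHz -> core must be at least {min_core_clock_mhz(mem):.0f} MHz")

# 1100/1250 is well above the bound, so the bound alone predicts nothing about it.
print(is_below_limit(1100, 1250))
```

Settings such as 1100/1250 sit far above the bound, which is why both sides of the argument can be right at the same time: the bound is a floor, not an optimum.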
legendary
Activity: 1498
Merit: 1030

I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

 Perhaps not on an RX series card, but a lot of us run the older R9 2xx and 3xx series cards (and a few of us run even older HD 78xx/79xx series cards) that can't DO 1750 MHz memory and in some cases can't even do 1500.

legendary
Activity: 1498
Merit: 1030
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every manufacturer supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300 MHz, faster page activation times than stock make no difference in the data xfer rate.

Only when the time for 16 burst xfers is less than the time to activate a page will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750 MHz, but not on an R9 290 at 1250.


 Try the real world - the BIOSes by TheStilt were a big deal mostly because they optimised memory timings and gave VERY NOTICEABLE improvements in actual memory throughput.
 Even before I played with the core clocks, just flashing the BIOS gave me a noticeable hashrate improvement on my set of R9 290s on ETH, while DROPPING the memory clock from the previous optimal 1350 to 1250 after the BIOS change increased hashrate as well.
 *THEN* I started playing with the core clock - and I found the optimal was to max it as high as it could go within the limits of overheating and stability.


 The point however is that your "theoretical optimal" ratio has ZERO basis in the real world for some reason, as dropping my R9 290s to 625/1250 or my R9 280x cards to 750/1500 would result in a HUGE hashrate drop from their currently optimal 1100/1250 and 1100/1500 respective settings, which indicates your theory is incomplete or incorrect.


 I can't speak to the RX series cards as I don't own any of those.
legendary
Activity: 1260
Merit: 1006
Mine for a Bit
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every manufacturer supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300 MHz, faster page activation times than stock make no difference in the data xfer rate.

Only when the time for 16 burst xfers is less than the time to activate a page will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750 MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)

Here is my 7 GPU RIG.

More details at: https://bitcointalksearch.org/topic/gpu-mining-rig-eth-zec-6-or-7-gpu-1676763



1015 H/s
7 MSI RX470s
968 Watts at the wall
68 ave Watts (MSI Afterburner)
57C ave GPU Temp (MSI Afterburner)
49C CPU Temp
1500 Strap
MSI Afterburner settings: 1950 Memory Clock, 1200 Core Clock

My main goal is profitability, measured in watts per H/s (lower is better).
Previously I was at 0.97 W per H/s.
I am now at 0.95 W per H/s.

If there is a better setting that will increase efficiency, and hence profitability, I would love to know it. The same goes if I am doing something that will hurt my equipment in the long run.

More details at: https://bitcointalksearch.org/topic/gpu-mining-rig-eth-zec-6-or-7-gpu-1676763
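The efficiency figure quoted for this rig can be checked with a few lines. This is just the arithmetic from the numbers posted above (968 W at the wall, 1015 H/s); the function name is made up for illustration.

```python
def watts_per_hash(wall_watts, hashrate_hs):
    """Wall power divided by hashrate, in W per H/s (lower is better)."""
    return wall_watts / hashrate_hs


# Figures as posted for the 7 x RX470 rig.
efficiency = watts_per_hash(968, 1015)
print(f"{efficiency:.2f} W per H/s")
```

Because the wall reading includes CPU, motherboard, and PSU losses, this number is a whole-rig figure and is not directly comparable to per-GPU power readings from MSI Afterburner.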
legendary
Activity: 2182
Merit: 1401
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every manufacturer supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300 MHz, faster page activation times than stock make no difference in the data xfer rate.

Only when the time for 16 burst xfers is less than the time to activate a page will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750 MHz, but not on an R9 290 at 1250.


I have read almost the entire Hynix GDDR5 data sheet, and am probably one of only a handful of people who have decoded the entire memory strap region of the ATOM BIOS. Of course timings wouldn't matter for slower clocks...but no one on these forums runs their memory clocks at 1250 or 1750 ;)
legendary
Activity: 1260
Merit: 1006
Mine for a Bit
I previously posted that for modern AMD GPUs like Tonga and Polaris that have a 256-bit memory interface, the minimum core/memory clock ratio for eth mining is ~0.56.  After digging through lots of GCN architecture documents for my analysis of ZEC mining, I have determined that the fundamental limit for fully utilizing the GDDR5 memory is 0.5.  This means on an RX 480 with memory clocked at 2 GHz, the core has to be at least 1 GHz in order to fully utilize the GDDR5 memory bandwidth.  The reason is that the L2 cache can transfer at most 64 bytes per core clock.  GDDR5 transfers 4 bits per pin per clock cycle in a 2 cycle burst.  The GDDR5 memory interface is 32 bits wide, so each chip xfers 32 bytes per 2 clocks.  Tonga and Polaris use 8 GDDR5 chips for 8 x 32 = 256 bits on the external memory bus.



Do you assume the memory clock with stock timings, or is a tighter strap timing mod not relevant to the ratio?

My RX470-4G is running at 1150 MHz / 1975 MHz with the 1500 MHz strap memory timings, mining ZEC at 140 Sol/s using Claymore.

Based on the ratio that you are suggesting, are you saying that 1200 core / 1950 mem or 1270 core / 1750 mem would be the better ratio for the MSI RX470 X GPU?  They both net the same results in H/s.
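For reference, the two settings in the question work out as follows. This is a quick sketch that only computes the core/memory ratio and compares it with the 0.5 and ~0.56 figures quoted above; the names are made up for illustration, and it does not predict hashrate.

```python
# Ratios quoted in the thread: 0.5 (claimed hard limit) and ~0.56 (earlier eth figure).
CLAIMED_LIMIT = 0.5
EARLIER_ETH_RATIO = 0.56


def core_mem_ratio(core_mhz, mem_mhz):
    """Core clock divided by memory clock."""
    return core_mhz / mem_mhz


# The two MSI RX470 settings being compared.
settings = {
    "1200 core / 1950 mem": (1200, 1950),
    "1270 core / 1750 mem": (1270, 1750),
}

for label, (core, mem) in settings.items():
    ratio = core_mem_ratio(core, mem)
    print(f"{label}: ratio {ratio:.3f}, "
          f"above 0.5: {ratio > CLAIMED_LIMIT}, "
          f"above 0.56: {ratio > EARLIER_ETH_RATIO}")
```

Both settings are above either figure, so by the poster's reasoning neither one should be capped by the ratio itself.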
legendary
Activity: 1176
Merit: 1015

none of the LINUX miners for ZEC are even close to competitive on hashrate with the Windows ones.


What's wrong with optiminer?
sr. member
Activity: 588
Merit: 251
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.

Try reading a GDDR5 datasheet.  Every manufacturer supports at least 8 concurrent open pages.  2 bursts (4 clocks) from the same page are required to max the I/O bandwidth due to the 1:1 command:burst ratio.  In the context of the Hawaii cards with a memory clock of 1250 or 1300 MHz, faster page activation times than stock make no difference in the data xfer rate.

Only when the time for 16 burst xfers is less than the time to activate a page will page activate timing impact bandwidth for the minimum 64-byte xfers.  You'll run into that issue with an RX 470 at 1750 MHz, but not on an R9 290 at 1250.
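The 16-burst argument can be turned into numbers with a short sketch. It assumes 2 clocks per burst (from "2 bursts (4 clocks)") and 16 bursts (8 open pages x 2 bursts each), both taken from the post above. The page activation time is left as a parameter: the 20 ns used below is a HYPOTHETICAL placeholder, not a datasheet value, picked only so that both outcomes show up.

```python
def burst_window_ns(mem_clock_mhz, bursts=16, clocks_per_burst=2):
    """Time in ns to issue `bursts` bursts back to back at the given memory clock."""
    clocks = bursts * clocks_per_burst
    return clocks / mem_clock_mhz * 1000.0  # one clock at f MHz lasts 1000/f ns


def activation_can_limit(mem_clock_mhz, t_activate_ns):
    """True if 16 bursts finish before a page activation would complete."""
    return burst_window_ns(mem_clock_mhz) < t_activate_ns


# HYPOTHETICAL activation time, not from any datasheet; substitute the real value.
HYPOTHETICAL_T_ACTIVATE_NS = 20.0

for clock in (1250, 1750):
    window = burst_window_ns(clock)
    limited = activation_can_limit(clock, HYPOTHETICAL_T_ACTIVATE_NS)
    print(f"{clock} MHz: 16 bursts take {window:.1f} ns, activation-limited: {limited}")
```

The only solid part here is the burst window itself (32 clocks at the memory clock). Whether activation timing matters at a given clock depends entirely on the real activation time for the memory chips in question.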
legendary
Activity: 2182
Merit: 1401
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.


Of course it does...if a memory controller has to wait more clock cycles before it can read or write to a memory bank, its effective I/O bandwidth is reduced.
sr. member
Activity: 588
Merit: 251
I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.

GDDR5 RAS/CAS timings have nothing to do with the bandwidth of the data xfer.
legendary
Activity: 1498
Merit: 1030
Doesn't match up to reality on the R9 290.

 Best results I get out of my R9 290s on ETH (running ANY of the ETH miners) have been with 1100 core / 1250 memory clocks, at least since I flashed them with one of TheStilt's BIOSes (before that they overheated long before I could get them to 1000).

 I can't speak to ZEC on them as ALL of my AMD miners are on LINUX and none of the LINUX miners for ZEC are even close to competitive on hashrate with the Windows ones.


 I see the same thing on my R9 280x, best ETH results at 1100/1500.

 In both cases, downclocking the core OR memory clock results in a decrease in hashrate.



 I suspect the situation is a LOT more complex than you believe, due to interaction of multiple compute units trying to access memory at the same time, wait state timings on the actual memory, etc.


legendary
Activity: 1470
Merit: 1024
Rx 480 1100/2000 = 135Sol
Rx 480 1300/2000=155Sol

This is not due to any fundamental limit of the GPU architecture, just how well optimized the miner code is.


I don't know if you are able to inspect the optiminer code, but it seems to me it is better coded, since it gives the same hashrate as Claymore without bottlenecking my CPU and without the insane watt usage and the heat that comes with it. Also, I was getting better results at low core, high mem. Is Claymore pushing intensity to compete atm?
hero member
Activity: 1400
Merit: 505
I previously posted that for modern AMD GPUs like Tonga and Polaris that have a 256-bit memory interface, the minimum core/memory clock ratio for eth mining is ~0.56.  After digging through lots of GCN architecture documents for my analysis of ZEC mining, I have determined that the fundamental limit for fully utilizing the GDDR5 memory is 0.5.  This means on an RX 480 with memory clocked at 2 GHz, the core has to be at least 1 GHz in order to fully utilize the GDDR5 memory bandwidth.  The reason is that the L2 cache can transfer at most 64 bytes per core clock.  GDDR5 transfers 4 bits per pin per clock cycle in a 2 cycle burst.  The GDDR5 memory interface is 32 bits wide, so each chip xfers 32 bytes per 2 clocks.  Tonga and Polaris use 8 GDDR5 chips for 8 x 32 = 256 bits on the external memory bus.



Do you assume the memory clock with stock timings, or is a tighter strap timing mod not relevant to the ratio?

My RX470-4G is running at 1150 MHz / 1975 MHz with the 1500 MHz strap memory timings, mining ZEC at 140 Sol/s using Claymore.
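The memory-side arithmetic in the quoted post can be reproduced in a few lines. This sketch uses only the figures stated there (4 bits per pin per clock, 2-clock bursts, 32 pins per chip, 8 chips) and computes peak bus bandwidth. It deliberately does not model the L2 side or strap timings, so it is a ceiling under those assumptions rather than a prediction; the names are made up for illustration.

```python
BITS_PER_PIN_PER_CLOCK = 4   # GDDR5 data rate relative to the memory clock
CLOCKS_PER_BURST = 2         # one burst lasts 2 memory clocks
PINS_PER_CHIP = 32           # 32-bit interface per chip
CHIPS = 8                    # Tonga / Polaris: 8 x 32 = 256-bit bus


def bytes_per_chip_per_burst():
    """Bytes one chip moves in a single burst."""
    return PINS_PER_CHIP * BITS_PER_PIN_PER_CLOCK * CLOCKS_PER_BURST // 8


def bus_bytes_per_mem_clock(chips=CHIPS):
    """Bytes the whole bus moves per memory clock."""
    return chips * bytes_per_chip_per_burst() // CLOCKS_PER_BURST


def peak_bandwidth_gb_s(mem_clock_mhz, chips=CHIPS):
    """Peak bus bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bus_bytes_per_mem_clock(chips) * mem_clock_mhz * 1e6 / 1e9


print(bytes_per_chip_per_burst())      # bytes per chip per 2-clock burst
print(bus_bytes_per_mem_clock())       # bytes per memory clock on a 256-bit bus
print(peak_bandwidth_gb_s(2000))       # RX 480 example at a 2 GHz memory clock
```

Tighter strap timings do not change this ceiling; at most they change how close a real workload gets to it, which is the part the ratio argument leaves open.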
sr. member
Activity: 588
Merit: 251
Rx 480 1100/2000 = 135Sol
Rx 480 1300/2000=155Sol

This is not due to any fundamental limit of the GPU architecture, just how well optimized the miner code is.
sr. member
Activity: 588
Merit: 251
What about the Tahiti and Hawaii GPUs? What's the ratio there?

Haven't looked at the Tahiti specs.  Hawaii uses the same L2 architecture as Tonga, except with twice the number of memory channels.  Therefore the same 0.5x ratio would apply.  I know an R9 290 won't get the maximum hashrate for eth with a core clock at 1/2 the memory clock, but it likely would if it had more compute units.  Since an R9 380 with 28 CUs gets the max performance on eth at the 0.56 ratio, a Hawaii GPU with 56 CUs would probably max out at the same 0.56 ratio.

I also suspect a more efficient miner (i.e. Wolf's miner written in GCN assembler) would make better use of the 40 CUs on an R9 290, and therefore could still get 36-37 MH/s mining eth with a 1250 MHz memory clock and a 700-750 MHz core clock.
full member
Activity: 169
Merit: 100
Rx 480 1100/2000 = 135Sol
Rx 480 1300/2000=155Sol
legendary
Activity: 3808
Merit: 1723
What about the Tahiti and Hawaii GPUs? What's the ratio there?

sr. member
Activity: 588
Merit: 251
Can you comment on the fact that with the latest optiminer/Claymore releases it seems better to have a higher core / lower mem? Reported by me and several other people.

This just indicates their kernel code could be more efficient, i.e. better optimized.

legendary
Activity: 1470
Merit: 1024
Can you comment on the fact that with the latest optiminer/Claymore releases it seems better to have a higher core / lower mem? Reported by me and several other people.