Custom RAM Timings for GPU's with GDDR5 - DOWNLOAD LINKS - UPDATED - page 45.

laik2

sr. member

Activity: 652

Merit: 266

I was implying nerdralphs current strap.

doktor83

hero member

Activity: 2548

Merit: 626

Quote from: laik2 on March 22, 2017, 01:52:26 AM

Quote from: OhGodAGirl on March 21, 2017, 09:33:43 PM

Quote from: laik2 on March 21, 2017, 04:34:38 PM

"If u get 29MH on 2000 it means your timings are too tight and won't be as stable"

This is not

And we cannot do it

laik2

sr. member

Activity: 652

Merit: 266

Quote from: OhGodAGirl on March 21, 2017, 09:33:43 PM

Quote from: laik2 on March 21, 2017, 04:34:38 PM

"If u get 29MH on 2000 it means your timings are too tight and won't be as stable"

This is not

And we cannot do it

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: Zorg33 on March 21, 2017, 04:45:00 AM

...finally getting interesting...

somebody might share Hynix AJR and Sam datasheet perhaps..?

I wonder how the optimized custom timings are related to the given minimal values in the official datasheet?
I mean how loose are the official values, are there even room for improvement below them, or custom straps simply are adjusted to the recommended official values...?

Those datasheets are not public.

lpedretti

full member

Activity: 152

Merit: 100

Quote from: Wolf0 on March 21, 2017, 09:32:06 PM

Quote from: nerdralph on March 21, 2017, 07:43:29 PM

Quote from: Wolf0 on March 21, 2017, 05:22:02 PM

Sometimes loosening the timings is better, and clocking higher - specifically on Eth.

I've seen you mention loosening the CAS timings. I tried bumping up tCL by 1, but still get crashes on the K4G4 at 2100. So is it just loosening tCL that usually does the trick, or something else too?

You have to loosen it on the DRAM, too - you're loosening the tCL on the ASIC, but not the DRAM, throwing them off.

Weren't the straps that control the dram settings precisely? You have to change values somewhere else?

OhGodAGirl

full member

Activity: 199

Merit: 108

Look, I'm really not that interesting. Promise.

Quote from: laik2 on March 21, 2017, 04:34:38 PM

"If u get 29MH on 2000 it means your timings are too tight and won't be as stable"

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: laik2 on March 21, 2017, 04:34:38 PM

Keep in mind that there is huge diff linux/windows and amdgpu-pro <16.60, I've wrote u on zawawa's thread to update kernel to 4.10/4.11 and install only amdgpu-pro 16.60 ocl packages and their deps. Hashrate will increase +1.2MH guaranteed. Also ras/cas timings must be equally calculated for better stability. MC_SEQ_MISC_TIMING contains tRP value which combined with tCL equals tRAS. By raising memory above 2000 you should increase refresh rate and keep read/write operations at the same level or close. ARB_DRAM timings can improve stability on driver level also. If u get 29MH on 2000 it means your timings are too tight and won't be as stable as you think.

Looks like part of the problem was throttling. I bumped up the TDP & TDC to 90W/90A, and am getting just over 27.5@2000. If 16.60 really gives me a 1.2Mh boost, I'll be pretty close to 29.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: Wolf0 on March 21, 2017, 05:22:02 PM

Sometimes loosening the timings is better, and clocking higher - specifically on Eth.

I've seen you mention loosening the CAS timings. I tried bumping up tCL by 1, but still get crashes on the K4G4 at 2100. So is it just loosening tCL that usually does the trick, or something else too?

kilo17

legendary

Activity: 980

Merit: 1001

aka "whocares"

Quote from: laik2 on March 21, 2017, 05:54:17 PM

Quote from: kilo17 on March 21, 2017, 05:26:04 PM

Quote from: Wolf0 on March 21, 2017, 05:22:02 PM

Sometimes loosening the timings is better, and clocking higher - specifically on Eth.

Great minds think alike Wink

And are modest Cheesy

tRAS on hynix and samsung is 0, Only Elpida and Micron seem to have tRAS value by default.
I haven't tried adding it on hynix but it might give some improvements on samsung.

I cannot remember off the top of my head but a think I have changed it on Samsung, or maybe I changed the other values to essentially imply a value. I will log on later and look.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: kilo17 on March 21, 2017, 05:26:04 PM

Quote from: Wolf0 on March 21, 2017, 05:22:02 PM

Sometimes loosening the timings is better, and clocking higher - specifically on Eth.

Great minds think alike Wink

And are modest Cheesy

tRAS on hynix and samsung is 0, Only Elpida and Micron seem to have tRAS value by default.
I haven't tried adding it on hynix but it might give some improvements on samsung.

kilo17

legendary

Activity: 980

Merit: 1001

aka "whocares"

Quote from: Wolf0 on March 21, 2017, 05:22:02 PM

Sometimes loosening the timings is better, and clocking higher - specifically on Eth.

Great minds think alike Wink

kilo17

legendary

Activity: 980

Merit: 1001

aka "whocares"

Quote from: laik2 on March 21, 2017, 04:34:38 PM

Quote from: nerdralph on March 21, 2017, 04:20:06 PM

Quote from: nerdralph on March 21, 2017, 12:58:18 PM

I've started doing the detailed analysis on memory timing for Eth mining.

With tRRD=6, tRC=62, tCL=21 and 2000 mem clock, I can get almost 27Mh/s mining eth. Each hash takes 64 random DAG reads of 128 bytes each, and since they are random, each read should be from a different page. As well, the L2 cache hit rate should be near 0, so each DAG access requires a read from GDDR (2x32-byte reads from 2 GDDR chips).

Before reading, a page (row) has to be activated(opened), so 27Mh * 64 activate = 1728M activates per second. The Rx470/480 has 4 independent cache controllers, so a single GDDR5 chip will open 432M pages per second. With a 2Ghz mem clock, that's about 5 (4.73) clocks per activate. The closer that gets to 4, the better. Lower than 4 is not possible with Eth mining, since it takes 4 clocks to transfer 64 bytes (half a DAG entry). Note that if tRRD=6, means 6 clocks, some other timing factor is allowing the RAM to sustain <5 clocks per activate

I tried tRRD=5, and it only makes a small (~1%) improvement. That makes sense, since RRD is the delay between 2 activate commands when they are going to different banks. With only 16 banks, the memory controller has lots of opportunity to batch activate commands together in the same bank. However tRC is defined as, "The minimum time interval between two successive ACTIVE commands on the same bank". With tRC=62, the fastest access pattern would be to spread the accesses across different banks rather than batching them in the same bank.

So it seems I'm missing something about how the RAM timing. I know there are multiple clocks for GDDR5, and some run at double data rate (i.e. WCK). If tRRD=6 means six DDR address clocks, that would be 3 SDR command clocks (2Ghz is the command clock rate).

The GDDR5 specs refer to tRRDL (same bank group) and tRRDS (different bank group or bank groups disabled). Maybe what people are labeling tRRD is tRRDS, and some other data in the strap is tRRDL=4.
I tried reducing tRRD in SEQ_RAS_TIMING from 5 to 4, and don't see any improvement. I should be able to get ~29Mh with fully optimized timing at 2000 mem clock, but so far I can't get much more than 27.

Keep in mind that there is huge diff linux/windows and amdgpu-pro <16.60, I've wrote u on zawawa's thread to update kernel to 4.10/4.11 and install only amdgpu-pro 16.60 ocl packages and their deps. Hashrate will increase +1.2MH guaranteed. Also ras/cas timings must be equally calculated for better stability. MC_SEQ_MISC_TIMING contains tRP value which combined with tCL equals tRAS. By raising memory above 2000 you should increase refresh rate and keep read/write operations at the same level or close. ARB_DRAM timings can improve stability on driver level also. If u get 29MH on 2000 it means your timings are too tight and won't be as stable as you think.

The other option is to loosen them up a bit and increase the clock and add in a + offset. I am mostly referring to loosening the tRC to increase the stability and then change the tRAS and tRP accordingly. Also trying to maintain the "normal" parameters is helpful when altering the other values. For example, the common equation tRAS = tCL + tRCD + tRP -1 will help maintain the stabilty as well.

I have been messing around with these values for a while now and from my experience the best results are getting the stability I need first for a given target and then adjusting the "advanced" timings

Lastly, I haven't had much success or seen a lot of benefit changing the Read to Write delay or Write to Read delay but I am sure it can be helpful if the other timings are synced.

lexele

full member

Activity: 192

Merit: 100

Quote from: nerdralph on March 21, 2017, 04:20:06 PM

The GDDR5 specs refer to tRRDL (same bank group) and tRRDS (different bank group or bank groups disabled). Maybe what people are labeling tRRD is tRRDS, and some other data in the strap is tRRDL=4.
I tried reducing tRRD in SEQ_RAS_TIMING from 5 to 4, and don't see any improvement. I should be able to get ~29Mh with fully optimized timing at 2000 mem clock, but so far I can't get much more than 27.

On Rx 470/480 I get close to 29Mh by using 1375 straps on 2000 mem clock. By comparing the mem straps you could get some hints.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: laik2 on March 21, 2017, 04:34:38 PM

Keep in mind that there is huge diff linux/windows and amdgpu-pro <16.60, I've wrote u on zawawa's thread to update kernel to 4.10/4.11 and install only amdgpu-pro 16.60 ocl packages and their deps. Hashrate will increase +1.2MH guaranteed.

I saw that but I didn't catch that performance is significantly better on 4.10/16.60 than 4.8/16.40. I've only got 2 Rx470 cards in that rig, and have been meaning to drop in a R9 380 to test out AMDGPU-Pro's Tonga support. Maybe I'll upgrade to 16.60 as well...

laik2

sr. member

Activity: 652

Merit: 266

Quote from: nerdralph on March 21, 2017, 04:20:06 PM

Quote from: nerdralph on March 21, 2017, 12:58:18 PM

I've started doing the detailed analysis on memory timing for Eth mining.

With tRRD=6, tRC=62, tCL=21 and 2000 mem clock, I can get almost 27Mh/s mining eth. Each hash takes 64 random DAG reads of 128 bytes each, and since they are random, each read should be from a different page. As well, the L2 cache hit rate should be near 0, so each DAG access requires a read from GDDR (2x32-byte reads from 2 GDDR chips).

Before reading, a page (row) has to be activated(opened), so 27Mh * 64 activate = 1728M activates per second. The Rx470/480 has 4 independent cache controllers, so a single GDDR5 chip will open 432M pages per second. With a 2Ghz mem clock, that's about 5 (4.73) clocks per activate. The closer that gets to 4, the better. Lower than 4 is not possible with Eth mining, since it takes 4 clocks to transfer 64 bytes (half a DAG entry). Note that if tRRD=6, means 6 clocks, some other timing factor is allowing the RAM to sustain <5 clocks per activate

I tried tRRD=5, and it only makes a small (~1%) improvement. That makes sense, since RRD is the delay between 2 activate commands when they are going to different banks. With only 16 banks, the memory controller has lots of opportunity to batch activate commands together in the same bank. However tRC is defined as, "The minimum time interval between two successive ACTIVE commands on the same bank". With tRC=62, the fastest access pattern would be to spread the accesses across different banks rather than batching them in the same bank.

So it seems I'm missing something about how the RAM timing. I know there are multiple clocks for GDDR5, and some run at double data rate (i.e. WCK). If tRRD=6 means six DDR address clocks, that would be 3 SDR command clocks (2Ghz is the command clock rate).

The GDDR5 specs refer to tRRDL (same bank group) and tRRDS (different bank group or bank groups disabled). Maybe what people are labeling tRRD is tRRDS, and some other data in the strap is tRRDL=4.
I tried reducing tRRD in SEQ_RAS_TIMING from 5 to 4, and don't see any improvement. I should be able to get ~29Mh with fully optimized timing at 2000 mem clock, but so far I can't get much more than 27.

Keep in mind that there is huge diff linux/windows and amdgpu-pro <16.60, I've wrote u on zawawa's thread to update kernel to 4.10/4.11 and install only amdgpu-pro 16.60 ocl packages and their deps. Hashrate will increase +1.2MH guaranteed. Also ras/cas timings must be equally calculated for better stability. MC_SEQ_MISC_TIMING contains tRP value which combined with tCL equals tRAS. By raising memory above 2000 you should increase refresh rate and keep read/write operations at the same level or close. ARB_DRAM timings can improve stability on driver level also. If u get 29MH on 2000 it means your timings are too tight and won't be as stable as you think.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: nerdralph on March 21, 2017, 12:58:18 PM

I've started doing the detailed analysis on memory timing for Eth mining.

With tRRD=6, tRC=62, tCL=21 and 2000 mem clock, I can get almost 27Mh/s mining eth. Each hash takes 64 random DAG reads of 128 bytes each, and since they are random, each read should be from a different page. As well, the L2 cache hit rate should be near 0, so each DAG access requires a read from GDDR (2x32-byte reads from 2 GDDR chips).

Before reading, a page (row) has to be activated(opened), so 27Mh * 64 activate = 1728M activates per second. The Rx470/480 has 4 independent cache controllers, so a single GDDR5 chip will open 432M pages per second. With a 2Ghz mem clock, that's about 5 (4.73) clocks per activate. The closer that gets to 4, the better. Lower than 4 is not possible with Eth mining, since it takes 4 clocks to transfer 64 bytes (half a DAG entry). Note that if tRRD=6, means 6 clocks, some other timing factor is allowing the RAM to sustain <5 clocks per activate

I tried tRRD=5, and it only makes a small (~1%) improvement. That makes sense, since RRD is the delay between 2 activate commands when they are going to different banks. With only 16 banks, the memory controller has lots of opportunity to batch activate commands together in the same bank. However tRC is defined as, "The minimum time interval between two successive ACTIVE commands on the same bank". With tRC=62, the fastest access pattern would be to spread the accesses across different banks rather than batching them in the same bank.

So it seems I'm missing something about how the RAM timing. I know there are multiple clocks for GDDR5, and some run at double data rate (i.e. WCK). If tRRD=6 means six DDR address clocks, that would be 3 SDR command clocks (2Ghz is the command clock rate).

The GDDR5 specs refer to tRRDL (same bank group) and tRRDS (different bank group or bank groups disabled). Maybe what people are labeling tRRD is tRRDS, and some other data in the strap is tRRDL=4.
I tried reducing tRRD in SEQ_RAS_TIMING from 5 to 4, and don't see any improvement. I should be able to get ~29Mh with fully optimized timing at 2000 mem clock, but so far I can't get much more than 27.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: nerdralph on March 21, 2017, 03:01:21 PM

I've been looking at more of the strap data that most people ignore, and I'm thinking there could be some important data there. Based on some information in a PM, after MC_SEQ_MISC_TIMING2 comes MC_SEQ_MISC1, which supposedly contains mode register 1/0, and the next 32 bits is mode register 5/4. GDDR5 specifies 10 Mode Registers to define the specific mode of operation. Some of the mode register data appears to be duplicated elsewhere in the strap, while some is not. For example MC_SEQ_MISC_TIMING2 has CRC read/write latency, and that is part of MR4 (2 bits for read latency and 3 bits for write latency).

I seem to recall both Wolf and Eliovp mentioning there is some important timing in the mode registers.

If you've read the document you mentioned then you should know what's in there already.
ElioVP exposed Mode registers to you.

nerdralph

sr. member

Activity: 588

Merit: 251

I've been looking at more of the strap data that most people ignore, and I'm thinking there could be some important data there. Based on some information in a PM, after MC_SEQ_MISC_TIMING2 comes MC_SEQ_MISC1, which supposedly contains mode register 1/0, and the next 32 bits is mode register 5/4. GDDR5 specifies 10 Mode Registers to define the specific mode of operation. Some of the mode register data appears to be duplicated elsewhere in the strap, while some is not. For example MC_SEQ_MISC_TIMING2 has CRC read/write latency, and that is part of MR4 (2 bits for read latency and 3 bits for write latency).

I seem to recall both Wolf and Eliovp mentioning there is some important timing in the mode registers.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: niko2004x on March 21, 2017, 01:45:22 PM

Quote from: nerdralph on March 21, 2017, 01:06:30 PM

Quote from: doktor83 on March 21, 2017, 12:56:33 PM

yeah, just wanted to ask which one is accurate, ohgod's or niko's MISC_TIMING cause one is 31 bits the other one is 32

I think the linux kernel asic reg headers are misleading. As far as I can tell the straps are copied into 32-bit registers, and therefore the mask and offset definitions have no functional effect.
Some of the old register names can't even be found in the GDDR5 datasheets. For example you won't find tR2R in the Hynix datasheet, but you will find tCCDL and tCCDS. I suspect what the Linux headers refer to as tR2R may actually be tCCDS.

Well, you could be right.
But linked Hynix H5GQ2H24AFR (last seen in R9 290) is dated by 2009 and
linux header is more recent (although if data is up to date here is questionable)
and from my point of view it is about which one is more deprecated.

I'm confident tCCDS is part of the JEDEC GDDR5 spec, and therefore not unique to the H5GQ2H24AFR.

laik2

sr. member

Activity: 652

Merit: 266

Quote from: niko2004x on March 21, 2017, 01:46:01 PM

Quote from: laik2 on March 21, 2017, 01:44:26 PM

Quote from: niko2004x on March 21, 2017, 01:03:14 PM

Quote from: doktor83 on March 21, 2017, 12:56:33 PM

yeah, just wanted to ask which one is accurate, ohgod's or niko's MISC_TIMING cause one is 31 bits the other one is 32

3 highest bits are unused anyway (so difference between 31 and 32 is irrelevant).

And whats the correct structure for MC_SEQ_MISC_TIMING according to your decoding tool for RX series?

As stated in atom_rom_timings.py in git.

Oh Gosh...I'm an idiot I forgot u've made it public Cheesy

Topic: Custom RAM Timings for GPU's with GDDR5 - DOWNLOAD LINKS - UPDATED - page 45. (Read 155721 times)