
Topic: Assessing the impact of TLB thrashing on memory-hard algorithms

sr. member
Activity: 308
Merit: 250
Do you have a binary for the variable chunk size? I wonder if a future ethminer could also let the user choose the chunk size for optimization.
sr. member
Activity: 438
Merit: 250
I've modified the source code a bit to allocate the DAG in user-definable 256MB chunks. Now it should be possible for AMD cards to use more RAM. On my GTX780, the hashrate curve is just about the same (a tiny bit slower) when using 256MB chunks.
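A minimal sketch of what such chunked allocation might look like on the OpenCL host side (hypothetical names and sizes, not the actual ethminer source): instead of one monolithic clCreateBuffer call, the DAG is split across several smaller buffers that the driver can place independently.

Code:
// Hypothetical sketch: allocate a large DAG as a vector of fixed-size
// OpenCL buffers instead of one monolithic allocation.
#include <CL/cl.h>
#include <algorithm>
#include <vector>

std::vector<cl_mem> allocDagChunks(cl_context ctx, size_t dagBytes,
                                   size_t chunkBytes /* e.g. 256 << 20 */)
{
    std::vector<cl_mem> chunks;
    for (size_t off = 0; off < dagBytes; off += chunkBytes) {
        size_t sz = std::min(chunkBytes, dagBytes - off);
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sz, nullptr, &err);
        if (err != CL_SUCCESS)
            break;  // allocation failed: stop here, caller checks total size
        chunks.push_back(buf);
    }
    return chunks;
}

The kernel would then take each chunk as a separate argument and select the right one from the upper bits of the DAG index.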
sr. member
Activity: 914
Merit: 250
Making Smart Money Work
I'm still trying to get my 7850 2GB to go above 1280MB, but I'm getting the out-of-memory error.
Even with
set GPU_MAX_ALLOC_PERCENT=100 / GPU_MAX_ALLOC_PERCENT=95
set GPU_MAX_HEAP_SIZE=100
set GPU_USE_SYNC_OBJECTS=1

Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             130.915            17.1593
256             130.547            17.111
384             129.763            17.0083
512             129.429            16.9645
640             129.359            16.9553
768             129.501            16.9739
896             130.307            17.0796
1,024           130.303            17.0791
1,152           113.466            14.8722
1,280           103.826            13.6086
But it does seem to drop hard from 1,024 to 1,280.
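That knee right after 1,024MB would be consistent with simple TLB-reach arithmetic. Purely as a hedged illustration (the entry count and page size below are made-up numbers; real GPU TLB parameters are undocumented): a TLB with E entries mapping P-byte pages covers E * P bytes before misses start.

Code:
// Back-of-the-envelope TLB reach with purely hypothetical parameters.
#include <cstdio>

int main()
{
    const long long entries  = 512;        // assumed TLB entry count
    const long long pageSize = 2LL << 20;  // assumed 2MB pages
    // Reach = entries * pageSize; beyond this, random accesses start
    // missing the TLB and each read pays for a page-table walk.
    printf("TLB reach: %lld MB\n", (entries * pageSize) >> 20);  // prints 1024
    return 0;
}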

Chunked (512) version below:
Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             130.953            17.1643
256             130.552            17.1117
384             130.483            17.1027
512             129.715            17.002
640             160.314            21.0126
768             166.186            21.7823
896             162.538            21.3042
1,024           166.417            21.8126
1,152           135.096            17.7073
1,280           38.5741            5.05599
1,408           23.306             3.05476
1,536           17.4977            2.29346
1,664           12.6435            1.65721
1,792           12.2781            1.60932
1,920           10.8921            1.42764

Chunked (256) version below:
Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             131.008            17.1715
256             130.584            17.1158
384             124.342            16.2977
512             114.388            14.9931
640             178.814            23.4376
768             160.401            21.0241
896             166.627            21.8401
1,024           156.984            20.5762
1,152           141.14             18.4996
1,280           123.989            16.2515
1,408           122.695            16.0819
1,536           51.0244            6.68787
1,664           29.0346            3.80563
1,792           21.4296            2.80881
1,920           17.2236            2.25754
member
Activity: 81
Merit: 1002
It was only the wind.
Interesting.
Please, can you share links to all discussions?

Considering OpenCL is possibly higher level than GL ever was, I'm quite surprised someone pinpointed a hardware-construct issue, especially as GPUs are traditionally managed and there's a huge gap between different OSes, which in my experience should not be there for HW constructs... odd.

I have a 1GiB card, so there's little I can do. I will try to take a look in the next few days if I can set aside some time. Initial analysis in CodeXL gave me inconsistent results.

Have you investigated different access patterns?

What different access patterns? The ones in Eth are pseudorandom over the whole DAG file, IIRC.
sr. member
Activity: 438
Merit: 250
This thread on the CUDA forums is most relevant:
https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/
Somebody over there (@allnamac) wrote a completely independent test that verified my findings.

This one is less interesting, but it shows the problem affects both Nvidia and AMD:
http://gathering.tweakers.net/forum/list_messages/1659186



hero member
Activity: 672
Merit: 500
Interesting.
Please, can you share links to all discussions?

Considering OpenCL is possibly higher level than GL ever was, I'm quite surprised someone pinpointed a hardware-construct issue, especially as GPUs are traditionally managed and there's a huge gap between different OSes, which in my experience should not be there for HW constructs... odd.

I have a 1GiB card, so there's little I can do. I will try to take a look in the next few days if I can set aside some time. Initial analysis in CodeXL gave me inconsistent results.

Have you investigated different access patterns?
sr. member
Activity: 438
Merit: 250
During the development of the CUDA miner for Ethereum, I ran into an issue where the hashrate on the GTX750Ti dramatically drops when the size of the memory buffer the miner operates on exceeds a certain threshold (1GB on Win7/Linux, 512MB on Win8/10). After a long discussion on the CUDA forums, one of the designers of CUDA weighed in and identified the issue as TLB thrashing. I'm currently conducting a bit of research on the subject and have created a simple test program that measures these effects. It simulates the 'dagger' part of the Ethereum algorithm at different memory buffer (DAG) sizes and writes the results to a CSV file. So far, I have concluded that it is not an Nvidia-only issue; it manifests on AMD hardware as well. And apparently it is not an ETH-only issue either: I've got some reports in from scrypt-jane miners as well.
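To give a feel for the access pattern being measured, here is an illustrative dagger-style kernel (the index mixing and constants are made up for the sketch; this is not the actual dagSimCL source). Each work item chases pseudorandom indices across the whole buffer, so once the buffer outgrows the TLB's reach, nearly every read costs a page-table walk:

Code:
// Hypothetical OpenCL C kernel, embedded as a C++ raw string, that mimics
// a dagger-style pseudorandom walk over the DAG buffer.
static const char* kDagSimKernel = R"CLC(
__kernel void dagsim(__global const uint4* dag, uint dagWords,
                     __global uint* out)
{
    uint gid  = (uint)get_global_id(0);
    uint idx  = gid * 2654435761u;         // cheap integer hash to seed the walk
    uint4 acc = (uint4)(0u);
    for (int i = 0; i < 64; ++i) {
        idx  = idx * 1103515245u + 12345u; // LCG: next pseudorandom index
        acc ^= dag[idx % dagWords];        // scattered read: TLB-unfriendly
    }
    out[gid] = acc.x ^ acc.y ^ acc.z ^ acc.w;
}
)CLC";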

I'm currently looking for as many hardware/OS combinations as possible, to come to a recommendation for miners as well as for designers of new algos. Below is an example of ETH hashrate on a GTX780 on Windows with increasing buffer size (in MB):

[chart: ETH hashrate on GTX780 vs. buffer size]

The test program can be downloaded from https://github.com/Genoil/dagSimCL. Win-64 binaries are in the x64/Release folder. You can also build it yourself, but the supplied MSVC project files only target Nvidia OpenCL. On AMD hardware you may want to run

Code:
set GPU_MAX_ALLOC_PERCENT=100

first. By default, the program tries to use all of your GPU's RAM, up to 4096MB. If you have less system RAM, you can add a command-line parameter to test up to a lower maximum:

Code:
dagsimCL.exe 2048

If you have multiple GPUs, you need to add a second parameter:

Code:
dagsimCL.exe 4096 1

If you have multiple OpenCL platforms installed:

Code:
dagsimCL.exe 4096 0 1

I would be very grateful if you could participate in this bit of research and possibly discuss any workarounds. Thanks!

P.S. Note that the hashrates achieved with the test program can be significantly higher than what you actually get with ethminer. This is because it only simulates the Dagger stages, not the Keccak stages.

