
Topic: Assessing the impact of TLB thrashing on memory-hard algorithms

sr. member
Activity: 308
Merit: 250
Do you have a binary for the variable chunk size? I wonder if a future ethminer could also let the user choose the chunk size for optimization.
sr. member
Activity: 438
Merit: 250
I've modified the source code a bit to allocate the DAG in user-definable 256MB chunks. Now it should be possible for AMD cards to use more RAM. On my GTX780, the hashrate curve is just about the same (a tiny bit slower) when using 256MB chunks.
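A minimal sketch of what such chunked allocation might look like on the OpenCL host side (hypothetical names and sizes, not the actual ethminer source): instead of one monolithic clCreateBuffer call, the DAG is split across several smaller buffers that the driver can place independently.

Code:
// Hypothetical sketch: allocate a large DAG as a vector of fixed-size
// OpenCL buffers instead of one monolithic allocation.
#include <CL/cl.h>
#include <algorithm>
#include <vector>

std::vector<cl_mem> allocDagChunks(cl_context ctx, size_t dagBytes,
                                   size_t chunkBytes /* e.g. 256 << 20 */)
{
    std::vector<cl_mem> chunks;
    for (size_t off = 0; off < dagBytes; off += chunkBytes) {
        size_t sz = std::min(chunkBytes, dagBytes - off);
        cl_int err = CL_SUCCESS;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sz, nullptr, &err);
        if (err != CL_SUCCESS)
            break;  // allocation failed: stop here, caller checks total size
        chunks.push_back(buf);
    }
    return chunks;
}

The kernel would then take each chunk as a separate argument and select the right one from the upper bits of the DAG index.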
sr. member
Activity: 914
Merit: 250
Making Smart Money Work
I'm still trying to get my 7850 2GB to go above 1280MB, but I'm getting the out-of-memory error.
Even with
set GPU_MAX_ALLOC_PERCENT=100 / GPU_MAX_ALLOC_PERCENT=95
set GPU_MAX_HEAP_SIZE=100
set GPU_USE_SYNC_OBJECTS=1

Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             130.915            17.1593
256             130.547            17.111
384             129.763            17.0083
512             129.429            16.9645
640             129.359            16.9553
768             129.501            16.9739
896             130.307            17.0796
1,024           130.303            17.0791
1,152           113.466            14.8722
1,280           103.826            13.6086
But it does seem to drop hard from 1,024 to 1,280.
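That knee right after 1,024MB would be consistent with simple TLB-reach arithmetic. Purely as a hedged illustration (the entry count and page size below are made-up numbers; real GPU TLB parameters are undocumented): a TLB with E entries mapping P-byte pages covers E * P bytes before misses start.

Code:
// Back-of-the-envelope TLB reach with purely hypothetical parameters.
#include <cstdio>

int main()
{
    const long long entries  = 512;        // assumed TLB entry count
    const long long pageSize = 2LL << 20;  // assumed 2MB pages
    // Reach = entries * pageSize; beyond this, random accesses start
    // missing the TLB and each read pays for a page-table walk.
    printf("TLB reach: %lld MB\n", (entries * pageSize) >> 20);  // prints 1024
    return 0;
}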

Chunked (512) version below:
Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             130.953            17.1643
256             130.552            17.1117
384             130.483            17.1027
512             129.715            17.002
640             160.314            21.0126
768             166.186            21.7823
896             162.538            21.3042
1,024           166.417            21.8126
1,152           135.096            17.7073
1,280           38.5741            5.05599
1,408           23.306             3.05476
1,536           17.4977            2.29346
1,664           12.6435            1.65721
1,792           12.2781            1.60932
1,920           10.8921            1.42764

Chunked (256) version below:
Code:
DAG size (MB)   Bandwidth (GB/s)   Hashrate (MH/s)
128             131.008            17.1715
256             130.584            17.1158
384             124.342            16.2977
512             114.388            14.9931
640             178.814            23.4376
768             160.401            21.0241
896             166.627            21.8401
1,024           156.984            20.5762
1,152           141.14             18.4996
1,280           123.989            16.2515
1,408           122.695            16.0819
1,536           51.0244            6.68787
1,664           29.0346            3.80563
1,792           21.4296            2.80881
1,920           17.2236            2.25754
member
Activity: 81
Merit: 1002
It was only the wind.
Interesting.
Please, can you share links to all discussions?

Considering OpenCL is possibly higher level than GL ever was, I'm quite surprised someone pinpointed a hardware-construct issue, especially as GPUs are traditionally managed and there's a huge gap between different OSes, which in my experience should not be there for HW constructs... odd.

I have a 1GiB card, so there's little I can do. I will try to take a look in the next few days if I can set aside some time. Initial analysis in CodeXL gave me inconsistent results.

Have you investigated different access patterns?

What different access patterns? The ones in Eth are pseudorandom over the whole DAG file, IIRC.
sr. member
Activity: 438
Merit: 250
This thread on the CUDA forums is most relevant:
https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/
Somebody over there (@allnamac) wrote a completely independent test that verified my findings.

This one is less interesting, but it shows the problem affects both Nvidia and AMD:
http://gathering.tweakers.net/forum/list_messages/1659186



hero member
Activity: 672
Merit: 500
Interesting.
Please, can you share links to all discussions?

Considering OpenCL is possibly higher level than GL ever was, I'm quite surprised someone pinpointed a hardware-construct issue, especially as GPUs are traditionally managed and there's a huge gap between different OSes, which in my experience should not be there for HW constructs... odd.

I have a 1GiB card, so there's little I can do. I will try to take a look in the next few days if I can set aside some time. Initial analysis in CodeXL gave me inconsistent results.

Have you investigated different access patterns?
sr. member
Activity: 438
Merit: 250
During the development of the CUDA miner for Ethereum, I ran into an issue where the hashrate on the GTX750Ti dramatically drops when the size of the memory buffer the miner operates on exceeds a certain threshold (1GB on Win7/Linux, 512MB on Win8/10). After a long discussion on the CUDA forums, one of the designers of CUDA weighed in and identified the issue as TLB thrashing. I'm currently conducting a bit of research on the subject and have created a simple test program that measures these effects. It simulates the 'dagger' part of the Ethereum algorithm at different memory buffer (DAG) sizes and writes the results to a CSV file. So far, I have concluded that it is not an Nvidia-only issue; it manifests on AMD hardware as well. And apparently it is not an ETH-only issue either: I've got some reports in from scrypt-jane miners as well.
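To give a feel for the access pattern being measured, here is an illustrative dagger-style kernel (the index mixing and constants are made up for the sketch; this is not the actual dagSimCL source). Each work item chases pseudorandom indices across the whole buffer, so once the buffer outgrows the TLB's reach, nearly every read costs a page-table walk:

Code:
// Hypothetical OpenCL C kernel, embedded as a C++ raw string, that mimics
// a dagger-style pseudorandom walk over the DAG buffer.
static const char* kDagSimKernel = R"CLC(
__kernel void dagsim(__global const uint4* dag, uint dagWords,
                     __global uint* out)
{
    uint gid  = (uint)get_global_id(0);
    uint idx  = gid * 2654435761u;         // cheap integer hash to seed the walk
    uint4 acc = (uint4)(0u);
    for (int i = 0; i < 64; ++i) {
        idx  = idx * 1103515245u + 12345u; // LCG: next pseudorandom index
        acc ^= dag[idx % dagWords];        // scattered read: TLB-unfriendly
    }
    out[gid] = acc.x ^ acc.y ^ acc.z ^ acc.w;
}
)CLC";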

I'm currently looking for as many hardware/OS combinations as possible, to come to a recommendation for miners as well as for designers of new algos. Below is an example of ETH hashrate on a GTX780 on Windows with increasing buffer size (in MB):

[chart: ETH hashrate on GTX780 vs. buffer size]

The test program can be downloaded from https://github.com/Genoil/dagSimCL. Win-64 binaries are in the x64/Release folder. You can also build it yourself, but the supplied MSVC project files only target Nvidia OpenCL. On AMD hardware you may want to run

Code:
set GPU_MAX_ALLOC_PERCENT=100

first. By default, the program tries to use all of your GPU's RAM, up to 4096MB. If you have less system RAM, you can add a command-line parameter to test up to a lower maximum:

Code:
dagsimCL.exe 2048

If you have multiple GPUs, you need to add a second parameter:

Code:
dagsimCL.exe 4096 1

If you have multiple OpenCL platforms installed:

Code:
dagsimCL.exe 4096 0 1

I would be very grateful if you could participate in this bit of research and possibly discuss any workarounds. Thanks!

P.S. Note that the hashrates achieved with the test program can be significantly higher than what you actually get with ethminer. This is because it only simulates the Dagger stages, not the Keccak stages.

