
Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 153. (Read 214410 times)

sr. member
Activity: 728
Merit: 304
Miner Developer
Taking a deep breath and taking two steps back.


That's hard to do, though... I'm getting 260 sol/s with a stock RX 480 right now.
Got to check everything one more time.
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Taking a deep breath and taking two steps back.
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.


Pretty sure both Optiminer and Claymore are running two kernel threads.

It looks that way now... I think the Data Share unit is overloaded with my current implementation of Equihash.
What to do, what to do...
legendary
Activity: 2156
Merit: 1400
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.


Pretty sure both Optiminer and Claymore are running two kernel threads.
sr. member
Activity: 588
Merit: 251
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.
hero member
Activity: 2548
Merit: 626
So good to see how much you love doing this, zawawa! :)
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...
sr. member
Activity: 728
Merit: 304
Miner Developer
Well, it turned out that multithreading wasn't working, so I'm still working on the kernel patch.
The miner is running about 15% faster with a single thread, so it's very promising.
I should be able to get rid of the patch entirely by hooking system calls to the driver later.
This is no easy stuff!
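The "hooking system calls to the driver" idea could take the shape of an LD_PRELOAD interposer in user space. A rough sketch, not the actual implementation; the amdgpu-specific matching is left as a comment because the ioctl numbers and struct layouts are driver-version specific:

```c
/* Sketch of an LD_PRELOAD interposer: override ioctl() so requests
 * heading to the amdgpu DRM device could be inspected and their GDS
 * parameters rewritten before reaching the kernel. Build with
 * `gcc -shared -fPIC hook.c -o hook.so -ldl` and run the miner with
 * `LD_PRELOAD=./hook.so`. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);
    va_list ap;
    void *arg;

    /* ioctl's third argument is conventionally a single pointer/long */
    va_start(ap, request);
    arg = va_arg(ap, void *);
    va_end(ap);

    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))
            dlsym(RTLD_NEXT, "ioctl");

    /* A real hook would match the amdgpu command-submission ioctl
     * here and patch the GDS fields inside *arg before forwarding. */

    return real_ioctl(fd, request, arg);
}
```

The wrapper just forwards everything untouched; the interesting part would be recognizing the command-submission request and rewriting it, which depends on the installed driver's UAPI headers.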
sr. member
Activity: 728
Merit: 304
Miner Developer
THE KERNEL PATCH IS WORKING!! WHOO HOO!!!!
sr. member
Activity: 728
Merit: 304
Miner Developer
How many hashes will it produce for Ethereum? Thanks.

About 10% lower than Claymore's. You should try it yourself.
hero member
Activity: 2086
Merit: 562
How many hashes will it produce for Ethereum? Thanks.
sr. member
Activity: 728
Merit: 304
Miner Developer
I just patched the Linux kernel as an experiment:

Code:
if (1 /* gds */) {
	p->job->gds_base = 0;     /* was: amdgpu_bo_gpu_offset(gds); */
	p->job->gds_size = 65536; /* was: amdgpu_bo_size(gds); */
}

This may actually work...
sr. member
Activity: 728
Merit: 304
Miner Developer
This is the portion of the Linux kernel responsible for GDS-related parameters for compute kernels:

Code:
if (gds) {
	p->job->gds_base = amdgpu_bo_gpu_offset(gds);
	p->job->gds_size = amdgpu_bo_size(gds);
}
https://github.com/torvalds/linux/blob/ef96152e6a36e0510387cb174178b7982c1ae879/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

I should be able to change them by modifying the kernel source code.
I love free software!
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems like it is possible to access PCIe devices directly in user space on Windows, too:

https://msdn.microsoft.com/windows/hardware/drivers/wdf/comparing-umdf-2-0-functionality-to-kmdf

Well, there is something to learn every day...
sr. member
Activity: 728
Merit: 304
Miner Developer
Using SLC or GLC memory read/write may also give a small performance boost.

Could you elaborate on this? I recall you said Wolf was using them for his private miner, but I am not entirely sure how to use SLC/GLC bits for performance enhancements.

The GLC (Globally Coherent) bit forces bypassing the L1 cache, and SLC (System Level Coherent) forces bypassing the L2.  For ETH mining, which is 100% memory reads, I think SLC gave a performance improvement, but GLC did not.  The results weren't completely intuitive, so you'll probably have to do some experimenting.  I also suspect you may get different results from different GCN versions.  Pitcairn and Tahiti seem to have a brain-dead cache controller that gets slower as the working set grows much beyond 1GB.  Therefore I think GLC reads/writes may have a more significant impact for them vs Tonga (or even Hawaii).


Thanks for the clarification. Yeah, these features definitely make more sense for access to ETH's huge DAG. Let's see what I can do with them for ZEC...
sr. member
Activity: 728
Merit: 304
Miner Developer
EDIT: Actually, two of the cards just went SICK after 5 minutes, while Claymore works for days with no issues.

That must be a hardware issue. Different miners tend to expose different hardware problems.
Also, for optimal performance with Ellesmere, you need to run the miner on Linux for now.
I think the only real technical advantage Claymore has over me is that he figured out how to access the entire GDS both on Windows and Linux.
sr. member
Activity: 588
Merit: 251
Using SLC or GLC memory read/write may also give a small performance boost.

Could you elaborate on this? I recall you said Wolf was using them for his private miner, but I am not entirely sure how to use SLC/GLC bits for performance enhancements.

The GLC (Globally Coherent) bit forces bypassing the L1 cache, and SLC (System Level Coherent) forces bypassing the L2.  For ETH mining, which is 100% memory reads, I think SLC gave a performance improvement, but GLC did not.  The results weren't completely intuitive, so you'll probably have to do some experimenting.  I also suspect you may get different results from different GCN versions.  Pitcairn and Tahiti seem to have a brain-dead cache controller that gets slower as the working set grows much beyond 1GB.  Therefore I think GLC reads/writes may have a more significant impact for them vs Tonga (or even Hawaii).
sr. member
Activity: 857
Merit: 262
getting closer...


EDIT: Actually, two of the cards just went SICK after 5 minutes, while Claymore works for days with no issues.
sr. member
Activity: 728
Merit: 304
Miner Developer
I am not going to join this "Who got the fastest miner?" discussion.
Project GG's slogan is: "The best miner should be free."
I mean, Bitcoin's founder, Satoshi Nakamoto, was so generous.
Why shouldn't we be?