
Topic: Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 153. (Read 214410 times)

sr. member
Activity: 728
Merit: 304
Miner Developer
Taking a deep breath and taking two steps back.


That's hard to do, though... I'm getting 260 sol/s with a stock RX 480 right now.
Got to check everything one more time.
sr. member
Activity: 574
Merit: 250
Fighting mob law and inquisition in this forum
Taking a deep breath and taking two steps back.
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.


Pretty sure both Optiminer and Claymore are running two kernel threads.

It looks that way now... I think the Data Share unit is overloaded with my current implementation of Equihash.
What to do, what to do...
legendary
Activity: 2156
Merit: 1400
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.


Pretty sure both Optiminer and Claymore are running two kernel threads.
sr. member
Activity: 588
Merit: 251
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...

Or maybe Optiminer runs just 2 instances of the kernel and uses only half of the GDS.  It's possible (I'd even say probable) that the benefits of using the full 64KB are offset by slower GDS access caused by contention with 4 instances of the kernel running.
hero member
Activity: 2548
Merit: 626
So good to see how much you love doing this, zawawa! :)
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems that the real maximum size of GDS segments for RX 480 is 16KB.
It's a little disappointing, but still much better than 4KB without the kernel patch.
This number is also consistent with nertralph's report that Optiminer runs four CPU threads per GPU, as GDS utilization can be maximized that way.
Now let me fix the kernel patch one more time...
sr. member
Activity: 728
Merit: 304
Miner Developer
Well, it turned out that multithreading wasn't working, so I'm still working on the kernel patch.
The miner is running about 15% faster with a single thread, so it's very promising.
I should be able to get rid of the patch entirely by hooking system calls to the driver later.
This is no easy stuff!
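The "hooking system calls to the driver" idea could take the shape of an LD_PRELOAD interposer in user space. A rough sketch, not the actual implementation; the amdgpu-specific matching is left as a comment because the ioctl numbers and struct layouts are driver-version specific:

```c
/* Sketch of an LD_PRELOAD interposer: override ioctl() so requests
 * heading to the amdgpu DRM device could be inspected and their GDS
 * parameters rewritten before reaching the kernel. Build with
 * `gcc -shared -fPIC hook.c -o hook.so -ldl` and run the miner with
 * `LD_PRELOAD=./hook.so`. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>

int ioctl(int fd, unsigned long request, ...)
{
    static int (*real_ioctl)(int, unsigned long, ...);
    va_list ap;
    void *arg;

    /* ioctl's third argument is conventionally a single pointer/long */
    va_start(ap, request);
    arg = va_arg(ap, void *);
    va_end(ap);

    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...))
            dlsym(RTLD_NEXT, "ioctl");

    /* A real hook would match the amdgpu command-submission ioctl
     * here and patch the GDS fields inside *arg before forwarding. */

    return real_ioctl(fd, request, arg);
}
```

The wrapper just forwards everything untouched; the interesting part would be recognizing the command-submission request and rewriting it, which depends on the installed driver's UAPI headers.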
sr. member
Activity: 728
Merit: 304
Miner Developer
THE KERNEL PATCH IS WORKING!! WHOO HOO!!!!
sr. member
Activity: 728
Merit: 304
Miner Developer
How many hashes will it produce for Ethereum? Thanks.

About 10% lower than Claymore's. You should try it yourself.
hero member
Activity: 2086
Merit: 562
How many hashes will it produce for Ethereum? Thanks.
sr. member
Activity: 728
Merit: 304
Miner Developer
I just patched the Linux kernel as an experiment:

Code:
if (1 /* gds */) {
	p->job->gds_base = 0;     /* was: amdgpu_bo_gpu_offset(gds); */
	p->job->gds_size = 65536; /* was: amdgpu_bo_size(gds); */
}

This may actually work...
sr. member
Activity: 728
Merit: 304
Miner Developer
This is the portion of the Linux kernel responsible for GDS-related parameters for compute kernels:

Code:
if (gds) {
	p->job->gds_base = amdgpu_bo_gpu_offset(gds);
	p->job->gds_size = amdgpu_bo_size(gds);
}
https://github.com/torvalds/linux/blob/ef96152e6a36e0510387cb174178b7982c1ae879/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

I should be able to change them by modifying the kernel source code.
I love free software!
sr. member
Activity: 728
Merit: 304
Miner Developer
It seems like it is possible to access PCIe devices directly in user space on Windows, too:

https://msdn.microsoft.com/windows/hardware/drivers/wdf/comparing-umdf-2-0-functionality-to-kmdf

Well, there is something to learn every day...
sr. member
Activity: 728
Merit: 304
Miner Developer
Using SLC or GLC memory read/write may also give a small performance boost.

Could you elaborate on this? I recall you said Wolf was using them for his private miner, but I am not entirely sure how to use SLC/GLC bits for performance enhancements.

The GLC (Globally Coherent) bit forces bypassing the L1 cache, and SLC (System Level Coherent) forces bypassing the L2.  For ETH mining, which is 100% memory reads, I think SLC gave a performance improvement, but GLC did not.  The results weren't completely intuitive, so you'll probably have to do some experimenting.  I also suspect you may get different results from different GCN versions.  Pitcairn and Tahiti seem to have a brain-dead cache controller that gets slower as the working set grows much beyond 1GB.  Therefore I think GLC reads/writes may have a more significant impact for them vs Tonga (or even Hawaii).


Thanks for the clarification. Yeah, these features definitely make more sense for access to ETH's huge DAG. Let's see what I can do with them for ZEC...
sr. member
Activity: 728
Merit: 304
Miner Developer
EDIT: Actually, two of the cards just went SICK after 5 minutes, while Claymore works for days with no issues.

That must be a hardware issue. Different miners tend to expose different hardware problems.
Also, for optimal performance with Ellesmere, you need to run the miner on Linux for now.
I think the only real technical advantage Claymore has over me is that he figured out how to access the entire GDS both on Windows and Linux.
sr. member
Activity: 588
Merit: 251
Using SLC or GLC memory read/write may also give a small performance boost.

Could you elaborate on this? I recall you said Wolf was using them for his private miner, but I am not entirely sure how to use SLC/GLC bits for performance enhancements.

The GLC (Globally Coherent) bit forces bypassing the L1 cache, and SLC (System Level Coherent) forces bypassing the L2.  For ETH mining, which is 100% memory reads, I think SLC gave a performance improvement, but GLC did not.  The results weren't completely intuitive, so you'll probably have to do some experimenting.  I also suspect you may get different results from different GCN versions.  Pitcairn and Tahiti seem to have a brain-dead cache controller that gets slower as the working set grows much beyond 1GB.  Therefore I think GLC reads/writes may have a more significant impact for them vs Tonga (or even Hawaii).
sr. member
Activity: 857
Merit: 262
getting closer...


EDIT: Actually, two of the cards just went SICK after 5 minutes, while Claymore works for days with no issues.
sr. member
Activity: 728
Merit: 304
Miner Developer
I am not going to join this "Who got the fastest miner?" discussion.
Project GG's slogan is: "The best miner should be free."
I mean, Bitcoin's founder, Satoshi Nakamoto, was so generous.
Why shouldn't we be?