AMDGPU-Pro 17.40 with large page support

bridgman

newbie

Activity: 2

Merit: 0

Quote from: nerdralph on November 06, 2017, 02:18:08 PM

Thanks for the explanation. Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set? By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables. I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.

The large page support should still help if the L1/L2 caches are being bypassed; if anything, it would help more. Bypassing the L1/L2 caches skips access to the caches' tag RAMs but does not skip access to the page tables, so the reduced TLB thrashing from large page support should still be relevant.
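To put rough numbers on the TLB thrashing point, here is a minimal arithmetic sketch. The 4 KiB base page size and the 64-entry TLB are assumptions for illustration only; the thread gives neither figure, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages needed to map `bytes` of memory, rounding up. */
static uint64_t pages_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/*
 * "TLB reach": how much memory a TLB with `entries` entries can cover
 * before a translation has to go back to the page tables.
 * The entry count passed in is an assumption, not a GCN hardware figure.
 */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

Under these assumptions, a 2 GiB DAG needs 524288 translations' worth of 4 KiB pages but only 1024 with 2 MiB pages, and a hypothetical 64-entry TLB would cover 256 KiB versus 128 MiB. Since ethash reads the DAG at effectively random offsets, the larger reach is where the reduced thrashing would come from.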

nerdralph

sr. member

Activity: 588

Merit: 251

I just noticed that although mining with a >2GB DAG is faster with the 2MB page size, DAG creation is much slower (it takes 3-4x longer).

nerdralph

Quote

Quote from: nerdralph on November 06, 2017, 02:18:08 PM

Quote from: bridgman on November 05, 2017, 04:22:26 PM

Quote

I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation. Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set? By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables. I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.

Slower indeed. GLC alone is best. SLC alone, as well as SLC + GLC are worse than only GLC.

By the way, I finally took a look at why Claymore's "ASM" kernel for Ellesmere is so ass. Get this: he didn't actually do it properly, as in writing the whole thing in ASM, as he (arguably) implies. It's the output of the AMD OpenCL compiler (using the "-legacy" switch to make it use the older version), then tweaked a bit. He even uses LDS... wtf. He DOES take advantage of ds_swizzle_b32, but he's still fuckin' wasting a lot of local mem writes plus some reads. ds_swizzle_b32 is (IIRC) full-rate, but even so, he's wasting 4 clocks per when it could be done better. Additionally, and this made me laugh IRL, his v10.0 miner looks for a kernel that does not exist. I was actually hella confused for a minute and double- and triple-checked the decoded GCN kernel binary (recovered from the memory of the miner process), but sure enough, it wasn't there. So I finally figured, let's check the return value of clCreateKernel()... and it returns CL_INVALID_KERNEL_NAME (-46). Apparently the miner finds this error to be non-fatal and continues... but why the bloody fuck is it IN there?
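For anyone wanting to repeat the clCreateKernel() check described above, here is a minimal sketch of classifying its error code. The two constants mirror the values in the Khronos CL/cl.h header but are redefined locally (with a SKETCH_ prefix) so the snippet compiles without an OpenCL SDK; the helper names are hypothetical and this is not how the miner itself handles the error.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of two OpenCL status codes (same values as in CL/cl.h). */
enum {
    SKETCH_CL_SUCCESS             = 0,
    SKETCH_CL_INVALID_KERNEL_NAME = -46
};

/* True if the status written by clCreateKernel() indicates failure. */
static bool create_kernel_failed(int err)
{
    return err != SKETCH_CL_SUCCESS;
}

/* Human-readable name for the two codes this sketch knows about. */
static const char *create_kernel_status_name(int err)
{
    switch (err) {
    case SKETCH_CL_SUCCESS:             return "CL_SUCCESS";
    case SKETCH_CL_INVALID_KERNEL_NAME: return "CL_INVALID_KERNEL_NAME";
    default:                            return "other OpenCL error";
    }
}

/*
 * With a real OpenCL SDK the check would follow the call, roughly:
 *
 *   cl_int err;
 *   cl_kernel k = clCreateKernel(program, "some_kernel_name", &err);
 *   if (create_kernel_failed(err))
 *       fprintf(stderr, "clCreateKernel: %s (%d)\n",
 *               create_kernel_status_name(err), err);
 *
 * "some_kernel_name" is a placeholder; the thread does not name the
 * missing kernel.
 */
```

A host program that ignores this status carries on with an invalid kernel handle, which matches the non-fatal behaviour described in the post.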

Thanks for confirming. So bypassing L1 is faster, but bypassing L2 is slower. Yet speeding up L2 cache misses with larger pages is faster...
According to GCN architecture docs, L2 does more than just cache, handling things such as global atomics. I now suspect it is also involved in queueing and arbitration for DRAM access from the CUs, which would explain the slowdown using SLC.

As for Claymore, ever since I first looked at his equihash kernels, I've thought he's a hack. He seems to get most of his ideas from other people, rather than trying to fully understand the GPU hardware and OpenCL compiler. Tweaking ISA compiler output is something that zawawa was discussing earlier this year (in addition to LLVM work to get inline asm working), so I suspect Claymore just got the idea from reading zawawa's posts.

stash2coin

jr. member

Activity: 108

Merit: 1

It could be a test kernel that he removes before public release, without bothering to remove the reference to it because it isn't causing problems. One more funny thing: his ZEC miner looks for Nvidia libraries, but the miner is only for AMD cards.

nerdralph

Quote from: bridgman on November 05, 2017, 04:22:26 PM

Quote

I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.

Thanks for the explanation. Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set? By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables. I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.

bridgman

Quote

I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

That comment refers to not requiring TLBs in order to access the tag RAM and see if there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through page tables (accelerated via TLBs) in order to access the correct memory location.
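The ordering described here (tag check first, translation only on a miss) can be illustrated with a toy model. This is a sketch under heavy assumptions: a tiny direct-mapped cache with 64-byte lines, invented names, and a counter standing in for the TLB and page-table walk. It is not a description of the actual GCN cache organisation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_LINES      4   /* hypothetical cache size, in lines */
#define TOY_LINE_SHIFT 6   /* assumed 64-byte cache lines */

struct toy_cache {
    uint64_t tag[TOY_LINES];    /* virtual line address stored per slot */
    bool     valid[TOY_LINES];
    unsigned translations;      /* times the TLB/page tables were needed */
};

/*
 * Look up a virtual address. The tags hold virtual addresses, so a hit
 * is resolved without any translation. Only a miss has to translate the
 * address (counted here) before the line is fetched and installed.
 * Returns true on a hit.
 */
static bool toy_access(struct toy_cache *c, uint64_t vaddr)
{
    uint64_t line = vaddr >> TOY_LINE_SHIFT;
    unsigned idx  = (unsigned)(line % TOY_LINES);

    if (c->valid[idx] && c->tag[idx] == line)
        return true;            /* hit: no TLB involved */

    c->translations++;          /* miss: translate, then fetch from memory */
    c->tag[idx]   = line;
    c->valid[idx] = true;
    return false;
}
```

In this model, two accesses to the same line cost one translation in total, while every access that misses pays for one. That is why a workload that misses almost every time still depends on TLB behaviour.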

nerdralph

http://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Driver-for-Linux-Release-Notes.aspx
Anyone test it out yet on a card with a custom BIOS?

I was confused by the comment about page size, since the AMD GCN Whitepaper claims "Like the L1 data cache, the L2 is virtually addressed, so no TLBs are required at all" on pg 10.
https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf

However, a variable page size for a cache controller suggests a TLB, and comments on Phoronix seem to confirm this:
https://www.phoronix.com/forums/forum/linux-graphics-x-org-drivers/open-source-amd-linux/977778-amdgpu-increasing-fragment-size-for-performance

I'm guessing this is just another example of why the GCN docs can't be 100% relied on, and to find out what's really going on it is necessary to go through the driver code and do your own tweaks and testing.
