That comment refers to the tag RAM lookup not needing the TLBs to determine whether there is a cache hit, so requests that hit in cache are a bit faster. If the access misses in cache, then the request has to go through the page tables (accelerated via TLBs) in order to access the correct memory location.
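To make the hit/miss path described above concrete, here is a rough toy model in Python. It assumes the cache tags are checked with the virtual address (which is how I read the comment), and every name and size in it (`ToyMemoryPath`, `PAGE_SIZE`, `LINE_SIZE`) is made up for illustration rather than taken from GCN documentation.

```python
# Toy model only: names and sizes are illustrative assumptions, not real GCN parameters.
PAGE_SIZE = 4096   # assumed page size in bytes
LINE_SIZE = 64     # assumed cache line size in bytes


class ToyMemoryPath:
    """Cache tag check first; address translation only happens on a miss."""

    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical page number
        self.tlb = {}                  # translations cached from earlier page walks
        self.cache_tags = set()        # virtual line numbers currently in the cache
        self.stats = {"cache_hits": 0, "tlb_hits": 0, "page_walks": 0}

    def load(self, vaddr):
        line = vaddr // LINE_SIZE
        if line in self.cache_tags:
            # Hit: the tag lookup used the virtual address, so no TLB access at all.
            self.stats["cache_hits"] += 1
            return "cache hit"

        # Miss: the request now needs a physical address.
        vpn = vaddr // PAGE_SIZE
        if vpn in self.tlb:
            self.stats["tlb_hits"] += 1
            result = "cache miss, TLB hit"
        else:
            # TLB miss: walk the page table and remember the translation.
            self.tlb[vpn] = self.page_table[vpn]
            self.stats["page_walks"] += 1
            result = "cache miss, page walk"

        self.cache_tags.add(line)      # fill the line so the next access hits
        return result
```

In this sketch, a second load of the same address never touches `tlb` or `page_table`, which is the "a bit faster" case. A load of a different line in the same page misses in cache but reuses the cached translation.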
Thanks for the explanation. Would the large page support be irrelevant for GCN assembler code that loads uncacheable data like the ethash DAG using FLAT_LOAD_DWORD with the SLC and GLC bits set? By bypassing L1 & L2, that should mean skipping the access to the tag RAM and the page tables. I haven't actually tried it myself yet, but comments by Wolf0 and zawawa seemed to say it was actually a bit slower.
Slower indeed. GLC alone is best. SLC alone, as well as SLC + GLC, are both worse than GLC only.
By the way, I finally took a look at why Claymore's "ASM" kernel for Ellesmere is so ass. Get this - he didn't actually do it properly, as in the whole thing in ASM, as he (arguably) implies. It's the output of the AMD OpenCL compiler (using the "-legacy" switch to make it use the older version), tweaked a bit. He even uses LDS... wtf. He DOES take advantage of ds_swizzle_b32, but he's still fuckin' wasting a lot of local mem writes + some reads, and the ds_swizzle_b32 is (IIRC) full-rate, but even so, he's wasting 4 clocks per when it could be done better. Additionally - this made me laugh IRL - his v10.0 miner looks for a kernel that does not exist. I was actually hella confused for a minute - double and triple checked the decoded GCN kernel binary (recovered from the memory of the miner process) - but, sure enough, it wasn't there. So, I finally figured, let's check the return value of clCreateKernel()... sure enough, it returns CL_INVALID_KERNEL_NAME (-46). Apparently the miner (obviously) treats this error as non-fatal and continues... but why the bloody fuck is it IN there?
Thanks for confirming. So bypassing L1 is faster, but bypassing L2 is slower. But speeding up L2 cache misses with larger pages is faster...
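A back-of-the-envelope sketch of why larger pages help on cache misses: fewer pages cover the same buffer, so fewer translations are needed and each TLB entry covers more memory. The numbers below are assumptions for illustration only (a hypothetical 2 GiB DAG, 4 KiB small pages, 2 MiB large pages); the thread does not state the actual sizes.

```python
# Illustrative arithmetic only; all sizes below are assumptions, not measured values.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(buffer_bytes, page_bytes):
    """Number of pages (and so distinct translations) needed to map a buffer."""
    return -(-buffer_bytes // page_bytes)   # ceiling division


dag_bytes = 2 * GIB                         # hypothetical DAG size
small = pages_needed(dag_bytes, 4 * KIB)    # assumed small page size
large = pages_needed(dag_bytes, 2 * MIB)    # assumed large page size

print(small)           # 524288
print(large)           # 1024
print(small // large)  # 512
```

With random accesses spread over the whole buffer, as in ethash, the small-page case has far more distinct translations than any TLB can hold. Under these assumed sizes, the large-page case needs 512 times fewer.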
According to GCN architecture docs, L2 does more than just cache, handling things such as global atomics. I now suspect it is also involved in queueing and arbitration for DRAM access from the CUs, which would explain the slowdown using SLC.
As for Claymore, ever since I first looked at his equihash kernels, I've thought he's a hack. He seems to get most of his ideas from other people, rather than trying to fully understand the GPU hardware and OpenCL compiler. Tweaking ISA compiler output is something that zawawa was discussing early this year (in addition to LLVM work to get inline asm working), so I suspect Claymore just got the idea from reading zawawa's posts.