Gateless Gate Sharp 1.3.8: 30Mh/s (Ethash) on RX 480! - page 163.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Good grief, they are not checking GCN address spaces at all!
This is unreal...

Code:

bool llvm::isAllocaPromotable(const AllocaInst *AI) {
  // FIXME: If the memory unit is of pointer or integer type, we can permit
  // assignments to subsections of the memory unit.
  unsigned AS = AI->getType()->getAddressSpace();

  // Only allow direct and non-volatile loads and stores...
  for (const User *U : AI->users()) {
    if (const LoadInst *LI = dyn_cast<LoadInst>(U)) {
      // Note that atomic loads can be transformed; atomic semantics do
      // not have any meaning for a local alloca.
      if (LI->isVolatile())
        return false;
    } else if (const StoreInst *SI = dyn_cast<StoreInst>(U)) {
      if (SI->getOperand(0) == AI)
        return false; // Don't allow a store OF the AI, only INTO the AI.
      // Note that atomic stores can be transformed; atomic semantics do
      // not have any meaning for a local alloca.
      if (SI->isVolatile())
        return false;
    } else if (const IntrinsicInst *II = dyn_cast<IntrinsicInst>(U)) {
      if (II->getIntrinsicID() != Intrinsic::lifetime_start &&
          II->getIntrinsicID() != Intrinsic::lifetime_end)
        return false;
    } else if (const BitCastInst *BCI = dyn_cast<BitCastInst>(U)) {
      if (BCI->getType() != Type::getInt8PtrTy(U->getContext(), AS))
        return false;
      if (!onlyUsedByLifetimeMarkers(BCI))
        return false;
    } else if (const GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(U)) {
      if (GEPI->getType() != Type::getInt8PtrTy(U->getContext(), AS))
        return false;
      if (!GEPI->hasAllZeroIndices())
        return false;
      if (!onlyUsedByLifetimeMarkers(GEPI))
        return false;
    } else {
      return false;
    }
  }

  return true;
}

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Code:

case Intrinsic::invariant_start:
case Intrinsic::invariant_end:
case Intrinsic::invariant_group_barrier:
Intr->eraseFromParent();
// FIXME: I think the invariant marker should still theoretically apply,
// but the intrinsics need to be changed to accept pointers with any
// address space.
continue;

...I have a bad feeling now. Why is this code so incomplete?

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

All joking aside, I was able to run SA's kernel with alloca promotion enabled after making the following modifications:

Code:

static bool isCallPromotable(CallInst *CI) {
  IntrinsicInst *II = dyn_cast<IntrinsicInst>(CI);
  if (!II)
    return false;

  switch (II->getIntrinsicID()) {
  case Intrinsic::memcpy:
  case Intrinsic::memmove:
  case Intrinsic::memset:
  case Intrinsic::objectsize:
  case Intrinsic::invariant_group_barrier:
  case Intrinsic::invariant_start:
  case Intrinsic::invariant_end:
    return true;

  default:
  case Intrinsic::lifetime_start: // zawawa
  case Intrinsic::lifetime_end: // zawawa
    return false;
  }
}

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: Kompik on February 21, 2017, 04:26:33 PM

Quote from: zawawa on February 21, 2017, 01:04:08 PM

Quote from: Kompik on February 21, 2017, 12:53:17 PM

sp_ has reportedly achieved 530 sols@OC1070 on zcash on his miner. 15+% more over EWBF(Claymore?) which is amazing.

Makes me wonder if I should rewrite the kernel in the GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he is doing.

It seems that some other people are making fun of him and posting different miners claiming they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, but probably not 530 sols. Sorry for the mystification.

Bwahahaha!

Kompik

sr. member

Activity: 463

Merit: 250

Quote from: zawawa on February 21, 2017, 01:04:08 PM

Quote from: Kompik on February 21, 2017, 12:53:17 PM

sp_ has reportedly achieved 530 sols@OC1070 on zcash on his miner. 15+% more over EWBF(Claymore?) which is amazing.

Makes me wonder if I should rewrite the kernel in the GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he is doing.

It seems that some other people are making fun of him and posting different miners claiming they are SP_mod. I don't really get it all, but it seems that he has some miner that is faster than EWBF, but probably not 530 sols. Sorry for the mystification.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: Kompik on February 21, 2017, 12:53:17 PM

sp_ has reportedly achieved 530 sols@OC1070 on zcash on his miner. 15+% more over EWBF(Claymore?) which is amazing.

Makes me wonder if I should rewrite the kernel in the GCN assembly from scratch.
I don't know about his personality, but he surely is pretty good at what he is doing.

Kompik

sr. member

Activity: 463

Merit: 250

sp_ has reportedly achieved 530 sols@OC1070 on zcash on his miner. 15+% more over EWBF(Claymore?) which is amazing.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on February 21, 2017, 05:01:40 AM

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:

// We must read the size out of the dispatch pointer.
assert(IsAMDGCN);

No wonder the compiler was not working reliably...

If the equihash algorithm were simpler, I'd say just write it from scratch in asm. When you can write in asm, you'll rarely be pleased with compiler-generated code. If assemblers did register allocation, I'd probably write most of my performance-sensitive code in asm.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on February 21, 2017, 12:12:51 AM

I tried splitting row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

Too bad, although I think performance on the newer cards is more important anyway.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

It seems that there is ABI-dependent code in LLVM's AMDGPUPromoteAlloca.cpp. This stuff is not straightforward at all, huh. There must be a pointer to the dispatch packet:

Code:

// We must read the size out of the dispatch pointer.
assert(IsAMDGCN);

No wonder the compiler was not working reliably...

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on February 20, 2017, 06:17:57 PM

Quote from: theflow4321 on February 20, 2017, 05:44:09 PM

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time. After 4 clock cycles it is ready to process another 64 work-items (one wave). On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5. It is a bit complicated, and it seems many people don't take the time to fully understand it. Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations,
with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer
operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle,
but takes 4 cycles to execute operations for all 64 work items

This is a really nice write-up. Thank you! We do need more documentation that is concise and accurate. Seriously.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Quote from: nerdralph on February 20, 2017, 10:11:43 AM

Quote from: zawawa on February 20, 2017, 03:28:46 AM

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured. Is the issue that LLVM is not promoting because it is too conservative about the number of available registers? And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves. However, a compiler hint might be a solution (e.g. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.

I had to turn off alloca promotion even to run SA v5's original kernel. I suspect there is a bug in the routine that promotes alloca to LDS. AMD's conservative approach is pretty lazy IMO, but LLVM/Clang seems to allow for compiler hints in the form of attributes.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

I tried splitting row counters between GDS and the L2 cache on a 7990, with 2048 rows assigned to each, but it turned out that the miner runs slower this way. I think the overhead of switching between them outweighs the performance gain from the split. Perhaps a better way is to alternate between GDS and the L2 cache for each round.

nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: theflow4321 on February 20, 2017, 05:44:09 PM

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

No, each SIMD unit only processes 16 work-items at a time. After 4 clock cycles it is ready to process another 64 work-items (one wave). On cycle 1 the CU will dispatch to SIMD0, then to SIMD1 on cycle 2, SIMD2 on cycle 3, SIMD3 on cycle 4, and back to SIMD0 on cycle 5. It is a bit complicated, and it seems many people don't take the time to fully understand it. Because of that, I've written a short blog post to explain it.

http://nerdralph.blogspot.ca/2017/02/inside-amd-gcn-code-execution.html

edit: Here's a quote from AMD's GCN whitepaper:
Each SIMD includes a 16-lane vector pipeline that is predicated and fully IEEE-754 compliant for single precision and double precision floating point operations,
with full speed denormals and all rounding modes. Each lane can natively executes a single precision fused or unfused multiply-add or a 24-bit integer
operation. The integer multiply-add is particularly useful for calculating addresses within a work-group. A wavefront is issued to a SIMD in a single cycle,
but takes 4 cycles to execute operations for all 64 work items
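
The issue pattern described above can be sketched in a few lines of C++. This is only a toy model of the round-robin schedule from this post (4 SIMDs per CU, 16 lanes per SIMD, 64 work-items per wave); the function names are made up for illustration and it does not model real hardware details such as scalar instructions or memory stalls.

```cpp
#include <cassert>

// Figures from the post and the quoted whitepaper.
constexpr unsigned kSimdsPerCu = 4;    // SIMD0..SIMD3 in one compute unit
constexpr unsigned kLanesPerSimd = 16; // 16-lane vector pipeline
constexpr unsigned kWaveSize = 64;     // work-items per wavefront

// Cycles one SIMD needs to run a vector instruction for a whole wave:
// 64 work-items pushed through 16 lanes.
constexpr unsigned cyclesPerWaveInstruction() {
    return kWaveSize / kLanesPerSimd;
}

// SIMD that gets the issue slot on a given cycle (cycles are 1-based,
// matching "cycle 1 -> SIMD0, cycle 2 -> SIMD1, ...").
unsigned simdForCycle(unsigned cycle) {
    assert(cycle >= 1);
    return (cycle - 1) % kSimdsPerCu;
}

// Next cycle on which the same SIMD is offered an issue slot again.
unsigned nextIssueCycle(unsigned cycle) {
    return cycle + kSimdsPerCu;
}
```

In this model a SIMD that issues on cycle 1 finishes its 4 passes of 16 work-items just as its next issue slot arrives on cycle 5, which is why a single wave per SIMD can, in principle, keep the vector lanes busy.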

theflow4321

newbie

Activity: 1

Merit: 0

But you need 4 waves to take advantage of the GCN timing architecture in that 4 lock-steps / 4 SIMDs design.

The steps are related to the instruction/scalar processor, but a single ALU can address something like 12 registers per cycle, if I'm not mistaken, within that Single Instruction Multiple Data design. Or something.

http://gpuopen.com/anatomy-total-war-engine-part-2/

"VGPRS
From the point of view of a single thread each of the vector registers hold a single value. 256 VGPRs can mean 64 float4, 128 float2, 256 float or any combination of these. For example, if we sample a texture, but only use its RGB components and not its alpha channel, it will take up 3 VGPRs. Let’s do some math: we want to blend 8 terrain layers. Each layer has a diffuse, a normal and a spec/gloss texture. We use all 4 channels of the diffuse texture, 2 channels of the normal texture and 2 channels of the spec/gloss texture. That’s 4+2+2 = 8 VGPRs per layer. Multiplied by 8 layers is 64 VGPRs. So we’ve already used up a quarter of all the available registers in the SIMD and we haven’t even started to talk about other parts of the code, blend maps, height map, etc. Some registers can be reused, but as we’ll see soon it’s not as trivial as it seems."

"The number of used registers is important, because modern hardware runs multiple wavefronts at the same time. You can think about this as processing multiple pixels at the same time. This means that one of the main limiting factors on the number of pixels we can have in flight is the number of registers the shaders require. If a single SIMD in the hardware has 256 VGPRs and a shader is using 200 of them for example, the GPU can work only on one wavefront at a time. After the first wavefront is launched on a SIMD it leaves 56 registers unused, which is not enough to accommodate another wavefront running the same shader. If it’s using 110 VGPRs, two wavefronts can run at the same time (112+112=224 and 32 registers remain unused). If the shader uses only 24 or less VGPRs, the hardware can run 10 wavefronts on a SIMD at the same time. 10 concurrent wavefronts is the current maximum for Fiji GPUs, such as the Radeon® Fury X. This limit is hard wired."

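The register arithmetic in the GPUOpen quote above can be written out as a small C++ helper. This is a rough sketch under stated assumptions: the 256-VGPR budget and the 10-wave cap come from the quote, while the allocation granule of 4 registers is only inferred from its "112+112" example (110 rounded up to 112) and may differ between GPU generations. The name wavesPerSimd is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// From the quoted article: 256 VGPRs available per SIMD, hard limit of
// 10 concurrent wavefronts per SIMD.
constexpr unsigned kVgprsPerSimd = 256;
constexpr unsigned kMaxWavesPerSimd = 10;
// Assumption: registers are handed out in blocks of 4, inferred from the
// article rounding 110 VGPRs up to 112.
constexpr unsigned kVgprGranule = 4;

// Number of wavefronts that fit on one SIMD when each wave of the kernel
// needs `vgprsUsed` vector registers.
unsigned wavesPerSimd(unsigned vgprsUsed) {
    assert(vgprsUsed <= kVgprsPerSimd);
    if (vgprsUsed == 0)
        return kMaxWavesPerSimd;
    // Round the request up to the next multiple of the granule.
    unsigned allocated =
        (vgprsUsed + kVgprGranule - 1) / kVgprGranule * kVgprGranule;
    return std::min(kMaxWavesPerSimd, kVgprsPerSimd / allocated);
}
```

With these assumptions the helper reproduces the three cases in the quote: 200 VGPRs gives 1 wave, 110 gives 2, and 24 or fewer gives the maximum of 10.
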
nerdralph

sr. member

Activity: 588

Merit: 251

Quote from: zawawa on February 20, 2017, 03:28:46 AM

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

You're deeper into LLVM than I've ever ventured. Is the issue that LLVM is not promoting because it is too conservative about the number of available registers? And it is conservative about registers because it is trying to generate code to support more waves than is optimal?
Since cache hit rates have a big impact on memory latency and therefore the optimal number of waves for latency hiding, it would be virtually impossible for the compiler to determine the optimal number of waves. However, a compiler hint might be a solution (e.g. -fnum-waves=X).

p.s. despite CLRX and a couple other sources claiming 4 waves are required for full VALU occupancy, I'm now convinced it can be done with just one.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

Alright, 10 out of 12 Equihash kernels are working without LLVM's alloca promotion.

zawawa

sr. member

Activity: 728

Merit: 304

Miner Developer

The new compiler can build SA's old kernel now.
Let's try GG's kernel again...

megacrypto

sr. member

Activity: 291

Merit: 250

Quote from: Balitorium on February 18, 2017, 08:55:21 AM

Quote from: megacrypto on February 17, 2017, 07:35:36 PM

I could never understand how to compile and use this on Ubuntu 16.04. I know it might be a stupid question, but could someone kindly write a simple, full step-by-step how-to for building on Ubuntu 16.04? It would be very much appreciated.

I'm running my rigs on Ubuntu 16.04 and it's pretty straightforward with basic Linux know-how. I don't have the spare time to write up a full how-to now, but the basic steps, as far as I remember, are:

- download and install the latest AMD SDK
- download and install the latest AMDGPU-PRO driver
- download zawawa's miner from GitHub
- use "apt-get install" to install the sgminer dependencies (see the readme)
- build from source according to the readme

If you run into trouble along the way, I'm pretty sure someone here will help you figure it out ;)

I got the first 3 steps done fine (I'm actually using sgminer-gm right now); it's the last 2 steps I can't seem to find my way around! It could be straightforward, but for some reason I just can't see it!


cryptominer420

sr. member

Activity: 450

Merit: 255

@zawawa
Sent $1 to your BTC address. It's not much, but right now I'm living off my BTC.
