Okay, it's not perfect, but it never will be. Sources that I've written or modified have been heavily commented to help out devs who are learning. The OpenCL was written entirely by me, and I modified SGMiner itself not just to run the modified kernel, but also to clean it up, remove every algorithm besides WhirlpoolX for simplicity, and add comments. The horrid lib SPH is no longer used at all. Instead, a SHA-256 implementation that was already in the tree but never even compiled now handles the several parts of SGMiner that used to call into lib SPH, and the CPU Whirlpool-512 implementation is a rather clean one I found here. That is also where I found what seems to be the only copy of Whirlpool-512 done without tables, on which I based my bitsliced implementation. Another modification I made was to check the OpenCL version at runtime and use clCreateCommandQueueWithProperties() if the version is 2.0 or greater. It replaces clCreateCommandQueue(), which is now deprecated and should not be used, yet even the current official SGMiner hasn't been updated. Anyways, go look at the source for yourself. It will be here momentarily.
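For anyone who wants to see roughly what a runtime version check like that can look like, here is a minimal sketch in plain C. It is not the code from the miner. It assumes the version string comes from clGetDeviceInfo() with CL_DEVICE_VERSION, which the OpenCL spec formats as "OpenCL <major>.<minor> <vendor-specific info>". The helper names parse_cl_version and use_queue_with_properties are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: parse a CL_DEVICE_VERSION-style string of the form
 * "OpenCL <major>.<minor> <vendor-specific info>".
 * Returns 1 on success, 0 if the string does not match that format. */
static int parse_cl_version(const char *s, int *major, int *minor)
{
    return sscanf(s, "OpenCL %d.%d", major, minor) == 2;
}

/* Hypothetical helper: nonzero when the device reports OpenCL 2.0 or newer,
 * i.e. when clCreateCommandQueueWithProperties() should be preferred over
 * the deprecated clCreateCommandQueue(). */
static int use_queue_with_properties(const char *version_string)
{
    int major = 0, minor = 0;

    if (!parse_cl_version(version_string, &major, &minor))
        return 0; /* unknown format: fall back to the older call */

    return major >= 2;
}
```

The caller would branch on the result once per device and create the queue with whichever function applies.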
Indeed, the code is very clear. So clear that it makes it almost impossible to miss one fat low-hanging fruit. Let's see who will come here screaming about how stupid Wolf0's miner is and how easy it was to optimize it.
Thank you for the #pragma unroll 1 line. I recently learned another way to prevent unrolling here. Didn't try it yet, my kernel is fully unrolled ATM. BTW, I also started from the nayuki implementation; I found it by searching for the first bytes of the S-box from the Whirlpool whitepaper.
amd_bfe is kind of poisoned knowledge: it's too easy to just use a good solution and miss a better one. From my (uncleaned) kernel:
#define Toff8(off8) (*(const LOCAL X64*)&(((const LOCAL UINT8*)TAll_local)[off8]))
#define LUT5(v) (ASX64(Toff8(bitselect(1U << 11, (v.y >> 5), 0x7F8U))).yx)
// ...
LOCAL X64 TAll_local[256*4];
// ...
stateBX64[7&(p+5)] ^= LUT5(stateAX64[p]);
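To spell out the trick in LUT5 as I read it: the table index has to become a byte offset anyway (entry index times 8 for a 64-bit entry, plus the base of the sub-table), so a single bitselect can produce that offset directly instead of extracting the byte first and scaling it afterwards. The plain C sketch below models this on the host. bitselect32 and bfe32 are stand-ins for the OpenCL bitselect() built-in and a bit-field extract (bfe32 is a simplified model, not the exact amd_bfe semantics), and the assumption that 1U << 11 is the byte base of the second 256-entry table is mine.

```c
#include <stdint.h>

/* Plain C model of OpenCL bitselect(a, b, c): each result bit comes from b
 * where the corresponding bit of c is 1, and from a where it is 0. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* Simplified model of a bit-field extract: `width` bits of `src` starting at
 * bit `offset`. Assumes 0 < width < 32 and offset + width <= 32. */
static uint32_t bfe32(uint32_t src, uint32_t offset, uint32_t width)
{
    return (src >> offset) & ((1u << width) - 1u);
}

/* Extract-then-scale: pull out byte 1 of the word, multiply by 8 (the size
 * of one 64-bit entry), then add the 2048-byte base (256 entries * 8). */
static uint32_t offset_via_bfe(uint32_t hi)
{
    return (1u << 11) + (bfe32(hi, 8, 8) << 3);
}

/* One-step form matching the LUT5 macro: shifting right by 5 leaves byte 1
 * already scaled by 8 in bits 3..10, and the mask 0x7F8 keeps exactly those
 * bits while bit 11 of the base passes through untouched. */
static uint32_t offset_via_bitselect(uint32_t hi)
{
    return bitselect32(1u << 11, hi >> 5, 0x7F8u);
}
```

Both functions return the same byte offset for every input; the second needs one shift and one select where the first needs an extract, a shift, and an add.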
BTW, maybe it's better to check for clCreateCommandQueueWithProperties() and such at configure time?
Hm... configure time is an excellent idea. It's only bad if it's not there during the compile; if using a binary, at runtime the missing function will never be called.
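A rough sketch of how the two checks could be combined, in plain C. The macro HAVE_CLCREATECOMMANDQUEUEWITHPROPERTIES is an assumption: it is the name an autoconf AC_CHECK_FUNCS test would define if the function were found at configure time, and nothing here says the SGMiner build actually does that. The function pick_queue_call is hypothetical and only returns the name of the call that would be used, so that it compiles without the OpenCL headers.

```c
/* Hypothetical selector: which queue-creation call would be used, given what
 * the build found at configure time and what the device reports at runtime.
 * HAVE_CLCREATECOMMANDQUEUEWITHPROPERTIES is assumed to come from a
 * configure check; it is not defined anywhere in this sketch. */
static const char *pick_queue_call(int device_cl_major)
{
#ifdef HAVE_CLCREATECOMMANDQUEUEWITHPROPERTIES
    /* The symbol existed when the binary was built, so the newer call can be
     * compiled in and is chosen only when the device reports 2.0 or newer. */
    if (device_cl_major >= 2)
        return "clCreateCommandQueueWithProperties";
#else
    /* The symbol was missing at build time: the newer call is never
     * compiled in, so the runtime version does not matter. */
    (void)device_cl_major;
#endif
    return "clCreateCommandQueue";
}
```

Built without the macro, the function always picks the older call, which is the "not there during the compile" case described above.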
I don't think that a lot of people will find that too easily... but yes, it was omitted on purpose. After all, I can't just give people everything, they have to find it themselves.
Further, I pushed a small update. I booted Kineta into the rarely-used Windows install on another HDD and tried out the miner myself on Catalyst 15.3 Beta. It works even better on Tahiti and pretty much the same on Hawaii, but it totally butchers the code for Tonga. The upside is that the only GPU with a Tonga chip is the 285, and I don't believe many miners have that. So, I would recommend not only 14.12, but also 15.3. I pushed a change to the README.md file to git to reflect this.
Hashrates on 15.3 at same clocks, with settings shown, for comparison (NSFW):
https://ottrbutt.com/miner/whirlpoolxwolfpub-03232015.png