Looks very good. Have you tweaked the kernel settings or left the defaults there?
I actually rewrote most of it:
- ChaCha and Salsa are now vectorized on GCN; the unroll level is still three for both (see the sketch after this list)
- Blake2s is computed in parallel, too
- Your bytewise copies were left alone for now - the bytewise XORs are now done with uints
- Removed your little AND operation on bufptr
- Replaced your if/else structure for creating the output with a single loop doing a bytewise XOR (yes, it works in 100% of cases)
- Created a BlkMix() function for cleanliness
- Split the work over several kernels
- Added ScratchpadLoad/ScratchpadStore/ScratchpadMix functions for cleanliness and a better striped access pattern in memory (see the striped-store sketch after this list)
- Parallelized the SMix() calls
- Abused the TMTO (time-memory trade-off) vulnerability and made the trade-off factor configurable in the miner (see the sketch after this list)
- Shrunk code size by a lot
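
To give an idea of what "vectorized" means here, this is a rough sketch of a column-parallel ChaCha double-round done with uint4 lanes - it is not the actual kernel code, the state layout and names are just an illustration:

    // Sketch only: the four columns of the ChaCha state live in the lanes of
    // four uint4 rows, so each quarter-round runs on all columns at once.
    #define ROTL32x4(x, n) rotate((x), (uint4)(n))

    void chacha_doubleround_sketch(uint4 *a, uint4 *b, uint4 *c, uint4 *d)
    {
        /* Column round - every lane is one column. */
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 16u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 12u);
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 8u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 7u);
        /* Diagonal round - swizzle the rows so the diagonals line up in the
         * lanes, run the same quarter-round, then swizzle back. */
        *b = (*b).yzwx; *c = (*c).zwxy; *d = (*d).wxyz;
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 16u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 12u);
        *a += *b; *d ^= *a; *d = ROTL32x4(*d, 8u);
        *c += *d; *b ^= *c; *b = ROTL32x4(*b, 7u);
        *b = (*b).wxyz; *c = (*c).zwxy; *d = (*d).yzwx;
    }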
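The striped access pattern, again only as a hedged sketch - the indexing and names are illustrative, not the actual ScratchpadStore:

    // Block-major, then element, then work-item: neighbouring work-items in a
    // wavefront hit neighbouring uint4s, so the global accesses coalesce.
    void scratchpad_store_striped_sketch(__global uint4 *V, const uint4 X[16],
                                         const uint blk, const uint gid,
                                         const uint threads)
    {
        for (uint i = 0; i < 16u; i++)
            V[(blk * 16u + i) * threads + gid] = X[i];
    }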
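And the TMTO idea in very rough form - block counts, names, and the BlkMix signature below are assumptions, the real SMix is structured quite differently:

    void BlkMix(uint4 X[16]);   // assumed prototype for the block-mix step

    // lookup_gap is the configurable trade-off factor: 1 = full scratchpad,
    // 2 = half the memory, 4 = a quarter, at the cost of recomputing blocks.
    void smix_tmto_sketch(uint4 X[16], __global uint4 *V, const uint lookup_gap)
    {
        uint4 T[16];
        /* Fill phase: only every lookup_gap-th block is actually stored. */
        for (uint i = 0; i < 128u; i++) {
            if ((i % lookup_gap) == 0)
                for (uint b = 0; b < 16u; b++) V[(i / lookup_gap) * 16u + b] = X[b];
            BlkMix(X);
        }
        /* Mix phase: load the nearest stored block, recompute the skipped steps. */
        for (uint i = 0; i < 128u; i++) {
            const uint j = X[15].w & 127u;     // data-dependent scratchpad index
            for (uint b = 0; b < 16u; b++) T[b] = V[(j / lookup_gap) * 16u + b];
            for (uint k = 0; k < (j % lookup_gap); k++) BlkMix(T);
            for (uint b = 0; b < 16u; b++) X[b] ^= T[b];
            BlkMix(X);
        }
    }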
EDIT:
Forgot some things:
- SMix() itself has been redesigned from the ground up - it has to fit in the AMD GCN code cache or it shits all over the hashrate (see side note 0)
- Added a macro that moves the Salsa permutation needed for parallel computation outside of the actual Salsa implementation. It's disabled because, for some reason, it sucks ass.
- Worksizes and kernel local size dimensions had to be changed - the miner host code was fixed to accommodate the new kernel
- Host code now allocates twice the size it used to per work item IF you don't use the TMTO - when that option is used, it adjusts the buffer accordingly (halving it, cutting it to one fourth, and so on; rough sketch below)
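
Very roughly, the host-side sizing now looks like this - a hedged sketch only, the 32 KiB-per-stream figure and the names are my illustration, not the miner's actual code:

    #include <CL/cl.h>

    /* Two parallel SMix streams -> twice the old per-item size, then divided
     * by the TMTO factor (1 = full size, 2 = half, 4 = a quarter, and so on). */
    cl_mem alloc_scratchpad_sketch(cl_context ctx, size_t global_threads,
                                   unsigned lookup_gap)
    {
        size_t per_item = (size_t)(2u * 32768u) / lookup_gap;
        cl_int err;
        return clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                              per_item * global_threads, NULL, &err);
    }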
Next up:
- Salsa, ChaCha, Blake2s, SMix, and BlkMix are now all my own code - FastKDF I have heavily modified, but I think it could still use some rewrite work (see side note 1)
- Support could be added for devices that can't do unaligned stores - I'm a lot more likely to do this if the current code performs well on GCN
- Code cleanups - there are a lot of warnings, unused vars, and unused functions lying around, many of them remnants of the original code
Side note 0:
SMix is now done in two separate work-items for one NeoScrypt hash - I did this by adding another kernel dimension (rough sketch below). Code size was a bitch: if the kernel doesn't fit in the code cache, you take a substantial hit. That's not too hard to manage without TMTO support, but I had to structure the code EXTREMELY oddly to FINALLY convince the stupid AMD OpenCL compiler not to repeat the BlkMix code in the binary. It's easy to tell when it does, because code size doubles or more. As an aside, I *could* have dropped to GCN assembly here, since the hardware has true support for function calls, which the nigh-useless OpenCL compiler doesn't fucking use.
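
The work split, as a hedged sketch only - the real kernel looks nothing like this structurally, and the names, sizes, SMix signature, and the Salsa/ChaCha-per-lane assignment are my assumptions:

    void SMix(uint4 X[16], __global uint4 *V, const uint lane);   // assumed prototype

    // Launched with a 2D NDRange where dimension 1 has size 2: lane 0 runs one
    // mixing stream (say Salsa-based), lane 1 the other (ChaCha-based), both
    // for the same nonce, side by side instead of back to back.
    __kernel void neoscrypt_smix_sketch(__global uint4 *V, __global uint4 *state)
    {
        const uint gid  = get_global_id(0);   // which nonce
        const uint lane = get_global_id(1);   // which of the two SMix streams
        uint4 X[16];

        __global uint4 *my_state = state + (gid * 2u + lane) * 16u;
        __global uint4 *my_V     = V     + (gid * 2u + lane) * 16u * 128u;

        for (uint b = 0; b < 16u; b++) X[b] = my_state[b];
        SMix(X, my_V, lane);
        for (uint b = 0; b < 16u; b++) my_state[b] = X[b];
    }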
Side note 1:
FastKDF could probably benefit from a little more LDS (or a different usage of it) - not sure how yet, but it's a hunch; I'll experiment with it. The annoying shit is the scratch registers still present, which force global memory reads/writes. It could also do with dropping some registers on Tahiti/Pitcairn chips. Code size on both runs of FastKDF has room to breathe, so if I have to raise code size a bit to gain in other areas, so be it, as long as it fits in the cache. The annoying and stupid bytewise addressing done by FastKDF can be mitigated, allowing full uint-sized loads/stores (rough sketch below). I've replaced the bytewise XOR with an implementation that does this; the results are excellent, but the scratch regs remain regardless. Code size and branching can likely be cut somewhat too - I'm hoping that making all accesses uint-sized removes the scratch registers the OpenCL compiler keeps using.
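
The uint-sized access trick, sketched very roughly - little-endian assumed (which GCN is), and this is an illustration, not the actual FastKDF code:

    // Build an unaligned 32-bit word from two aligned ones instead of four bytes.
    uint load_u32_unaligned_sketch(const uint *buf, const uint byte_off)
    {
        const uint w     = byte_off >> 2;          // aligned word index
        const uint shift = (byte_off & 3u) << 3;   // 0, 8, 16 or 24 bits
        if (shift == 0) return buf[w];
        return (buf[w] >> shift) | (buf[w + 1] << (32u - shift));
    }

    // The bytewise XOR then becomes one uint per iteration instead of four
    // bytes; only the aligned-destination case is shown, the store side needs
    // the same merging trick.
    void xor_region_sketch(uint *dst, const uint *src, const uint src_byte_off,
                           const uint words)
    {
        for (uint i = 0; i < words; i++)
            dst[i] ^= load_u32_unaligned_sketch(src, src_byte_off + (i << 2));
    }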