It seems like your latest kernel and mine have problems if BFI_INT gets forced off via BFI_INT=false ... the results are invalid every time.
Any idea Phateus?
Perhaps #define Ch(x, y, z) bitselect(x, y, z) is not right?
Edit and solved, non BFI_INT Ch has to be:
#define Ch(x, y, z) bitselect(z, y, x)
If you want to thank someone, you can donate to 1LY4hGSY6rRuL7BQ8cjUhP2JFHFrPp5JVe (Vince -> who did a GREAT job during my kernel development)!
Dia
Awesome, thank you! I was under the assumption that BFI_INT and bitselect were the same operation; apparently the operand order is different. I will fix it in my next release.
Thank you everyone for your support (both in BTC and discussion).
I should have a drop-in version of the kernel available for cgminer soon, so anyone wanting to try out the pre-release, I'll be posting it tonight.
@BOARBEAR
*sigh*.... come on man... do you even read my posts? There is no single cause of the bad performance. 2.2 executes fewer instructions and uses fewer registers than 2.1, but as I said... there is some weird issue which makes OpenCL slower behind the scenes. My best guess is that it has to do with register allocation.
The GPU has a total of 256x32x4 registers (8192 UINT4). At most, there are 256 threads per workgroup (8192/256 = 32 registers per thread). Using VECTORS, the number of registers is far below this limit, so the hardware can run the maximum allowable number of threads at a time. However, when you compile with VECTORS4, there are more than 32 registers per thread. OpenCL must then decide how to allocate the threads, and the utilization of the video card is sub-optimal. Below is a diagram of what I think is going on.
4 thread groups running simultaneously VECTORS (2 running at a time)
[1111111122222222]
[3333333344444444]
using an optimal version of VECTORS4, it would look much like this (double the work is done per thread)
[1111111111111111]
[2222222222222222]
[3333333333333333]
[4444444444444444]
Now, making it use slightly fewer resources will make it slower, because the threads are out of sync and there is overhead in syncing and tracking data within threadgroups:
[1111111111111112]
[2222222222222233]
[3333333333333444]
[4444444444445555]
Now, I may be waaaaay off here, but something like this is what makes sense to me, especially since it would explain why decreasing the memory actually improves performance in some cases (by forcing synchronization).
Anyway, enough of my off-topic analysis...
I will release a version that will work with cgminer early next week (looks like he has already implemented diapolo's old version).
Looking forward to this !!
Just sent one coin your way, and there's another once the work is done.
We are hitting a ceiling with opencl in general (and perhaps with the current hardware). In one of the mining threads, vector76 and I were discussing the theoretical limit on hashing speeds... and unless there is a way to make the Maj() operation take 1 instruction, we are within about a percent of the theoretical limit on minimum number of instructions in the kernel unless we are missing something.
Out of curiosity, have you looked into trying to code a version directly in AMD's assembly language and bypassing OpenCL entirely? (I'm thinking: since we're already patching the ELF output, this seems like the logical next step.)
Also, have you looked at AMD CAL? I know this is what ufasoft's miner uses (https://bitcointalk.org/index.php?topic=3486.500), and it is also what zorinaq considers the most efficient way to access AMD hardware (somewhere on http://blog.zorinaq.com).
Replacing one instruction in the ELF with another that uses the exact same inputs/outputs is one thing, but manually editing the ASM code is another thing entirely. Besides, with the work that has been done, the GPU is already at >99% of the theoretical maximum throughput (ALU packing). And as said above, we are also close to the theoretical minimum number of instructions to correctly run SHA-256.
Also, if you look near the end of the hdminer thread you will notice that users are able to get the same hashrates from phatk on 69xx. For 58xx and other VLIW5 cards phatk is significantly faster than hdminer. If that's the best he can do with CAL then I don't see any reason to use it. hdminer had a substantial performance advantage back in March/April, but with basically every miner supporting BFI_INT this is no longer the case.
Agreed, the kernel itself is pretty optimal. I might look into calling lower-level CAL functions to manage the (OpenCL-compiled) GPU threads instead of using OpenCL, but I doubt this will give any speedup (although I might be able to reduce the CPU overhead).