Hi All,
Important things to the top:
* I slightly updated the HetPas150111_Groestl.zip -> MH/s values are now the same as in SG.
* I've updated the main page with benchmark data I've collected:
http://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/* I've uploaded the diamondTahiti binary, so now there are 2 precompiled bins, thank you utahjohn!
* My MH/s missunderstanding.
Thany you all for the investigations, now I see it clearly.
When I tried the groestlcoin.cl on my card on 14.9 and it ran on 2 MH/s. If I convert the 25MH/s from R9 290 down to my HD7770, then I should have got 4 MH/s.
And here comes my bad decision:
I didn't believed that the 14.6->14.9 changes were so bad that they slowed the kernel more than 2x. Actually it was 2.6x slower than my expectations.
And because the algorithm contains technically 2 hash calculations I thought that multiplying by 2 gives me the correct MH/s.
But as it turned out they indeed broke 14.9 so badly.
So If I ever thought about hating ocl, now I hate it more than twice. To be precise I hate it 2.6x more.
But on the optimistic side because of 14.9 made an exceptional quality of cr4p out of the ocl kernel, that gave me the false feel of success to continue optimizing, haha.
Anyways, I'm happy that it is solved now.
* HetPas and Catalyst version
When you compile an ASM kernel, my compiler generates a pure binary (and some parameters eg. LDS size)
In order to make it run it have to generate a complicated ELF binary image, so it will ask for one from the OpenCL compiler.
This small skeleton kernel contains the kernel parameters that you request in the assembly source.
For this groestl kernel I supply it a special skeleton.cl (see below in this post).
So when CpenCL compiled this small skeleton kernel, my program will patch the binary and other parameters into is. Also cut out every unwanted parts such as ocl, llwmir, amd_il sections. There is even a few kilobytes of zeroes in the ELF just to be compatible with terribly old hardware, I cut that out too.
And because I use the current OpenCL system, that's why the produced binary will be only compatible with that kind of hardware.
* Binary kernels and Catalyst versions
AFAIK when a kernel binary os loaded by clBuildKernei it doesn't check if it is compatible by cat version. Or any other version number.
So the binary is quiet transferable between versions.
When incompatibility occurs that can be caused by these things:
- driver developers changed the ELF file structure (for example they removed some sections: in 13.4 they removed the amd_il section from the inner ELF image. Yes, it is an ELF inside an ELF.
) This can cause an error ow access violation when loading the kernel.
- driver developers changed the way/format kernel parameters are passed. This kind of incompatibility can causes a crash on the GPU.
So it doesn't matter that you compile with hetpas on 14.7, I just wrote 14.9 on my blog because I was 100% sure that my program works on 14.9
* "cross compile" option
Yea, it would be a nice feature. To do it I need binaries from all hardware, so I can 'dissect' them and maybe find out how to produce them manually.
I'm not going to understand the complete binary structure as amd can change it any time, and they must do it when they improve things anyways.
I only want to inject GCN binary into the hardware as simple as I can.
But with analyzing different binaries maybe I can find out how to change a binary to be compatible with a specific hardware.
For example If there are too much hardware dependent options that also depends on the kernel's parameters, then it's impossible to do without fully understanding how parameters are exchanged between the driver and the (specific) hardware.
* 32bit/64bit
Ok, now I understand. HetPas is all 32bit, so I haven't noticed there can be 64bit ELF's too.
I can guess that the Linux driver uses a an API of the OS to access ELF contents and that's why 32/64nit is important...
Please compile this kernel to a 64 bit binary and send me:
__attribute__((reqd_work_group_size(256, 1, 1)))
void search(__global unsigned char* block, volatile __global uint* output, const ulong target)
{ if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; }
* "neoscrypt kernel"
Is this similar to LiteCoin?
1 year ago I played with LiteCoin's salsa, It was fun, but I wasn't able to outperform opencl.
But in the future I have plans to make a special salsa that will use LDS instead of the slow ram. This will be an interesting experiment as I gonna have to try some assembly exclusive things in order to outperform the original kernel:
- To be able to use 64KB lds for one thread I'll have to connect wavefront pairs to share their 32KB allocs with each other. For this I have to know that the current wavefront is running on which compute unit (s_get_hwreg).
- synching the two kernels on each CU individually will require some research. (GCN has an awesome global wave synch feature by hardware, so maybe there is something for 'local' too. If not, maybe I can poll GDS)
- because only one 'thread' will work actively on a CU, there'll be no latency hiding, so I have to program the kernel in a paralell way (but, no probs, I'll have all the 256 regs...)
- By the textbook: LDS throughput is 64x better(IMO it's not) than MEM throughput on a HD7970. So this would be the benefit.
- threads in workitems can copy register data from each other. So while I calculate only 1 salsa using the 2*32KB LDS for lookup (lookup_gap=2), I can spread data across more lanes on the wavefront and make calculations in paralell.
I've just checked neoscrypt.cl, it's insane
But if I see it well, the half of it is SALSA.
* "Guys! We do not need more optimization!"
I've thought about this too. But I think if everyone use better kernels, then everyone will use the same power to get the same profit as difficulty will be harder but mining will require less power.
But what if not everyone uses the faster kernel. I think my compuler/IDE is helping in this a lot, as it is kinda user unfriendly
* Just a question about LiteCoin
Do you know that is it worth to optimize it on GPU? Or too many FPGA/ASIC there too?
I'm just curious only. I'd like to play that salsa algo, but my free time is running out soon.
* @qwep1 ### RESULT IS WRONG ####
Something is totally went bad there.
Tahiti is tested already, and the 'elapsed' is ok too, but in the memory dump that is garbage. What Catalyst are you using? Is the memory clock setting ok?
* kernel_dump\ folder is in the same folder as the groestl_isa.hpas program that you're running in HetPas.
* "AVOID 14.9 like the plague"
Haha, I'll try 14.12 now.