[ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 11.

utahjohn

hero member

Activity: 630

Merit: 500

forget litecoin and all scrypt coins, they are asic territory now and GPU mining pointless on them. Wolf0 can explain what he did to optimize neoscrypt, It was some major improvements ... I might be able to dig up a link ...

Here is last neoscrypt OCL I got from wolf0
https://mega.co.nz/#!cFEGTBBY!snQhOeLs6E_giKx2rY_i7XNcv95dASkrrRzlDOq7fIE
and some update from the forum
https://forum.feathercoin.com/index.php?/topic/7780-dev-neoscrypt-gpu-miner-public-beta-test/page-41#entry71777

realhet

newbie

Activity: 32

Merit: 0

Hi All,

Important things to the top:
* I slightly updated the HetPas150111_Groestl.zip -> MH/s values are now the same as in SG.
* I've updated the main page with benchmark data I've collected: http://realhet.wordpress.com/gcn-asm-groestl-coin-kernel/
* I've uploaded the diamondTahiti binary, so now there are 2 precompiled bins, thank you utahjohn!

* My MH/s missunderstanding.
Thany you all for the investigations, now I see it clearly.
When I tried the groestlcoin.cl on my card on 14.9 and it ran on 2 MH/s. If I convert the 25MH/s from R9 290 down to my HD7770, then I should have got 4 MH/s.
And here comes my bad decision:
I didn't believed that the 14.6->14.9 changes were so bad that they slowed the kernel more than 2x. Actually it was 2.6x slower than my expectations.
And because the algorithm contains technically 2 hash calculations I thought that multiplying by 2 gives me the correct MH/s.
But as it turned out they indeed broke 14.9 so badly.
So If I ever thought about hating ocl, now I hate it more than twice. To be precise I hate it 2.6x more. Cheesy

But on the optimistic side because of 14.9 made an exceptional quality of cr4p out of the ocl kernel, that gave me the false feel of success to continue optimizing, haha.
Anyways, I'm happy that it is solved now.

* HetPas and Catalyst version
When you compile an ASM kernel, my compiler generates a pure binary (and some parameters eg. LDS size)
In order to make it run it have to generate a complicated ELF binary image, so it will ask for one from the OpenCL compiler.
This small skeleton kernel contains the kernel parameters that you request in the assembly source.
For this groestl kernel I supply it a special skeleton.cl (see below in this post).
So when CpenCL compiled this small skeleton kernel, my program will patch the binary and other parameters into is. Also cut out every unwanted parts such as ocl, llwmir, amd_il sections. There is even a few kilobytes of zeroes in the ELF just to be compatible with terribly old hardware, I cut that out too.
And because I use the current OpenCL system, that's why the produced binary will be only compatible with that kind of hardware.

* Binary kernels and Catalyst versions
AFAIK when a kernel binary os loaded by clBuildKernei it doesn't check if it is compatible by cat version. Or any other version number.
So the binary is quiet transferable between versions.
When incompatibility occurs that can be caused by these things:
- driver developers changed the ELF file structure (for example they removed some sections: in 13.4 they removed the amd_il section from the inner ELF image. Yes, it is an ELF inside an ELF. Cheesy

) This can cause an error ow access violation when loading the kernel.
- driver developers changed the way/format kernel parameters are passed. This kind of incompatibility can causes a crash on the GPU.
So it doesn't matter that you compile with hetpas on 14.7, I just wrote 14.9 on my blog because I was 100% sure that my program works on 14.9

* "cross compile" option
Yea, it would be a nice feature. To do it I need binaries from all hardware, so I can 'dissect' them and maybe find out how to produce them manually.
I'm not going to understand the complete binary structure as amd can change it any time, and they must do it when they improve things anyways.
I only want to inject GCN binary into the hardware as simple as I can.
But with analyzing different binaries maybe I can find out how to change a binary to be compatible with a specific hardware.
For example If there are too much hardware dependent options that also depends on the kernel's parameters, then it's impossible to do without fully understanding how parameters are exchanged between the driver and the (specific) hardware.

* 32bit/64bit
Ok, now I understand. HetPas is all 32bit, so I haven't noticed there can be 64bit ELF's too.
I can guess that the Linux driver uses a an API of the OS to access ELF contents and that's why 32/64nit is important...
Please compile this kernel to a 64 bit binary and send me:
__attribute__((reqd_work_group_size(256, 1, 1)))
void search(__global unsigned char* block, volatile __global uint* output, const ulong target)
{ if(target>0) output[get_local_id(0)] = block[get_global_id(0)]; }

* "neoscrypt kernel"
Is this similar to LiteCoin?
1 year ago I played with LiteCoin's salsa, It was fun, but I wasn't able to outperform opencl.
But in the future I have plans to make a special salsa that will use LDS instead of the slow ram. This will be an interesting experiment as I gonna have to try some assembly exclusive things in order to outperform the original kernel:
- To be able to use 64KB lds for one thread I'll have to connect wavefront pairs to share their 32KB allocs with each other. For this I have to know that the current wavefront is running on which compute unit (s_get_hwreg).
- synching the two kernels on each CU individually will require some research. (GCN has an awesome global wave synch feature by hardware, so maybe there is something for 'local' too. If not, maybe I can poll GDS)
- because only one 'thread' will work actively on a CU, there'll be no latency hiding, so I have to program the kernel in a paralell way (but, no probs, I'll have all the 256 regs...)
- By the textbook: LDS throughput is 64x better(IMO it's not) than MEM throughput on a HD7970. So this would be the benefit.
- threads in workitems can copy register data from each other. So while I calculate only 1 salsa using the 2*32KB LDS for lookup (lookup_gap=2), I can spread data across more lanes on the wavefront and make calculations in paralell.
I've just checked neoscrypt.cl, it's insane Cheesy

But if I see it well, the half of it is SALSA.

* "Guys! We do not need more optimization!"
I've thought about this too. But I think if everyone use better kernels, then everyone will use the same power to get the same profit as difficulty will be harder but mining will require less power.
But what if not everyone uses the faster kernel. I think my compuler/IDE is helping in this a lot, as it is kinda user unfriendly Cheesy

* Just a question about LiteCoin
Do you know that is it worth to optimize it on GPU? Or too many FPGA/ASIC there too?
I'm just curious only. I'd like to play that salsa algo, but my free time is running out soon.

* @qwep1 ### RESULT IS WRONG ####
Something is totally went bad there.
Tahiti is tested already, and the 'elapsed' is ok too, but in the memory dump that is garbage. What Catalyst are you using? Is the memory clock setting ok?
* kernel_dump\ folder is in the same folder as the groestl_isa.hpas program that you're running in HetPas.

* "AVOID 14.9 like the plague"
Haha, I'll try 14.12 now.

utahjohn

hero member

Activity: 630

Merit: 500

Note Intel GPU can be used for other algo such as X11, neoscrypt. Now that AMD is on gpu-platform 0, you can try re-enable intel and see if it will pop up on gpu-platform 1.
check with sgminer -n in a command prompt window.
to specify which platform to use on sgminer command line --gpu-platform 0, or 1 ...
display # counts still start from 0 on each gpu-platform.

I have heard of ppl also running nvidia cards in same box with AMD, yet another gpu-platform selection ...

JuanHungLo

hero member

Activity: 935

Merit: 1001

I don't always drink...

Quote from: physixz on January 12, 2015, 08:27:09 PM

Thanks, I wasnt disabling the intel in UEFI which was my problem. its working now at 26.5MH/s per card which is amazing.

I'm using 14.7r3, xI 2048, 1100/150, -w 256 undervolted to 1.00 and getting 23.38 MH/s. What's your config?

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: physixz on January 12, 2015, 08:27:09 PM

Thanks, I wasnt disabling the intel in UEFI which was my problem. its working now at 26.5MH/s per card which is amazing.

Pallas is getting 30MHs on 290 with realhet ASM kernel ... so some further tuning now, play with intensity, gpu clock, drop mem clock to lowest possible (150 on my 280x).

physixz

newbie

Activity: 13

Merit: 0

Thanks, I wasnt disabling the intel in UEFI which was my problem. its working now at 26.5MH/s per card which is amazing.

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: physixz on January 12, 2015, 06:43:48 PM

when i run HetPas it doesn't detect the graphics cards even though im running 14.9 drivers. anybody know why?

i get either Runtime error: openCL error: CL_Device_not_found or no GCN device found when i re-enable the intel integrated graphics. i am running 3 R9 290's and ive tried 14.9 and 14.12 beta drivers and neither work

I had to disable intel onboard graphics. uninstall all drivers and reinstall 14.7RC3. What is happening is your AMD cards are being on wrong gpu-platform 1 in my case and Intel was gpu-platform 0.
Hetpas appears to be looking only on gpu-platform 0
completely uninstall all display drivers with DDU and then go to BIOS and disable onboard intel. When AMD cards redetect they will appear on gpu-platform 0

AVOID 14.9 like the plague, it's OCL compiler is retarded.

physixz

newbie

Activity: 13

Merit: 0

when i run HetPas it doesn't detect the graphics cards even though im running 14.9 drivers. anybody know why?

i get either Runtime error: openCL error: CL_Device_not_found or no GCN device found when i re-enable the intel integrated graphics. i am running 3 R9 290's and ive tried 14.9 and 14.12 beta drivers and neither work

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: pallas on January 12, 2015, 04:35:34 PM

The new version compiles fine, but of the two GPUs only id 1 works, id 0 doesn't produce any valid work unit.
Speed: r9 290 30Mh/s, r9 290x 33Mh/s (1100 MHz)
My experimental opencl kernel is a couple percent faster.

care to share newest incarnation of OCL ? PM me a link for personal use only

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

The new version compiles fine, but of the two GPUs only id 1 works, id 0 doesn't produce any valid work unit.
Speed: r9 290 30Mh/s, r9 290x 33Mh/s (1100 MHz)
My experimental opencl kernel is a couple percent faster.

utahjohn

hero member

Activity: 630

Merit: 500

@realhet
OK a few things I have discovered:
1. Hetpas does compile and run ok on 14.7RC3.
So no need to install 14.9

2. Test Runs:
Target: Tahiti core:1150 MHz cu:32 ram:3072 MB uid:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (or ADL) instead!

Using original OpenCL code
Kernel binary saved: C:\Miners\HetPas150111_Groestl\groestl\kernel_dump\kernel.elf

elapsed: 72.626 ms 36.095 MH/s gain: 9.02x
elapsed: 70.712 ms 37.072 MH/s gain: 9.27x
elapsed: 70.718 ms 37.069 MH/s gain: 9.27x
elapsed: 70.741 ms 37.057 MH/s gain: 9.26x

Functional test: RESULT IS OK

Target: Tahiti core:1150 MHz cu:32 ram:3072 MB uid:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (or ADL) instead!

Using new GCN ASM code
Kernel binary saved: C:\Miners\HetPas150111_Groestl\groestl\kernel_dump\kernel.elf

elapsed: 53.629 ms 48.881 MH/s gain: 12.22x
elapsed: 50.666 ms 51.740 MH/s gain: 12.93x
elapsed: 50.677 ms 51.729 MH/s gain: 12.93x
elapsed: 50.660 ms 51.746 MH/s gain: 12.94x

Functional test: RESULT IS OK

3. Calculated speed gain is close to actual speed gain of 1.40x as shown running sgminer

4. First run of OCL should be reference value of 1.0x to do proper comparison, this needs to be reset in hetpas for each architecture.

5. Your timing calculations appear to be wrong. Single 280x OCL is 18.5MHs, Single 280x ASM is 26.0MHs.
Are you sure hetpas is not using BOTH of the cards in my test box when running tests? I am mining in sgminer with SINGLE card, other is turned off and used in another instance of sgminer mining neoscrypt ...

qwep1

hero member

Activity: 610

Merit: 500

where there is a folder kernel_dump\ ???I can not find

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: physixz on January 12, 2015, 01:59:14 PM

Whats the best driver version to use as i can only get 11MH/s from my R9 290

14.6b or 14.7

Or use the precompiled binary.

physixz

newbie

Activity: 13

Merit: 0

Whats the best driver version to use as i can only get 11MH/s from my R9 290

qwep1

hero member

Activity: 610

Merit: 500

Target: Tahiti Series: 7 Core:1100 MHz CU:32 RAM:3072 MB UID:4098
ext: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_d3d10_sharing cl_khr_d3d11_sharing cl_khr_dx9_media_sharing cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event
* core MHz value is not always accurate, use Catalyst Control Center (ADL) instead!

elapsed: 69.778 ms 37.568 MH/s gain: 9.39x
elapsed: 54.247 ms 48.324 MH/s gain: 12.08x
elapsed: 54.269 ms 48.305 MH/s gain: 12.08x
elapsed: 54.236 ms 48.334 MH/s gain: 12.08x
############### RESULT IS WRONG ###################
   idx hi lo hi lo
   0: 00000000 00000000 0 0
   1: 00000000 00000000 0 0
   2: 00000000 00000000 0 0
   3: 00000000 00000000 0 0
   4: 00000000 00000000 0 0
   5: 00000000 00000000 0 0
   6: 00000000 00000000 0 0
   7: 00000000 00000000 0 0
   8: 00000000 00000000 0 0
   9: 00000000 00000000 0 0
   A: 00000000 00000000 0 0
   B: 00000000 00000000 0 0
   C: 00000000 00000000 0 0
   D: 00000000 00000000 0 0
   E: 00000000 00000000 0 0
   F: 00000000 00000000 0 0
   10: A9A41A9D 9337706F -1448863075 -1825083281
   11: 370D1AF4 DD743586 923605748 -579586682
   12: CB7EB389 EADF9917 -880888951 -354445033
   13: 25FA6A42 76EDCD1E 637168194 1995296030
   14: 91783455 C7EE8F10 -1854393259 -940667120
   15: F60C362A FD9AFAB3 -166971862 -40174925
   16: 038C0C0F D2E4564F 59509775 -756787633
   17: EA28DD29 3A1B41CA -366420695 974864842
   18: 708C1E9A DFCDC04F 1888231066 -540164017
   19: 00000000 A7B76679 0 -1481152903
   1A: 00000000 00000000 0 0
   1B: 00000000 00000000 0 0
   1C: 00000000 00000000 0 0
   1D: 00000000 00000000 0 0
   1E: 00000000 00000000 0 0
   1F: 00000000 00000000 0 0
this is normal or am I doing something wrong

Quote

do not get me compile a file

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: ?? on ??

@Pallas
It is extremely rare for me to see any orphan when solo mining so I would venture to guess your network is too slow.

probably too few nodes nearby: I have 20/30 msec round trip time to big internet nodes in my country.
having few fast nodes nearby means my blocks take a lot of time to spread thru the diamond network.
or a lot of bad luck Cheesy

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: Star65 on January 12, 2015, 09:50:10 AM

Guys! We do not need more optimization! If all we get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease. Sad

Faster kernel good for dev only (as a reward for their hard work), i think so.

Not everyone will use new kernel so there is an advantage. Yes diff will go up some. Also as diff goes up many miners will drop like dead flies, so It will even out ...
Tell all your friends to Cloudmine/Multipool mine and stop direct mining, this will lower diff for diehard solo miners

3 blocks DMD since I started ASM kernel last night ...

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: Star65 on January 12, 2015, 09:50:10 AM

Guys! We do not need more optimization! If all we get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease. Sad

Faster kernel good for dev only (as a reward for their hard work), i think so.

true....
until you have half the hashpower by a couple fpga miners (or so they say) ;-)

Star65

member

Activity: 109

Merit: 13

Guys! We do not need more optimization! If all we get a faster kernel, then the difficulty will increase proportionally. Accordingly, we will not get more coins, but will pay more for electricity. Profits will only decrease. Sad

Faster kernel good for dev only (as a reward for their hard work), i think so.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

HOW TO TELL IF AN SGMINER BIN FILE IS 32 OR 64 BIT

If the filename, generated by sgminer, ends in l4.bin it is 32 bit (8 x 4 = 32)
If the filename, generated by sgminer, ends in l8.bin it is 64 bit (8 x 8 = 64)

They are incompatible.

Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 11. (Read 61261 times)