[ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 14.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 08, 2015, 03:08:22 PM

Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, and it runs at half speed (exactly half speed)?

Intensity 24 is too much, I'd stay between 20 and 22, otherwise you'll produce a lot of rejected shares (or orphans if solo mining).
The shaders option is ignored for groestl.
The hashrate should be calculated on the full computation, i.e. 2 chained hashes.
What kernel are you using?
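
A minimal sketch of the arithmetic behind the "2 chained hashes" remark. It assumes (this is not confirmed in the thread) that the 4 MH/s estimate counted individual Groestl evaluations, while the miner counts one hash per full double-Groestl computation. The function name is made up for illustration.

```python
def full_hashrate(single_hashes_per_sec, chained_hashes=2):
    """Convert a rate of individual Groestl evaluations into full hashes per second.

    Assumption: one "full" hash is `chained_hashes` Groestl evaluations run
    back to back (2 for the chained computation mentioned above).
    """
    if chained_hashes <= 0:
        raise ValueError("chained_hashes must be positive")
    return single_hashes_per_sec / chained_hashes


# 4 million single Groestl evaluations per second, counted as double hashes
reported = full_hashrate(4_000_000)
print(f"{reported / 1e6:.1f} MH/s")
```

Under that assumption, an estimate of 4 MH/s in single evaluations and a displayed 2 MH/s describe the same kernel speed.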

utahjohn

hero member

Activity: 630

Merit: 500

Quote from: realhet on January 08, 2015, 03:08:22 PM

Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, and it runs at half speed (exactly half speed)?

Are you sure it is running your kernel? Look in your sgminer dir for a .bin file generated by OCL; it may be running the default groestlcoin OCL. Delete the .bin and replace it with your own generated one of the same name; it will not be regenerated if it exists in the dir. You must delete the .bin whenever you change configs to force an OCL recompile ... but you don't want that, you want to run your asm kernel ... so you will have to figure out the parameter passing from sgminer ...
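
A small sketch of the "delete the .bin to force a recompile" step. The `*.bin` pattern and the helper name are assumptions for illustration; check what your sgminer build actually writes before deleting anything. It defaults to a dry run that only lists the candidates.

```python
from pathlib import Path


def clear_cached_kernels(miner_dir, pattern="*.bin", dry_run=True):
    """List (and optionally delete) cached kernel binaries in the miner directory.

    Returns the sorted file names that matched. Nothing is deleted unless
    dry_run is False.
    """
    matched = []
    for path in sorted(Path(miner_dir).glob(pattern)):
        if path.is_file():
            if not dry_run:
                path.unlink()  # the miner should rebuild the kernel on next start
            matched.append(path.name)
    return matched


# Example (hypothetical path): print(clear_cached_kernels("/path/to/sgminer"))
```

To run a hand-built kernel instead, the file would be replaced rather than deleted, as described above.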

realhet

newbie

Activity: 32

Merit: 0

Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, and it runs at half speed (exactly half speed)?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 08, 2015, 10:43:38 AM

Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get a username from? What parameters should I use other than -k groestl and -d 1?

Probably they removed it, I'm using an older version.
I run it like this, for solo mining:

sgminer -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:GROESTLCOIN_RPC_PORT -u YOURUSER -p YOURPASSWORD

Then you have to find and add your best intensity and worksize (my OS kernel works with 256 only).
username and password are set in groestlcoin.conf; the port you can easily find in their thread (or via netstat).
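
A hedged sketch of how the pieces above fit together: read the credentials from the text of groestlcoin.conf and assemble the solo-mining command. It assumes the conf file uses Bitcoin-style `rpcuser=` / `rpcpassword=` lines; the helper names are made up, and the port is passed in by hand because it has to be looked up separately. Note that 0.0039062500 is exactly 1/256.

```python
def parse_conf(text):
    """Parse simple key=value lines, skipping blanks and lines starting with '#'."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def solo_command(settings, port, host="localhost"):
    """Build the argument list for the solo-mining command shown above."""
    return [
        "sgminer", "-k", "groestlcoin",
        "--difficulty-multiplier", "0.0039062500",  # 1/256
        "-o", f"http://{host}:{port}",
        "-u", settings["rpcuser"],
        "-p", settings["rpcpassword"],
    ]


# Placeholder values only; substitute your own conf contents and port.
example_conf = "rpcuser=YOURUSER\nrpcpassword=YOURPASSWORD\n"
print(" ".join(solo_command(parse_conf(example_conf), 1441)))
```

Intensity and worksize options would then be appended once you have found values that work for your card.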

realhet

newbie

Activity: 32

Merit: 0

Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get a username from? What parameters should I use other than -k groestl and -d 1?

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 08, 2015, 07:55:13 AM

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

There is a simple "benchmark" option:

--benchmark Run sgminer in benchmark mode - produces no shares

realhet

newbie

Activity: 32

Merit: 0

Now I managed to build sgminer 5.1 on my system. I still have to make my kernel work with it.

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

"I'd need the linux version of the assembler."
Sorry, it's impossible. It's not even written in C++, so it can't be compiled on any system other than Windows.

And to make things more complicated :D

You would have to compile with it for every type of GCN card, multiplied by every Catalyst driver version that AMD's developers have changed. My compiler only patches the binary into the .elf; the actual .elf file is generated by the current Catalyst driver for the currently selected graphics card.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 08, 2015, 04:59:22 AM

25% seems like only recovering the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it uses only 123 VGPRs and the code size is only 28KB? Or has it started to use scratch registers (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770, but I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instruction set from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure it will bring the 3.48x speedup on the R9 290x too, because each CU can work on its own using its LDS, L1 cache and ICache. So if the current ocl code on the R9 290x runs at 20MH/s, then the latest asm code should run at 70MH/s.

25% compared to 14.6, it's 43% compared to 14.9.
No scratch reg use (when I triggered it a couple times, it slowed down to less than 1 Mh/s).
I'd like to try your asm code myself, but I'd need the linux version of the assembler.
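
The numbers in this exchange can be checked with a few lines of arithmetic. This is only a sketch: the 20 MH/s baseline is the hypothetical figure from the quoted post, and the driver comparison assumes the two percentages refer to the same new kernel measured against the old kernel under Catalyst 14.6 and 14.9 respectively, which the thread does not state explicitly.

```python
def projected_rate(baseline_mhs, speedup):
    """Scale a baseline hashrate by a claimed speedup factor."""
    return baseline_mhs * speedup


def implied_old_kernel_ratio(gain_vs_first, gain_vs_second):
    """Ratio of the old kernel's speed on the second driver to the first driver.

    Assumes the new kernel runs at the same speed in both comparisons, so
    gain 0.25 vs the first driver and 0.43 vs the second imply a slower
    old kernel on the second driver.
    """
    return (1 + gain_vs_first) / (1 + gain_vs_second)


print(f"{projected_rate(20, 3.48):.1f} MH/s projected")
ratio = implied_old_kernel_ratio(0.25, 0.43)
print(f"old kernel on 14.9 runs at about {ratio:.1%} of its 14.6 speed")
```

The first figure is where the "70MH/s" estimate comes from; the second suggests the old OpenCL kernel lost roughly an eighth of its speed going from 14.6 to 14.9, if the stated assumption holds.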

realhet

newbie

Activity: 32

Merit: 0

25% seems like only recovering the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it uses only 123 VGPRs and the code size is only 28KB? Or has it started to use scratch registers (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770, but I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instruction set from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure it will bring the 3.48x speedup on the R9 290x too, because each CU can work on its own using its LDS, L1 cache and ICache. So if the current ocl code on the R9 290x runs at 20MH/s, then the latest asm code should run at 70MH/s.

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: realhet on January 05, 2015, 11:39:59 PM

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D

Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70MH/s on a 290x beast.

You first need to have the target passed to the kernel.
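
To illustrate what "the target" is used for: the miner only reports a nonce when the resulting hash, read as a number, is at or below the target. The sketch below shows that comparison on the host side. It assumes the Bitcoin-style convention of reading the 32-byte hash as a little-endian 256-bit integer; the function name is made up, and how sgminer actually hands the target to a kernel is not covered here.

```python
def meets_target(hash_bytes, target):
    """Return True if a 32-byte hash, read as a little-endian integer, is <= target.

    Assumption: Bitcoin-style convention where the last bytes of the hash
    are the most significant ones.
    """
    if len(hash_bytes) != 32:
        raise ValueError("expected a 32-byte hash")
    return int.from_bytes(hash_bytes, "little") <= target


# Hypothetical target: the top 16 bits of the 256-bit value must be zero.
target = (1 << 240) - 1
good = bytes([0xFF] * 30 + [0x00, 0x00])  # most significant two bytes are zero
bad = bytes([0x00] * 31 + [0x01])         # most significant byte is non-zero
print(meets_target(good, target), meets_target(bad, target))
```

A kernel usually performs a cheaper version of this test on the most significant word only, which is why it needs the target as an input.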

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 08, 2015, 02:36:35 AM

And you have the first/last round optimizations, so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?

Unfortunately it's only about 25% faster, but we should compare apples to apples: could you try your code on the hawaii chipset so we have a constant testbed?
Now I'm working on further first round optimizations; they bring little improvement but are still worth it imho.

realhet

newbie

Activity: 32

Merit: 0

And you have the first/last round optimizations, so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?

realhet

newbie

Activity: 32

Merit: 0

"On hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)"

Then it got all the goodies: vgprs, icache and 2 RAM + 6 LDS reads. The speedup must be the same 3.5x! Is it that much?

It must be good on small cards too; the only important difference is the number of CUs anyway.

utahjohn

hero member

Activity: 630

Merit: 500

new optimized CL or a BIN? (I'll test on 280x and 7950).

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

On hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)
I believe the asm version is still faster on hawaii, and of course much faster on smaller cards.

qwep1

hero member

Activity: 610

Merit: 500

And will there be a version for Windows?

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Does your assembler support self-modifying code? ;)

Then you can use the instruction cache as a precalc buffer as well. The advantage is that most GPUs can read from the instruction cache in parallel with the level 1 cache.

pallas

legendary

Activity: 2716

Merit: 1094

Black Belt Developer

Quote from: realhet on January 05, 2015, 11:39:59 PM

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D

Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70MH/s on a 290x beast.

Thanks for the update.
I've been using Linux only for many years now, so I can't help you with compiling on Windows; just know it's trivial to compile the miner on Linux, and it runs in a terminal so it doesn't need Qt.
About the software version, I prefer the good old sph-sgminer, which is based on sgminer 4.1 (I modified it a bit), but you can use the latest sgminer 5.X as well.
To test the kernel you can simply point it to a pool, printf the hash or whatever.
Back to my opencl effort, I've reduced the number of vgprs to 147 but I'm struggling to get past that.

realhet

newbie

Activity: 32

Merit: 0

(Oops, an important part was missing from my blog post -> now it's corrected)

realhet

newbie

Activity: 32

Merit: 0

Hi, I think I'm done with the things I wanted to try. It's at 3.48x now ;D

Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining, so please help me. Is it the popular sgminer? Can I compile it with Qt 5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really reaches 70MH/s on a 290x beast.

Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 14. (Read 61261 times)