
Topic: [ANN][GRS][DMD][DGB] Pallas optimized groestl opencl kernels - page 14. (Read 61229 times)

legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces about 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?

Intensity 24 is too much, I'd stay between 20 and 22, otherwise you'll produce a lot of rejected shares (or orphans if solo mining).
The shaders option is ignored for groestl.
The hashrate should be calculated on the full computation, i.e. 2 chained hashes.
What kernel are you using?
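
A quick sketch of the arithmetic behind that answer, assuming the 4 MH/s figure counted single Groestl invocations while the miner counts one full two-hash computation per nonce (the function name is only illustrative):

```python
def reported_hashrate(single_hash_rate_mhs: float, hashes_per_nonce: int = 2) -> float:
    """Convert a raw single-hash rate into the per-nonce rate a miner would report.

    Assumption: one tested nonce costs `hashes_per_nonce` chained hashes.
    """
    return single_hash_rate_mhs / hashes_per_nonce


# 4 MH/s of single Groestl hashes is 2 MH/s of chained double hashes.
print(reported_hashrate(4.0))  # 2.0
```

So an "exactly half" reading is what you would expect if the kernel is fine and only the counting differs.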
hero member
Activity: 630
Merit: 500
Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces about 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?
Are you sure it is running your kernel? Look in your sgminer dir for a .bin file generated by OCL; it may be running the default groestlcoin OCL. Delete the .bin and replace it with your own generated one of the same name; it will not be regenerated if it already exists in the dir. You must delete the .bin whenever you change configs to force an OCL recompile ... but you don't want that, you want to run your asm kernel ... so you will have to figure out the parameter passing from sgminer ...
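
For the "delete the .bin" step, here is a minimal helper sketch. The assumption (not confirmed here) is that the cached binaries sit in the miner directory and start with the kernel name; the function name and the glob pattern are hypothetical, so check what your sgminer build actually writes before deleting anything.

```python
from pathlib import Path


def remove_stale_binaries(miner_dir, kernel_name, dry_run=True):
    """List (and optionally delete) cached kernel binaries for one kernel.

    Assumption: cached files are named '<kernel_name>*.bin' in miner_dir.
    With dry_run=True nothing is deleted; the matches are only returned.
    """
    found = sorted(
        p for p in Path(miner_dir).glob(f"{kernel_name}*.bin") if p.is_file()
    )
    if not dry_run:
        for path in found:
            path.unlink()  # forces the miner to rebuild on the next start
    return found
```

Calling `remove_stale_binaries(".", "groestlcoin")` first shows what would go; pass `dry_run=False` once the list looks right.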
newbie
Activity: 32
Merit: 0
Thanks for help. Now it runs, and I found that this command produces the best results:
sgminer -d 1 -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:1441 -u u -p p --shaders 1280 --worksize 256 -g 1 --intensity 24

It produces about 2 MH/s on average, which is half of the 4 MH/s I calculated earlier.
Does sgminer divide the Groestl hash count by 2? That would actually be more reasonable.
Or is something really wrong, so that it runs at half speed (exactly half speed)?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get the username from? What parameters should I use other than -k groestl and -d 1?

They probably removed it; I'm using an older version.
I run it like this for solo mining:

sgminer -k groestlcoin --difficulty-multiplier 0.0039062500 -o http://localhost:GROESTLCOIN_RPC_PORT -u YOURUSER -p YOURPASSWORD

Then you have to find and add your best intensity and worksize (my OS kernel works with 256 only).
username and password are set in groestlcoin.conf; the port you can easily find in their thread (or via netstat).
newbie
Activity: 32
Merit: 0
Unfortunately there is no --benchmark parameter. I checked the source code too, but found nothing similar: https://github.com/sgminer-dev/sgminer/blob/master/sgminer.c
Is there a simple way to run it? Now I have a groestl wallet, but where can I get the username from? What parameters should I use other than -k groestl and -d 1?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

There is a simple "benchmark" option:

--benchmark         Run sgminer in benchmark mode - produces no shares
newbie
Activity: 32
Merit: 0
I've now managed to build sgminer 5.1 on my system. I still have to make my kernel work with it.

Does sgminer have an offline 'diagnostic' mode, just for testing whether the kernel runs and how fast it runs?

"I'd need the linux version of the assembler."
Sorry, that's impossible. It isn't even written in C++, so it can't be compiled on any system other than Windows.

And to make things more complicated :D you have to compile with it for every type of GCN card, multiplied by every Catalyst driver that was altered by AMD developers. My compiler only patches the binary into the .elf; the actual .elf file is generated by the current Catalyst driver for the currently selected gfx card.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
25% seems like just recovering the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it only uses 123 VGPRs and that the code size is only 28KB? Or did it start to use scratch regs (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770, but I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instructions from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure it will bring the 3.48x speedup on the R9 290x too, because all the CUs can work independently using their own LDS, L1 cache and ICache. So if the current OCL code on the R9 290x runs at 20MH/s, then the latest asm code should run at 70MH/s.

25% compared to 14.6, it's 43% compared to 14.9.
No scratch reg use (when I triggered it a couple times, it slowed down to less than 1 Mh/s).
I'd like to try your asm code myself, but I'd need the linux version of the assembler.
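
The two percentages can be tied together with a little arithmetic. This is only a sketch, assuming both gains refer to the same new-kernel hashrate measured against the old kernel under each driver (the function name is illustrative):

```python
def implied_baseline_ratio(gain_vs_first: float, gain_vs_second: float) -> float:
    """Ratio of the second baseline to the first, given the same new result.

    new = first * (1 + gain_vs_first) = second * (1 + gain_vs_second)
    so second / first = (1 + gain_vs_first) / (1 + gain_vs_second).
    """
    return (1.0 + gain_vs_first) / (1.0 + gain_vs_second)


ratio = implied_baseline_ratio(0.25, 0.43)  # 14.6 baseline vs 14.9 baseline
print(f"old kernel on 14.9 runs at {ratio:.1%} of its 14.6 speed")
print(f"implied driver regression: {1.0 - ratio:.1%}")
```

Under that assumption, the old kernel lost roughly an eighth of its speed going from 14.6 to 14.9, and the new kernel more than makes up for it.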
newbie
Activity: 32
Merit: 0
25% seems like just recovering the loss that came with 14.9. I really thought you had it 3.5x faster.

Are you sure that it only uses 123 VGPRs and that the code size is only 28KB? Or did it start to use scratch regs (those are terribly slow)?

Unfortunately I can't try it on anything other than an HD7770, but I'd also like to see how it runs on faster systems. I uploaded it to the download area of my blog if someone wishes to try it. I'm not familiar with the latest GCN chips (I think AMD only improves their instructions from time to time, and maybe cuts down double precision performance), but with this particular program I'm pretty sure it will bring the 3.48x speedup on the R9 290x too, because all the CUs can work independently using their own LDS, L1 cache and ICache. So if the current OCL code on the R9 290x runs at 20MH/s, then the latest asm code should run at 70MH/s.
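
The 70MH/s figure is just the speedup applied to the baseline. As a sketch, assuming the 3.48x measured on the HD7770 carries over unchanged (which is the untested part):

```python
def projected_hashrate(baseline_mhs: float, speedup: float) -> float:
    """Scale a baseline hashrate by a measured speedup factor."""
    return baseline_mhs * speedup


# 20 MH/s OpenCL baseline on the R9 290x, 3.48x from the HD7770 measurement
print(f"{projected_hashrate(20.0, 3.48):.1f} MH/s")  # 69.6 MH/s
```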
member
Activity: 81
Merit: 1002
It was only the wind.
Hi, I think I'm done with the things I wanted to try. It's at 3.48x now :D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining so pls help me. Is it the popular sg-miner? Can I compile it with Qt5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really goes 70MH/s on a 290x beast.

You first need to have the target passed to the kernel.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
And you have the first/last round optimizations so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?

Unfortunately it's only about 25% faster, but we should compare apples to apples: could you try your code on hawaii chipset so we have a constant testbed?
Now I'm working on further first round optimizations, they bring little improvement but it's still worth imho.
newbie
Activity: 32
Merit: 0
And you have the first/last round optimizations so it must be faster!
If it's as fast as the asm version, then I don't have to deal with the kernel parameters, which is boring/painful. My asm was only needed to encourage you to shrink the code/regs. :D

Can you share the new source?
newbie
Activity: 32
Merit: 0
"On hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)"

Then it got all the goodies: vgprs, icache and 2ram+6lds reads. The speedup must be the same 3.5x! Is it that much?

It must be good on small cards too; the only important difference is the number of CUs anyway.
hero member
Activity: 630
Merit: 500
new optimized CL or a BIN? (I'll test on 280x and 7950).
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
On hawaii only, I've managed to get to 123 VGPRs and 28K ISA size, so now I have all the optimizations of the asm code :-)
I believe the asm version is still faster on hawaii, and of course much faster on smaller cards.
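
Why dropping below 128 VGPRs matters can be sketched with the usual GCN occupancy rule of thumb. The constants below are the commonly documented GCN figures (256 VGPRs per lane per SIMD, at most 10 wavefronts per SIMD, allocation in steps of 4) and are assumptions here, not something measured on this kernel; LDS and SGPR limits are ignored.

```python
VGPRS_PER_LANE = 256        # assumed VGPR budget per SIMD lane on GCN
MAX_WAVES_PER_SIMD = 10     # assumed hardware cap on resident wavefronts
VGPR_ALLOC_GRANULARITY = 4  # assumed allocation step


def waves_per_simd(vgprs_used: int) -> int:
    """Wavefronts that fit on one SIMD for a kernel using vgprs_used (>= 1)."""
    # Round the register count up to the allocation granularity.
    allocated = -(-vgprs_used // VGPR_ALLOC_GRANULARITY) * VGPR_ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // allocated)


for vgprs in (147, 128, 123):
    print(vgprs, "VGPRs ->", waves_per_simd(vgprs), "wave(s) per SIMD")
```

By this estimate a 147-VGPR kernel keeps one wavefront per SIMD, while 123 VGPRs allows two, which gives the hardware something to switch to while memory reads are in flight.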
hero member
Activity: 610
Merit: 500
and will be a version for windows :)
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Does your assembler support self-modifying code? ;) Then you can use the instruction cache as a precalc buffer as well. The advantage is that most GPUs can read from the instruction cache in parallel with the level 1 cache.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Hi, I think I'm done with the things I wanted to try. It's at 3.48x now :D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining so pls help me. Is it the popular sg-miner? Can I compile it with Qt5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really goes 70MH/s on a 290x beast.

Thanks for the update.
I've been using Linux only for many years now, so I can't help you on windows compiling; just know it's trivial to compile the miner on linux, it runs on a terminal so doesn't need qt.
About the software version, I prefer the good old sph-sgminer which is based on sgminer 4.1, (I modified it a bit)  but you can use the latest sgminer 5.X as well.
To test the kernel you can simply point it to a pool, printf the hash or whatever.
Back to my opencl effort, I've reduced the number of vgprs to 147 but I'm struggling to get past that.
newbie
Activity: 32
Merit: 0
(Oops, an important part was missing from my blog post; it's corrected now.)
newbie
Activity: 32
Merit: 0
Hi, I think I'm done with the things I wanted to try. It's at 3.48x now :D Check the second part of the optimizations: http://realhet.wordpress.com/
It's really cool how the ALU, the LDS and the L1 cache can cooperate on the same job.

Let's discuss how my kernel can be used in the miner program. I'm an absolute noob with mining so pls help me. Is it the popular sg-miner? Can I compile it with Qt5.3 with MSVC? Or maybe under Visual Studio Express? Do you have actual test vectors to test it? I want to make sure it calculates 100% correctly. And I can't wait to see if it really goes 70MH/s on a 290x beast.