Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 23. (Read 877859 times)

legendary
Activity: 1050
Merit: 1293
Huh?
Has anyone tested the hash power of the R9 Nano on Ethereum?

Sure

I have

26 MH/s stock

 Smiley

Pf, I expected more of an improvement compared to the 7970/280X.

You're not the only one...
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Wolf0, I'm curious whether you have tried that technique (splitting a hash across multiple work-items) on a kernel, and what the outcome was.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
No, it is the OPPOSITE of vector size. You don't get how the GPU is ACTUALLY supposed to solve issues, I don't think - it really doesn't fucking like large code size, or very complex problems in one work-item - you know this.

Vectors were profitable before because of the old architectures - VLIW based. GCN abolished hardware vectors, and instead made VGPRs 4 bytes. Why, you may ask? Occupancy! This way, if you need to work on a problem that can't be efficiently vectorized like that, you don't waste most of your VGPR.

But, but, but... mah parallelism! GCN has you covered - you just need to think of the shit differently. Instead of parallelizing in vectors, do it in work-items. To give you the cleanest example I've worked with demonstrating this (in X11), take Echo-512.

You have a 256 byte state which I'll now refer to as W. W can be represented as an array of 16 uint4s. If you're looking at the shitty darkcoin-mod.cl trying to visualize this, just look at the 64-bit W vars and imagine them as 32-bit, and an array. Now, if I was going to demonstrate this technique with Echo - I have an array of 4 uint4s. This is my W. To figure out which part of the hash you are, you can choose two ways: launch the kernel with throughput * 4, 1, 1 local size, or do throughput, 4, 1 local size. Since the latter is cleaner, I'll assume that notation: lid = get_local_id(0), and hashid = get_local_id(1).

If hashid is < 2 (i.e. 0 or 1) - we fill up W with (512, 0, 0, 0) (uint4, remember) over all four array indices. If hashid == 2, W becomes the input (input being 16 uints, this may be represented as 4 uint4s, as well), and if hashid == 3, we fill up W with the odds & ends - for X11, these are (0x80, 0, 0, 0) for W[0], (0, 0, 0, 0) for W[1], (0, 0, 0, 0x02000000) for W[2], and (512, 0, 0, 0) for W[3]. Now, go pull up darkcoin-mod.cl, and look at it until this and the previous paragraph make sense.
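
For readers following along, here is a minimal sketch of the fill described above. It is plain C rather than OpenCL so it can be compiled on the host. The `uint4_t` struct and the `echo_init_slice` name are assumptions made for illustration and do not come from darkcoin-mod.cl. Each call plays the role of one work-item, with `hashid` standing in for get_local_id(1).

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the OpenCL uint4 type (assumption: host-side illustration only). */
typedef struct { uint32_t x, y, z, w; } uint4_t;

/* Fill the four-uint4 slice of the Echo-512 state owned by one work-item.
 * hashid 0 and 1: all four entries are (512, 0, 0, 0)
 * hashid 2:       the 16-uint input, viewed as 4 uint4s
 * hashid 3:       the X11 "odds & ends" listed in the post */
static void echo_init_slice(uint4_t W[4], unsigned hashid, const uint32_t input[16])
{
    const uint4_t fill = { 512, 0, 0, 0 };
    const uint4_t zero = { 0, 0, 0, 0 };

    assert(hashid < 4);

    if (hashid < 2) {
        for (int i = 0; i < 4; i++)
            W[i] = fill;
    } else if (hashid == 2) {
        for (int i = 0; i < 4; i++) {
            W[i].x = input[4 * i + 0];
            W[i].y = input[4 * i + 1];
            W[i].z = input[4 * i + 2];
            W[i].w = input[4 * i + 3];
        }
    } else {
        W[0] = (uint4_t){ 0x80, 0, 0, 0 };
        W[1] = zero;
        W[2] = (uint4_t){ 0, 0, 0, 0x02000000 };
        W[3] = fill;
    }
}
```

In a real kernel the branch would be taken once per work-item, so each one only ever holds a quarter of the 256-byte state in registers.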

I'll continue with rounds and output calculation in another post in just a bit.

Thanks for the explanation: let me put it in simple words so you can easily understand if I got it or not :-)
Basically you are dividing the state into 4 parts which will be computed by different work-items. Fewer registers per work-item leads to more waves in flight and generally better parallelism.
Looking at echo, there is a good amount of work which can be done on a single slice of the input, but in this case:

#define BIG_SHIFT_ROWS   do { \
    SHIFT_ROW1(1, 5, 9, D); \
    SHIFT_ROW2(2, 6, A, E); \
    SHIFT_ROW3(3, 7, B, F); \
  } while (0)

there are overlaps. I assume you'll use LDS to communicate between work items, or maybe shuffle but that would need assembly.
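
To make the overlap concrete, here is a rough host-side C sketch of the exchange. It assumes the macro does the usual AES-style rotation (row r of the 4x4 state moves left by r columns) and that work-item h holds column h, i.e. words W[4h] to W[4h+3]. Both the layout and the `shift_rows_via_shared` name are illustrative assumptions, and each 128-bit word is shrunk to a single `uint32_t` tag to keep it short.

```c
#include <stdint.h>

#define SLICES 4  /* work-items cooperating on one hash (assumed) */
#define ROWS   4  /* state words held by each work-item (assumed) */

/* Simulate four work-items swapping state words through a shared scratch
 * array, the way LDS would be used in a kernel.
 * state[h][r] is row r of the column held by work-item h. */
static void shift_rows_via_shared(uint32_t state[SLICES][ROWS])
{
    uint32_t shared[SLICES][ROWS];  /* stands in for __local memory */

    /* Phase 1: every work-item publishes its own column. */
    for (int h = 0; h < SLICES; h++)
        for (int r = 0; r < ROWS; r++)
            shared[h][r] = state[h][r];

    /* A kernel would need barrier(CLK_LOCAL_MEM_FENCE) at this point. */

    /* Phase 2: every work-item pulls row r from the column r places over. */
    for (int h = 0; h < SLICES; h++)
        for (int r = 0; r < ROWS; r++)
            state[h][r] = shared[(h + r) % SLICES][r];
}
```

Row 0 never leaves its work-item, which is why only rows 1 to 3 show up in the macro. The other three rows each need a word held by a different work-item.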
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
That shows the importance of simd. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.

I don't know - I know one thing for certain now, though - Kachur's Blake was about the same speed as mine, his BMW, however, needed a bit of work. Mine improved overall X11 hash by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement in my BMW over his is much larger.)

EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly.) Skipping for now.

EDIT2: search2 may not have been fucked because of a difference in output, but in where the bloody constants are in global. For JH, I'm going to make an all new test kernel which takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer on that kernel.

Is search2 faster than yours? Or is it just SIMD?
Maybe Kachur has found a way to make AES-like algos better...
BTW I wouldn't mind a frankenbin if it's faster and stable ;-)

I can't tell - without straight up replacement of a kernel, I dunno if he's done some kind of fuckery with part of a hash in one kernel, and part in another, for example. What I suspect is SIMD has been cut into two parts (at least.)

Now, even if his Groestl is faster than mine, my current Groestl is outdated anyways. My R & D area has a bitsliced Groestl that I have not yet played too much with - parallelization using 4 work items like it's done in CUDA should be possible. I can drop to GCN ASM for ds_swizzle_b32 - limits me to a 4-way, as it's not a 32-way shuffle like CUDA, but it's enough for me. I've just got a lot to do atm - maybe there is something we could work on together... a Groestl, perhaps? If you could look at the code and see if you could split it over multiple work-items and use LDS for the data sharing, I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it?

SIMD: tonight I was thinking about it and slicing into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work.

GROESTL (and similar): I always had the idea that nvidia had to do the bitslice thing because shared memory was slower than on GCN; in fact nvidia bitsliced is on par with GCN LDS. As a logical consequence, I think that if bitslice on GCN is presumed to be slower than on nvidia, I wouldn't even try it.

You might not be looking at the big picture with Groestl - look at that fucking shitty amount of waves in flight you get due to LDS (ab)use.

That's an issue with <= tahiti only, hence why I hate optimizing for those chips ;-)

Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism like it's supposed to be. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use 256 for Groestl, and do 4 work-items per actual hash. Tune to taste.
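
The arithmetic in that suggestion can be sketched in a few lines of host-side C. The `work_map` struct and both function names are made up for illustration. Putting the four work-items of a hash in adjacent lanes is only one possible layout, assumed here because a 4-way swizzle works on neighbouring lanes.

```c
#include <assert.h>

/* Which hash a work-item belongs to, and which part of that hash it computes. */
typedef struct {
    unsigned hash;   /* index of the hash inside the workgroup */
    unsigned slice;  /* 0 .. items_per_hash - 1 */
} work_map;

/* Local work size needed for a given number of hashes per workgroup. */
static unsigned local_size_for(unsigned hashes_per_group, unsigned items_per_hash)
{
    return hashes_per_group * items_per_hash;
}

/* Map a flat local id to (hash, slice), keeping the slices of one hash
 * in consecutive lanes. */
static work_map map_work_item(unsigned flat_local_id, unsigned items_per_hash)
{
    work_map m;

    assert(items_per_hash > 0);
    m.hash  = flat_local_id / items_per_hash;
    m.slice = flat_local_id % items_per_hash;
    return m;
}
```

With 64 hashes per workgroup and 4 work-items per hash, `local_size_for(64, 4)` gives the 256 mentioned above, and local id 5 lands on hash 1, slice 1.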

I understand what you mean: it's like the good old cgminer "vector size". I will think about it.
Besides, I haven't worked on groestl for a long while, but on whirlpool and variants I can easily get 3 waves on >= hawaii.
It's a lighter job, I know, but I haven't had any interest in developing groestl recently.
full member
Activity: 229
Merit: 100
pallas & Wolf0:
when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl


 Grin




Oh, is that what you're reading?

Well, over here it looks like this: "这是我读一堆奇怪的迹象" (roughly: "this is me reading a bunch of strange signs")

:p

hahaha, yeah or that Smiley


Hey guys, what's the X11 hash rate of an R9 380 (4GB Sapphire)?

I'm getting 18-19 MH/s with a 390, so around 17 I guess?
sr. member
Activity: 582
Merit: 250
An Impressive Purely Anonymous Currency.
But with Xintensity at 1024 you get invalid shares, so lower hashrate on the pool. X 256 or 512 is the highest I can use..
Default was 640. I used this w/o a problem )))

Can you please share your config for the 7950 ?

Also slightly confused about the Kachur miner, as the NiceHash download is 5.2.1. Is there another link?
https://github.com/nicehash/NiceHashMiner/releases use sgminer-5-1-0-optimized
Code:
setx GPU_MAX_ALLOC_PERCENT 100
sgminer.exe --algorithm darkcoin-mod -o stratum+tcp://x11.eu.nicehash.com:3336 -u 1M948TedPdVkbk59TMnYJhtW5BFTXKPfFL -p d=0.04 -d 0 --xintensity 640 -g 1 -w 64 --gpu-memclock 1500 --gpu-engine 1100 -s 0 --expiry 10 --queue 0 --gpu-powertune 10 --keccak-unroll 0 --hamsi-expand-big 4 --gpu-fan 30-95 --temp-cutoff 95 --temp-overheat 90 --temp-target 75 --auto-fan

I tried those settings and many others, but I get a very low Hash rate and my CPU is going mental.

I only have a dual-core Intel G3220 @ 3 GHz. Does the optimised miner need extra CPU power, meaning my mining rig needs a CPU upgrade?
hero member
Activity: 505
Merit: 500
Can someone add the scrypt-jane algo to sgminer?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Wolf0, how fast do you think your public bins would run if you replaced SIMD with Kachur's version? Would it be faster than full Kachur? If yes, it might be the best kernel for public consumption.

I know it would be if I worked on it some - I tested my Echo implementation vs. Kachur's, mine is faster. I feel kinda let down.

That shows the importance of simd. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.
legendary
Activity: 1274
Merit: 1006
Still selling my Hawaii binary for myr-groestl.

Here the specs for your reference:

Optimised kernel: myriad-groestl (myr-groestl, groestl512 + sha256) for digibyte, myriad, saffroncoin, joincoin, trinity and others.
Speed: 63 Mh/s on r9 290x @1100/150. It is compatible with stock sgminer and includes free future upgrades.

PM for details.

Oh, I thought you were an opensource kind of guy. Guess I misread that.  Grin

Supporting opensource doesn't mean all my work must be open. I've spent a lot of time (and I mean thousands of hours) on opensource, supporting Linux (since kernel version 0.99), etc.
Oh well I shouldn't need to explain anything!

I can has? Full disclosure: I intend to disassemble it. But I won't make any results public.

Funny question: Wolf0, how old are you?

21, why?

No reason, just curious Smiley
So young, but a genius.

Keep up the good work, man.
sr. member
Activity: 582
Merit: 250
An Impressive Purely Anonymous Currency.
Still selling my Hawaii binary for myr-groestl.

Here the specs for your reference:

Optimised kernel: myriad-groestl (myr-groestl, groestl512 + sha256) for digibyte, myriad, saffroncoin, joincoin, trinity and others.
Speed: 63 Mh/s on r9 290x @1100/150. It is compatible with stock sgminer and includes free future upgrades.

PM for details.

Oh, I thought you were an opensource kind of guy. Guess I misread that.  Grin

Supporting opensource doesn't mean all my work must be open. I've spent a lot of time (and I mean thousands of hours) on opensource, supporting Linux (since kernel version 0.99), etc.
Oh well I shouldn't need to explain anything!

I can has? Full disclosure: I intend to disassemble it. But I won't make any results public.

Funny question: Wolf0, how old are you?

21, why?

I feel old...41 - lol