
Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 23. (Read 877844 times)

legendary
Activity: 1050
Merit: 1293
Huh?
Has anyone tested the hash power of the R9 Nano on Ethereum?

Sure

I have

26 MH/s stock

 Smiley

Pff, I expected more of an improvement compared to the 7970/280X

You're not the only one...
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Wolf0, I'm curious to know whether you tried that technique (splitting across multiple work-items) on a kernel, and what the outcome was.
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
No, it is the OPPOSITE of vector size. You don't get how the GPU is ACTUALLY supposed to solve problems, I don't think - it really doesn't like large code size, or very complex work in a single work-item - you know this.

Vectors were profitable before because of the old architectures - VLIW-based. GCN abolished hardware vectors, and instead made VGPRs 4 bytes. Why, you may ask? Occupancy! This way, if you need to work on a problem that can't be efficiently vectorized like that, you don't waste most of your VGPRs.

But, but, but... mah parallelism! GCN has you covered - you just need to think about it differently. Instead of parallelizing in vectors, do it across work-items. To give the cleanest example I've worked with demonstrating this (in X11), take Echo-512.

You have a 256-byte state, which I'll refer to as W. W can be represented as an array of 16 uint4s. If you're looking at the shitty darkcoin-mod.cl trying to visualize this, just look at the 64-bit W vars and imagine them as 32-bit, and as an array. Now, if I were going to demonstrate this technique with Echo, each work-item gets an array of 4 uint4s - this is its W. To figure out which part of the hash you are, you have two choices: launch the kernel with a local size of (throughput * 4, 1, 1), or (throughput, 4, 1). Since the latter is cleaner, I'll assume that notation: lid = get_local_id(0), and hashid = get_local_id(1).

If hashid is < 2 (i.e. 0 or 1), we fill up W with (512, 0, 0, 0) (uint4, remember) over all four array indices. If hashid == 2, W becomes the input (the input being 16 uints, it may be represented as 4 uint4s as well), and if hashid == 3, we fill up W with the odds and ends - for X11, these are (0x80, 0, 0, 0) for W[0], (0, 0, 0, 0) for W[1], (0, 0, 0, 0x02000000) for W[2], and (512, 0, 0, 0) for W[3]. Now, go pull up darkcoin-mod.cl, and look at it until this and the previous paragraph make sense.
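A plain-C sketch of that per-work-item setup (host-side, just to check the layout; `uint4_t` and `init_slice` are illustrative stand-ins, not code from darkcoin-mod.cl - on the GPU each hashid branch would run in its own work-item):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for OpenCL's uint4: four 32-bit words. */
typedef struct { uint32_t s[4]; } uint4_t;

/* W is this work-item's quarter of the 256-byte Echo state: 4 uint4s.
 * input is the 64-byte message block as 16 uints. */
static void init_slice(uint4_t W[4], int hashid, const uint32_t input[16])
{
    if (hashid < 2) {
        /* hashid 0 and 1: every entry is (512, 0, 0, 0) */
        for (int i = 0; i < 4; ++i)
            W[i] = (uint4_t){{ 512, 0, 0, 0 }};
    } else if (hashid == 2) {
        /* hashid 2: the input block itself, viewed as 4 uint4s */
        memcpy(W, input, 16 * sizeof(uint32_t));
    } else {
        /* hashid 3: the padding odds and ends for X11's 64-byte input */
        W[0] = (uint4_t){{ 0x80, 0, 0, 0 }};
        W[1] = (uint4_t){{ 0, 0, 0, 0 }};
        W[2] = (uint4_t){{ 0, 0, 0, 0x02000000 }};
        W[3] = (uint4_t){{ 512, 0, 0, 0 }};
    }
}
```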

I'll continue with rounds and output calculation in another post in just a bit.

Thanks for the explanation; let me put it in simple words so you can easily tell whether I got it or not :-)
Basically, you are dividing the state into 4 parts which will be computed by different work-items. Fewer regs per work-item leads to more waves and generally better parallelism.
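The occupancy arithmetic behind "fewer regs, more waves" can be sketched in plain C. The figures are the usual GCN numbers (a 64 KB register file per SIMD gives 256 32-bit VGPRs per lane, with a hard cap of 10 waves per SIMD); the 4-register allocation granularity is a simplifying assumption that varies by chip, so treat this as an estimate:

```c
#include <assert.h>

/* Estimate resident wavefronts per SIMD as a function of VGPRs used.
 * 256 VGPRs per lane are shared by all resident waves; at most 10
 * waves can be resident regardless of register pressure. */
static int waves_per_simd(int vgprs_used)
{
    int granule = 4; /* assumed allocation granularity */
    int alloc = ((vgprs_used + granule - 1) / granule) * granule;
    int waves = 256 / alloc;
    return waves > 10 ? 10 : waves;
}
```

So a kernel squeezed down from 128 to 84 VGPRs goes from 2 to 3 waves in flight under this model.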
Looking at Echo, there is a good amount of work which can be done on a single slice of the input, but in this case:

#define BIG_SHIFT_ROWS   do { \
    SHIFT_ROW1(1, 5, 9, D); \
    SHIFT_ROW2(2, 6, A, E); \
    SHIFT_ROW3(3, 7, B, F); \
  } while (0)

there are overlaps. I assume you'll use LDS to communicate between work-items, or maybe a shuffle, but that would need assembly.
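To see where those overlaps come from, here is a plain-C model of Echo's shift-rows step, assuming the 4x4 state is stored as a flat 16-entry array indexed row + 4*col (so row 1 lives at 1, 5, 9, 0xD, matching SHIFT_ROW1(1, 5, 9, D)) and each work-item holds one column; the function names are illustrative. Row r rotates left by r columns, which is exactly the cross-column data movement that forces the slices to exchange data:

```c
#include <assert.h>
#include <stdint.h>

/* Rotate row r of the 4x4 state left by r columns. */
static void shift_row(uint32_t s[16], int r)
{
    uint32_t row[4];
    for (int c = 0; c < 4; ++c)
        row[c] = s[r + 4 * c];
    for (int c = 0; c < 4; ++c)
        s[r + 4 * c] = row[(c + r) % 4]; /* column c takes old column c+r */
}

/* Row 0 is untouched; rows 1..3 rotate by 1..3, as in BIG_SHIFT_ROWS. */
static void big_shift_rows(uint32_t s[16])
{
    for (int r = 1; r < 4; ++r)
        shift_row(s, r);
}
```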
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
That shows the importance of SIMD. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.

I don't know - I know one thing for certain now, though: Kachur's Blake was about the same speed as mine; his BMW, however, needed a bit of work. Mine improved the overall X11 hash rate by 1.35% (remember that BMW itself is quite a small part of X11, so the raw improvement of my BMW over his is much larger).

EDIT: search2, originally Groestl-512, did not take to a simple kernel replacement and will have to be investigated further (manual study of the disassembly). Skipping it for now.

EDIT2: search2 may not have been fucked because of a difference in output, but because of where the bloody constants sit in global memory. For JH, I'm going to make an all-new test kernel that takes a read-only buffer for JH's constants rather than trying to reference constant memory right now. Simpler. I should then be able to put that in place of the Kachur JH and modify SGMiner to pass a constant buffer to that kernel.

Is his search2 faster than yours, or is it just SIMD?
Maybe Kachur has found a way to make AES-like algos better...
BTW I wouldn't mind a frankenbin if it's faster and stable ;-)

I can't tell - without a straight-up replacement of a kernel, I dunno if he's done some kind of fuckery with part of a hash in one kernel and part in another, for example. What I suspect is that SIMD has been cut into two parts (at least).

Now, even if his Groestl is faster than mine, my current Groestl is outdated anyway. My R&D area has a bitsliced Groestl that I have not yet played with much - parallelization using 4 work-items, like it's done in CUDA, should be possible. I can drop to GCN ASM for ds_swizzle_b32 - that limits me to a 4-way exchange, as it's not a 32-way shuffle like CUDA's, but it's enough for me. I've just got a lot to do atm - maybe there is something we could work on together... a Groestl, perhaps? If you could look at the code and see if you could split it over multiple work-items, using LDS for the data sharing, I could probably remove said LDS usage by disassembling and modifying the kernel before reassembling it.
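For reference, a plain-C stand-in for the 4-way exchange such a quad swizzle would provide (ds_swizzle_b32's quad modes permute within groups of 4 lanes, unlike CUDA's cross-warp shuffle; on hardware each array slot here would be a separate work-item's register, and `quad_rotate` is an illustrative name, not a real intrinsic):

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 4-lane rotate: lane i receives the value held by
 * lane (i + by) % 4. On GCN this would be a single LDS-unit
 * swizzle instruction, with no actual LDS storage used. */
static void quad_rotate(uint32_t lane[4], int by)
{
    uint32_t tmp[4];
    for (int i = 0; i < 4; ++i)
        tmp[i] = lane[(i + by) & 3];
    for (int i = 0; i < 4; ++i)
        lane[i] = tmp[i];
}
```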

SIMD: tonight I was thinking about it, and slicing it into two parts is the natural way of doing it; I think I could try that. The only little annoyance is that the data to be passed between the (at least two) parts won't just be a hash but a bigger set of data, so the standard sgminer searchX() system wouldn't work.

GROESTL (and similar): I always assumed Nvidia had to do the bitslice thing because its shared memory was slower than GCN's; in fact, Nvidia bitsliced is only on par with GCN LDS. As a logical consequence, if bitslicing on GCN is presumed to be slower than on Nvidia, I wouldn't even try it.

You might not be looking at the big picture with Groestl - look at the shitty number of waves in flight you get due to LDS (ab)use.

That's an issue with <= Tahiti only, which is why I hate optimizing for those chips ;-)

Not the case - two waves in flight, and your kernel is STILL not actually using the GPU's parallelism as it's supposed to. One Groestl-512 hash is a big job, and it's parallelizable. If you're doing a throughput of 64 hashes per local workgroup, then use a local size of 256 for Groestl, and do 4 work-items per actual hash. Tune to taste.
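A minimal sketch of that mapping, assuming a flat 256-wide local size (64 hashes x 4 lanes; the names are illustrative):

```c
#include <assert.h>

/* With a 1-D local size of 256 covering 64 hashes, the work-item ->
 * hash mapping is a divide/modulo on the local id. With a 2-D
 * (64, 4) local size it would simply be get_local_id(0) and
 * get_local_id(1) respectively. */
#define LANES_PER_HASH 4

static int hash_of(int lid) { return lid / LANES_PER_HASH; }
static int lane_of(int lid) { return lid % LANES_PER_HASH; }
```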

I understand what you mean: it's like the good old cgminer "vector size". I will think about it.
Besides, I haven't worked on Groestl for a long while, but on Whirlpool and variants I can easily get 3 waves on >= Hawaii.
It's a lighter job, I know, but I haven't had any interest in developing Groestl recently.
full member
Activity: 229
Merit: 100
pallas & Wolf0:
when I read your (coding 'n shit) dialog, here is what I read: kjndcknsdvnlsdnvlsvlsdlkvlksdmvlmskcnsjvnljsnvljsnvlsdclkslvjhsdljvnlxnvcdnvl


 Grin




Oh, is that what you're reading?

Well, over here it looks like this: "这是我读一堆奇怪的迹象" (roughly, "this is me reading a pile of strange signs")

:p

hahaha, yeah or that Smiley


Hey guys, what's the X11 hash rate with an R9 380 (4GB Sapphire)?

I'm getting 18-19 MH/s with a 390, so around 17 I guess?
sr. member
Activity: 582
Merit: 250
An Impressive Purely Anonymous Currency.
But with xintensity at 1024 you get invalid shares, so a lower hash rate on the pool. 256 or 512 is the highest I can use...
The default was 640. I used this without a problem )))

Can you please share your config for the 7950 ?

Also slightly confused about the Kachur miner, as the NiceHash download is 5.2.1 - is there another link?
https://github.com/nicehash/NiceHashMiner/releases use sgminer-5-1-0-optimized
Code:
setx GPU_MAX_ALLOC_PERCENT 100
sgminer.exe --algorithm darkcoin-mod -o stratum+tcp://x11.eu.nicehash.com:3336 -u 1M948TedPdVkbk59TMnYJhtW5BFTXKPfFL -p d=0.04 -d 0 --xintensity 640 -g 1 -w 64 --gpu-memclock 1500 --gpu-engine 1100 -s 0 --expiry 10 --queue 0 --gpu-powertune 10 --keccak-unroll 0 --hamsi-expand-big 4 --gpu-fan 30-95 --temp-cutoff 95 --temp-overheat 90 --temp-target 75 --auto-fan

I tried those settings and many others, but I get a very low hash rate and my CPU is going mental.

I only have a dual-core Intel G3220 @ 3 GHz - does the optimised miner require extra CPU, and hence does my mining rig need a CPU upgrade?
hero member
Activity: 505
Merit: 500
Can someone add the scrypt-jane algo to sgminer?
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Wolf0, how fast do you think your public bins would run if you replaced SIMD with the Kachur version? Would it be faster than full Kachur? If so, it might be the best kernel for public consumption.

I know it would be if I worked on it some - I tested my Echo implementation vs. Kachur's, and mine is faster. I feel kinda let down.

That shows the importance of SIMD. Now you know where to work ;-)
Thanks for the information, please keep us updated on the progress!
Please tell me if I can be of any help.
legendary
Activity: 1274
Merit: 1006
Still selling my Hawaii binary for myr-groestl.

Here are the specs for your reference:

Optimised kernel: myriad-groestl (myr-groestl, Groestl-512 + SHA-256) for DigiByte, Myriad, Saffroncoin, Joincoin, Trinity and others.
Speed: 63 MH/s on an R9 290X @ 1100/150. It is compatible with stock sgminer and includes free future upgrades.

PM for details.

Oh, I thought you were an open-source kind of guy. Guess I misread that.  Grin

Supporting open source doesn't mean all my work must be open. I've spent a lot of time (and I mean thousands of hours) on open source, supporting Linux (since kernel version 0.99), etc.
Oh well, I shouldn't need to explain anything!

I can has? Full disclosure: I intend to disassemble it. But I won't make any results public.

Funny question: Wolf0, how old are you?

21, why?

No reason, just curious Smiley
So young, but a genius.

Keep the good work man.
sr. member
Activity: 582
Merit: 250
An Impressive Purely Anonymous Currency.
Funny question: Wolf0, how old are you?

21, why?

I feel old...41 - lol