[ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 105.

nicehash

legendary

Activity: 885

Merit: 1006

NiceHash.com

Quote from: yudhistira on January 14, 2015, 12:43:47 AM

What's wrong with this? Please help.

Quote

setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1

sgminer.exe --kernel bitblock -o stratum+tcp://stratum.westhash.com:3336 -u 12tyqFRW384n27ytyM77edUignkpGDDbZ9 -p d=0.001 -I 18 --worksize 64 -g 2 --gpu-powertune 20 --gpu-engine 1130 --gpu-memclock 1500 --lookup-gap 2 --auto-fan --gpu-fan 40-70 --temp-cutoff 85 --temp-overheat 80

Port 3336 is for the X11 algorithm, so use "--kernel darkcoin-mod" instead of "--kernel bitblock".

yudhistira

full member

Activity: 347

Merit: 100

What's wrong with this? Please help.

Quote

setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1

sgminer.exe --kernel bitblock -o stratum+tcp://stratum.westhash.com:3336 -u 12tyqFRW384n27ytyM77edUignkpGDDbZ9 -p d=0.001 -I 18 --worksize 64 -g 2 --gpu-powertune 20 --gpu-engine 1130 --gpu-memclock 1500 --lookup-gap 2 --auto-fan --gpu-fan 40-70 --temp-cutoff 85 --temp-overheat 80

DragonSlayer

full member

Activity: 181

Merit: 100

I am getting an error while compiling in VS2013.

error LNK1104: cannot open file 'jansson.lib'

Anyone know what I am doing wrong?

Thanks

semajjames

hero member

Activity: 528

Merit: 500

Quote from: Zuikkis on December 27, 2014, 05:54:09 AM

Try memclock 1500, or if that is not stable, try 1250.

AMD GPUs have memory timings set at 125MHz intervals, with each range ending on a multiple of 125MHz. Like this:

875-1000
1001-1125
1126-1250
1251-1375
1376-1500
1501-1625

Mining is mostly about random access latency, not about sequential reads. You get the best random access speed at the high end of each range, like 1250 or 1500.

It's very likely that 1250MHz is faster than 1400MHz, because 1400MHz is so close to the start of its range and 1250 is exactly at the end of its range.

So these figures :-

875-1000
1001-1125
1126-1250
1251-1375
1376-1500
1501-1625

Are these only for a GPU with a stock engine of 875, or would these figures still be correct for a GPU with a stock engine of 947, etc.?
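The ranges in the quoted table follow a simple pattern: each one ends on a multiple of 125 MHz. Below is a minimal C sketch of that arithmetic, assuming the quoted table is accurate for the card in question. The helper names `strap_range_top` and `mhz_below_range_top` are made up for illustration and are not part of sgminer. The sketch treats an exact multiple of 125 as the top of its own range, so it does not reproduce the table's first row, which starts at 875.

```c
/* Hypothetical helper (not part of sgminer): top of the 125 MHz timing
 * range that contains the given memory clock, per the quoted table.
 * Rounds the clock up to the next multiple of 125. */
static unsigned strap_range_top(unsigned memclock_mhz)
{
    return ((memclock_mhz + 124u) / 125u) * 125u;
}

/* Distance from the clock to the top of its range.
 * 0 means the clock sits exactly at the end of a range. */
static unsigned mhz_below_range_top(unsigned memclock_mhz)
{
    return strap_range_top(memclock_mhz) - memclock_mhz;
}
```

With these helpers, 1400 maps to the range ending at 1500 and sits 100 MHz below its top, while 1250 and 1500 sit exactly at the top of their ranges, which matches the advice in the quote.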

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: semajjames on December 15, 2014, 03:01:03 PM

Quote from: djm34 on December 15, 2014, 02:45:39 PM

Quote from: semajjames on December 15, 2014, 02:06:41 PM

Is there some issue with the 290x on Lyra2RE? I can't get my 290x's to hash as fast as my 290 non-x's.

Can you give the numbers for both, as well as the settings?

290 non-x, Elpida: 1.488 Mh/s

   "gpu-engine" : "980",
   "gpu-memclock" : "1500",
   "xintensity" : "64",
   "nfactor" : "10",
   "worksize" : "64",
   "algorithm" : "Lyra2RE",

290x, Hynix: 1.450 Mh/s

   "gpu-engine" : "1070",
   "gpu-memclock" : "1400",
   "xintensity" : "64",
   "algorithm" : "Lyra2RE",
   "worksize" : "64",

I've had the clock speeds all over the place. It just seems like the 290x's don't want to leave 1.450 Mh/s,

and yet the 290's are happy to run faster more easily.

Drop core, raise memclk.

damm315er

sr. member

Activity: 539

Merit: 255

Quote from: cat77 on January 10, 2015, 12:10:35 PM

Very interesting. I get about 2% gain on 7950 and need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the gain difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KHs

My SMix call is a bit different, I simply put the sub-calls inline so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits nicer into the core and needs less swapping.

I have tried, unsuccessfully, to further streamline the SMix, but any other way I do it, it's either all HW errors or vastly slower. Any guidance here would be appreciated.

Code:

void SMix(ulong16 *X, __global ulong16 *V, bool flag)
{
    int i = 0;
    int idx;

    while (i^256)
    {
        V[i++] = X[0];
        V[i++] = X[1];
        neoscrypt_blkmix(X, flag);
    }
    do {
        idx = (( (uint *)X)[48] & 0x7F) << 1;
        X[0] ^= V[idx];
        X[1] ^= V[idx+1];
        neoscrypt_blkmix(X, flag);
    } while (i-=2);
}

I gotta wonder if you touched on what the bottleneck is in wolf's kernel for the 290/290x, and that's why the speed dropped..

cat77

newbie

Activity: 18

Merit: 0

Very interesting. I get about 2% gain on 7950 and need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the gain difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KHs

My SMix call is a bit different, I simply put the sub-calls inline so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits nicer into the core and needs less swapping.

I have tried, unsuccessfully, to further streamline the SMix, but any other way I do it, it's either all HW errors or vastly slower. Any guidance here would be appreciated.

Code:

void SMix(ulong16 *X, __global ulong16 *V, bool flag)
{
    int i = 0;
    int idx;

    while (i^256)
    {
        V[i++] = X[0];
        V[i++] = X[1];
        neoscrypt_blkmix(X, flag);
    }
    do {
        idx = (( (uint *)X)[48] & 0x7F) << 1;
        X[0] ^= V[idx];
        X[1] ^= V[idx+1];
        neoscrypt_blkmix(X, flag);
    } while (i-=2);
}
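The loop conditions in the SMix above are terse: `i^256` is nonzero exactly when `i != 256`, and `i-=2` counts back down to zero. Below is a minimal host-side C sketch that only counts iterations and checks the largest index the mix loop can produce. It does no hashing, and the function names are invented for illustration.

```c
/* Host-side model of the two loop conditions in the SMix above.
 * No kernel work is done here; only iterations are counted. */

/* First loop: runs while i != 256, with i advancing by 2 per pass
 * (the two V[i++] stores in the kernel). */
static int smix_store_iterations(void)
{
    int i = 0, n = 0;
    while (i ^ 256) {   /* same truth value as (i != 256) */
        i += 2;         /* stands in for the two V[i++] stores */
        n++;
    }
    return n;
}

/* Second loop: i enters at 256 (left there by the first loop) and the
 * do/while stops once (i -= 2) evaluates to 0. */
static int smix_mix_iterations(void)
{
    int i = 256, n = 0;
    do {
        n++;
    } while (i -= 2);
    return n;
}

/* Largest V index the mix loop can read: idx = (word & 0x7F) << 1,
 * and the second read uses idx + 1. */
static int smix_max_index(void)
{
    return (int)(((0xFFFFFFFFu & 0x7Fu) << 1) + 1u);
}
```

Both loops run 128 times, and the highest index read is 255, so the mix loop stays inside the 256 entries written by the store loop.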

damm315er

sr. member

Activity: 539

Merit: 255

Quote from: cat77 on January 09, 2015, 11:16:31 PM

.....This is worth 20KH/s on my 280X......from 343KHs to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....

change the XORBytesInPlace call from

Code:

XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);

to

Code:

XORBytesInPlace(B + bufidx, input, bufidx);

and change the function itself to perform some byte alignment checking

Code:

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
    switch(mod % 4)
    {
    case 0:
        #pragma unroll 2
        for(int i = 0; i < 4; i+=2)
        {
            ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
            ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
        }
        break;

    case 2:
        #pragma unroll 8
        for(int i = 0; i < 16; i+=2)
        {
            ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
            ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
        }
        break;

    default:
        #pragma unroll 8
        for(int i = 0; i < 31; i+=4)
        {
            ((uchar *)dst)[i] ^= ((uchar *)src)[i];
            ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
            ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
            ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
        }
    }
}

This actually drops the hashrate by 5 kh/s on the 290's, but combined with the bobben2 mod increases by about 9 kh/s.

YMMV

KL0nLutiy

member

Activity: 158

Merit: 10

Quote from: cat77 on January 09, 2015, 11:16:31 PM

.....This is worth 20KH/s on my 280X......from 343KHs to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....

change the XORBytesInPlace call from

Code:

XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);

to

Code:

XORBytesInPlace(B + bufidx, input, bufidx);

and change the function itself to perform some byte alignment checking

Code:

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
    switch(mod % 4)
    {
    case 0:
        #pragma unroll 2
        for(int i = 0; i < 4; i+=2)
        {
            ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
            ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
        }
        break;

    case 2:
        #pragma unroll 8
        for(int i = 0; i < 16; i+=2)
        {
            ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
            ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
        }
        break;

    default:
        #pragma unroll 8
        for(int i = 0; i < 31; i+=4)
        {
            ((uchar *)dst)[i] ^= ((uchar *)src)[i];
            ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
            ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
            ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
        }
    }
}

What settings do you use?

Eastwind

hero member

Activity: 896

Merit: 1000

Quote from: cat77 on January 09, 2015, 11:16:31 PM

.....This is worth 20KH/s on my 280X......from 343KHs to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....

change the XORBytesInPlace call from

Code:

XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);

to

Code:

XORBytesInPlace(B + bufidx, input, bufidx);

and change the function itself to perform some byte alignment checking

Code:

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
    switch(mod % 4)
    {
    case 0:
        #pragma unroll 2
        for(int i = 0; i < 4; i+=2)
        {
            ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
            ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
        }
        break;

    case 2:
        #pragma unroll 8
        for(int i = 0; i < 16; i+=2)
        {
            ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
            ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
        }
        break;

    default:
        #pragma unroll 8
        for(int i = 0; i < 31; i+=4)
        {
            ((uchar *)dst)[i] ^= ((uchar *)src)[i];
            ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
            ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
            ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
        }
    }
}

Did you change from the original kernel or after bobben2's change? Can you upload a revised kernel?

Wolf0

member

Activity: 81

Merit: 1002

It was only the wind.

Quote from: yudhistira on December 15, 2014, 01:15:14 PM

Thanks to Wolf0 for your .bin files.

Increase from 4.3 to 6.6 MH/s
X11, R9 280x

But profit is the same as it was at 4.3 MH/s;
maybe the diff rose like crazy.

Hopefully I can get a special .bin that doesn't cause a crazy diff ;)

Haha, that's called a bin that's not released. :P

cat77

newbie

Activity: 18

Merit: 0

.....This is worth 20KH/s on my 280X......from 343KHs to 363KH/s at 1020MHz clock
.....now somebody needs to find 20KH/s more for me....

change the XORBytesInPlace call from

Code:

XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);

to

Code:

XORBytesInPlace(B + bufidx, input, bufidx);

and change the function itself to perform some byte alignment checking

Code:

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
    switch(mod % 4)
    {
    case 0:
        #pragma unroll 2
        for(int i = 0; i < 4; i+=2)
        {
            ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
            ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
        }
        break;

    case 2:
        #pragma unroll 8
        for(int i = 0; i < 16; i+=2)
        {
            ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
            ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
        }
        break;

    default:
        #pragma unroll 8
        for(int i = 0; i < 31; i+=4)
        {
            ((uchar *)dst)[i] ^= ((uchar *)src)[i];
            ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
            ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
            ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
        }
    }
}
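All three branches of the function above cover the same number of bytes and differ only in element width. Below is a minimal host-side C sketch of that bookkeeping, using the OpenCL sizes of `uint2` (8 bytes), `uchar2` (2 bytes) and `uchar` (1 byte). The function names are invented for illustration, and nothing here performs the actual XOR.

```c
/* Element width in bytes picked for a given bufidx, mirroring the
 * switch (mod % 4) in the kernel function above. */
static unsigned xor_element_bytes(unsigned mod)
{
    switch (mod % 4) {
    case 0:  return 8;  /* uint2  */
    case 2:  return 2;  /* uchar2 */
    default: return 1;  /* uchar, taken when mod % 4 is 1 or 3 */
    }
}

/* Bytes covered by the loop in each branch, using the same loop bounds
 * and two or four elements per pass as the kernel code. */
static unsigned xor_bytes_covered(unsigned mod)
{
    unsigned bytes = 0;
    switch (mod % 4) {
    case 0:
        for (int i = 0; i < 4; i += 2)
            bytes += 2 * 8;     /* two uint2 elements per pass */
        break;
    case 2:
        for (int i = 0; i < 16; i += 2)
            bytes += 2 * 2;     /* two uchar2 elements per pass */
        break;
    default:
        for (int i = 0; i < 31; i += 4)
            bytes += 4 * 1;     /* four uchar elements per pass */
        break;
    }
    return bytes;
}
```

Every branch comes out at 32 bytes, so the switch changes how wide each access is, not how much data is XORed.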

damm315er

sr. member

Activity: 539

Merit: 255

Quote from: bobben2 on January 06, 2015, 12:55:29 PM

Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307Kh/s at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didnt try it on my R9 280x cards, so please post your results if you try this.

You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
By doing the change below you are reducing stalls/latency by not making read/writes to same/adjacent memory banks.

Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}

To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}

CORRECTION:

That made an 8 kh/s increase on my 290's.. from 341 to 349 kh/s. (dumb azz me, I forgot to delete the bin)
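The change quoted above only moves where the two halves of each scratchpad entry live: next to each other in the original layout (`idx << 1` and `(idx << 1) + 1`), or 128 slots apart in the modified one (`idx` and `idx + 128`). Below is a minimal host-side C sketch checking that both layouts use each of the 256 `ulong16` slots exactly once. It assumes `idx` runs over 0..127, as the "128 intermediate results" in the quote suggests, and the names are invented for illustration.

```c
#include <string.h>

enum { SLOTS = 256, ENTRIES = 128 };

/* Original layout: the two halves of entry idx sit in adjacent slots. */
static void interleaved_slots(unsigned idx, unsigned *out)
{
    out[0] = idx << 1;
    out[1] = (idx << 1) + 1;
}

/* Modified layout from the quote: the halves sit 128 slots apart. */
static void split_slots(unsigned idx, unsigned *out)
{
    out[0] = idx;
    out[1] = idx + 128;
}

/* Returns 1 if the layout touches every one of the 256 slots exactly
 * once for idx in 0..127 (assumed range), otherwise 0. */
static int covers_all_slots(void (*layout)(unsigned, unsigned *))
{
    unsigned char hits[SLOTS];
    memset(hits, 0, sizeof hits);
    for (unsigned idx = 0; idx < ENTRIES; idx++) {
        unsigned s[2];
        layout(idx, s);
        if (s[0] >= SLOTS || s[1] >= SLOTS)
            return 0;
        hits[s[0]]++;
        hits[s[1]]++;
    }
    for (unsigned k = 0; k < SLOTS; k++)
        if (hits[k] != 1)
            return 0;
    return 1;
}
```

Both layouts pass the check, so the two versions need the same scratchpad size. Any speed difference has to come from the access pattern, which is what the quoted post argues.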

tccd

newbie

Activity: 51

Merit: 0

Quote from: bobben2 on January 06, 2015, 12:55:29 PM

Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307Kh/s at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didnt try it on my R9 280x cards, so please post your results if you try this.

You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
By doing the change below you are reducing stalls/latency by not making read/writes to same/adjacent memory banks.

Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}

To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}

Not working well with Wolf0's Hawaii mod. Hash rate dropped from 339kh/s to 320kh/s.

KL0nLutiy

member

Activity: 158

Merit: 10

Quote from: bobben2 on January 06, 2015, 12:55:29 PM

Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307Kh/s at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didnt try it on my R9 280x cards, so please post your results if you try this.

You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
By doing the change below you are reducing stalls/latency by not making read/writes to same/adjacent memory banks.

Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}

To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}

thanks, increase from 317 to 324 on 290x

Eastwind

hero member

Activity: 896

Merit: 1000

Quote from: ?? on ??

Quote from: bobben2 on January 06, 2015, 03:08:48 PM

Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem. 550Watts at the wall
Orig neoscrypt kernel (Kh/s)
  301
  296
  287
My "improved" kernel
  295
  289
  276
Yikes! I got worse performance on the 280X!
Sorry guys. This "improvement", as it stands, seems to apply to the 290 only.

+1
on my 280x i get 308 with "improved" vs 317 before

The drop is about 5% compared to the old kernel on the 7970. Maybe this "improved kernel" works only on the 290, which has larger memory and more cores.

damm315er

sr. member

Activity: 539

Merit: 255

So wolf0 isn't the only one with a kernel mod that works better on the 290's than anything else..

JuanHungLo

hero member

Activity: 935

Merit: 1001

I don't always drink...

I concur. After further testing with suggested settings I was getting HW errors. Back to the drawing board...

bobben2

full member

Activity: 279

Merit: 104

Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem. 550Watts at the wall
Orig neoscrypt kernel (Kh/s)
301
296
287
My "improved" kernel
295
289
276
Yikes! I got worse performance on the 280X!
Sorry guys. This "improvement", as it stands, seems to apply to the 290 only.

Zuikkis

newbie

Activity: 57

Merit: 0

Try removing thread-concurrency from the config, so sgminer can calculate it from the xintensity.

Edit: Oh, and worksize 128 or even 64 is probably faster.
