Topic: [ANN] sgminer v5 - optimized X11/X13/NeoScrypt/Lyra2RE/etc. kernel-switch miner - page 105. (Read 877889 times)

legendary
Activity: 885
Merit: 1006
NiceHash.com
What's wrong with this? Please help.

Quote
setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1

sgminer.exe --kernel bitblock -o stratum+tcp://stratum.westhash.com:3336 -u 12tyqFRW384n27ytyM77edUignkpGDDbZ9 -p d=0.001 -I 18 --worksize 64 -g 2 --gpu-powertune 20 --gpu-engine 1130 --gpu-memclock 1500 --lookup-gap 2 --auto-fan --gpu-fan 40-70 --temp-cutoff 85 --temp-overheat 80

Port 3336 is for the X11 algorithm; use "--kernel darkcoin-mod" instead of "--kernel bitblock".
full member
Activity: 347
Merit: 100
What's wrong with this? Please help.

Quote
setx GPU_MAX_ALLOC_PERCENT 100
setx GPU_USE_SYNC_OBJECTS 1

sgminer.exe --kernel bitblock -o stratum+tcp://stratum.westhash.com:3336 -u 12tyqFRW384n27ytyM77edUignkpGDDbZ9 -p d=0.001 -I 18 --worksize 64 -g 2 --gpu-powertune 20 --gpu-engine 1130 --gpu-memclock 1500 --lookup-gap 2 --auto-fan --gpu-fan 40-70 --temp-cutoff 85 --temp-overheat 80

full member
Activity: 181
Merit: 100
I am getting an error while compiling in VS2013.

error LNK1104: cannot open file 'jansson.lib'

Anyone know what I am doing wrong?

Thanks
hero member
Activity: 528
Merit: 500
Try memclock 1500, or if that is not stable, try 1250.

AMD GPUs have memory timings set at 125 MHz intervals, with each range ending on an exact multiple of 125 MHz. Like this:

875-1000
1001-1125
1126-1250
1251-1375
1376-1500
1501-1625

Mining is mostly about random access latency, not about sequential reads. You get the best random access speed at the high end of each range, like 1250 or 1500.

It's very likely that 1250 MHz is faster than 1400 MHz, because 1400 MHz is so close to the start of its range and 1250 MHz is exactly at the end of its range.

So these figures:

875-1000
1001-1125
1126-1250
1251-1375
1376-1500
1501-1625

Are these only for a GPU with a stock engine clock of 875, or would these figures still be correct for a GPU with a stock engine clock of 947, etc.?
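For anyone who wants to check where a given memclock falls, here is a minimal C sketch of the range arithmetic described above. It only assumes what the list shows, namely that each range ends on a multiple of 125 MHz; the function names are made up for illustration. Note that the first listed range starts at 875 rather than 876, so the sketch follows the pattern of the other rows.

```c
/* Width of each memory-timing range in MHz, as described in the post above. */
#define STRAP_WIDTH_MHZ 125

/* Top of the 125 MHz range containing memclock_mhz (e.g. 1400 -> 1500). */
static int strap_top(int memclock_mhz)
{
    return ((memclock_mhz + STRAP_WIDTH_MHZ - 1) / STRAP_WIDTH_MHZ) * STRAP_WIDTH_MHZ;
}

/* Bottom of the same range (e.g. 1400 -> 1376). */
static int strap_bottom(int memclock_mhz)
{
    return strap_top(memclock_mhz) - STRAP_WIDTH_MHZ + 1;
}

/* Distance in MHz from memclock_mhz up to the top of its range. */
static int mhz_below_top(int memclock_mhz)
{
    return strap_top(memclock_mhz) - memclock_mhz;
}
```

By this arithmetic 1400 sits in 1376-1500 and is 100 MHz short of the top, while 1250 and 1500 are exactly at the top of their ranges, which is the case the post recommends.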
member
Activity: 81
Merit: 1002
It was only the wind.
Is there some issue with the 290X on Lyra2RE? I can't get my 290Xs to hash as fast as my non-X 290s.
Can you give the numbers for both, as well as the settings?

290 non-X, Elpida: 1.488 Mh/s

          "gpu-engine" : "980",
          "gpu-memclock" : "1500",
          "xintensity" : "64",
          "nfactor" : "10",
          "worksize" : "64",
          "algorithm" : "Lyra2RE",

290X, Hynix: 1.450 Mh/s

          "gpu-engine" : "1070",
          "gpu-memclock" : "1400",
          "xintensity" : "64",
          "algorithm" : "Lyra2RE",
          "worksize" : "64",


I've had the clock speeds all over the place; it just seems like the 290Xs don't want to leave 1.450 Mh/s,

and yet the 290s are happy to run faster more easily.


Drop core, raise memclk.
sr. member
Activity: 539
Merit: 255
Very interesting. I get about a 2% gain on the 7950, and I need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KH/s.

My SMix call is a bit different: I simply put the sub-calls inline so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits more nicely into the core and needs less swapping.

I have tried, unsuccessfully, to further streamline the SMix, but any other way I do it, it's either all HW errors or vastly slower. Any guidance here would be appreciated.

Code:
void SMix(ulong16 *X, __global ulong16 *V, bool flag)
{
  int i = 0;
  int idx;

  // Fill phase: 128 passes, each storing both halves of X in V[i] and V[i+1].
  while (i ^ 256)   // same as (i != 256)
  {
    V[i++] = X[0];
    V[i++] = X[1];
    neoscrypt_blkmix(X, flag);
  }
  // Mix phase: 128 passes; i counts down from 256 to 0 in steps of 2.
  do {
    idx = (((uint *)X)[48] & 0x7F) << 1;   // even slot, 0..254
    X[0] ^= V[idx];
    X[1] ^= V[idx + 1];
    neoscrypt_blkmix(X, flag);
  } while (i -= 2);
}
 

I gotta wonder if you touched on what the bottleneck is in wolf's kernel for the 290/290x, and that's why the speed dropped.
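The loop conditions in the quoted SMix are a bit terse, so here is a small host-side C sketch that only counts passes and computes the slot index. It assumes nothing beyond what the quoted code shows; the function names are hypothetical and neoscrypt_blkmix is left out entirely.

```c
#include <stdint.h>

/* Passes made by the fill loop: i starts at 0 and advances by 2 until i == 256. */
static int fill_passes(void)
{
    int i = 0, passes = 0;
    while (i ^ 256) {      /* the XOR is zero only when i == 256 */
        i += 2;            /* stands in for the two V[i++] stores */
        passes++;
    }
    return passes;
}

/* Passes made by the mix loop: i starts at 256 and drops by 2 until it hits 0. */
static int mix_passes(void)
{
    int i = 256, passes = 0;
    do {
        passes++;
    } while (i -= 2);
    return passes;
}

/* Scratchpad slot chosen from a 32-bit word: always even, in 0..254. */
static unsigned scratch_slot(uint32_t word)
{
    return (unsigned)(word & 0x7F) << 1;
}
```

Both loops run 128 times, so any streamlined variant has to keep the same pass counts and the same even-slot indexing, or the results will no longer match.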
newbie
Activity: 18
Merit: 0
Very interesting. I get about a 2% gain on the 7950, and I need to use (mod % 2) with the case statements adjusted accordingly.
My 280X gains almost 6% as is, but the difference between (mod % 2) and (mod % 4) is pretty small, like 1-2 KH/s.

My SMix call is a bit different: I simply put the sub-calls inline so it doesn't bother with ScratchpadStore and ScratchpadMix.
Perhaps this fits more nicely into the core and needs less swapping.

I have tried, unsuccessfully, to further streamline the SMix, but any other way I do it, it's either all HW errors or vastly slower. Any guidance here would be appreciated.

Code:
void SMix(ulong16 *X, __global ulong16 *V, bool flag)
{
  int i = 0;
  int idx;

  // Fill phase: 128 passes, each storing both halves of X in V[i] and V[i+1].
  while (i ^ 256)   // same as (i != 256)
  {
    V[i++] = X[0];
    V[i++] = X[1];
    neoscrypt_blkmix(X, flag);
  }
  // Mix phase: 128 passes; i counts down from 256 to 0 in steps of 2.
  do {
    idx = (((uint *)X)[48] & 0x7F) << 1;   // even slot, 0..254
    X[0] ^= V[idx];
    X[1] ^= V[idx + 1];
    neoscrypt_blkmix(X, flag);
  } while (i -= 2);
}
 

sr. member
Activity: 539
Merit: 255
This is worth 20 KH/s on my 280X: from 343 KH/s to 363 KH/s at 1020 MHz clock.
Now somebody needs to find 20 KH/s more for me... :)

Change the XORBytesInPlace call from
Code:
XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);
to
Code:
XORBytesInPlace(B + bufidx, input, bufidx);
and change the function itself to perform some byte alignment checking:
Code:
//
// A bit of byte alignment checking goes a long way...
// mod is bufidx from the call site; every branch XORs a fixed 32 bytes.
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
  switch(mod % 4)
  {
  case 0:
    // Offset is a multiple of 4: four 8-byte (uint2) XORs.
    // Note: uint2 is nominally 8-byte aligned, so this relies on the
    // hardware tolerating 4-byte-aligned accesses.
    #pragma unroll 2
    for(int i = 0; i < 4; i += 2)
    {
      ((uint2 *)dst)[i]   ^= ((uint2 *)src)[i];
      ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
    }
    break;

  case 2:
    // Offset is even but not a multiple of 4: sixteen 2-byte (uchar2) XORs.
    #pragma unroll 8
    for(int i = 0; i < 16; i += 2)
    {
      ((uchar2 *)dst)[i]   ^= ((uchar2 *)src)[i];
      ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
    }
    break;

  default:
    // Odd offset: plain bytes, four per pass.
    #pragma unroll 8
    for(int i = 0; i < 32; i += 4)
    {
      ((uchar *)dst)[i]   ^= ((uchar *)src)[i];
      ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
      ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
      ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
    }
  }
}


This actually drops the hashrate by 5 kh/s on the 290s, but combined with the bobben2 mod it increases by about 9 kh/s.

YMMV
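For anyone who wants to sanity-check the alignment-dispatch idea off the GPU, here is a minimal host-side C sketch. It is not the kernel code: the names xor32_bytes, xor32_words, xor32_dispatch and xor32_paths_agree are made up for illustration, the 32-byte length is an assumption read off the loop bounds in the quoted function, and memcpy is used for the word path so the host code stays well-defined at any alignment.

```c
#include <stdint.h>
#include <string.h>

#define XOR_LEN 32  /* bytes per call; assumed from the quoted loop bounds */

/* Reference path: one byte at a time, valid for any alignment. */
static void xor32_bytes(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < XOR_LEN; i++)
        dst[i] ^= src[i];
}

/* Word path: eight 32-bit XORs, loaded and stored through memcpy. */
static void xor32_words(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < XOR_LEN; i += 4) {
        uint32_t d, s;
        memcpy(&d, dst + i, sizeof d);
        memcpy(&s, src + i, sizeof s);
        d ^= s;
        memcpy(dst + i, &d, sizeof d);
    }
}

/* Pick a path from the byte offset, mirroring the switch(mod % 4) idea. */
static void xor32_dispatch(uint8_t *dst, const uint8_t *src, unsigned offset)
{
    if (offset % 4 == 0)
        xor32_words(dst, src);
    else
        xor32_bytes(dst, src);
}

/* Returns 1 if the dispatched path and the byte path give identical buffers. */
static int xor32_paths_agree(unsigned offset)
{
    uint8_t a[XOR_LEN + 8], b[XOR_LEN + 8], src[XOR_LEN];
    if (offset > 8)
        return 0;  /* keep the write inside the test buffers */
    for (int i = 0; i < XOR_LEN + 8; i++)
        a[i] = b[i] = (uint8_t)(i * 37 + 11);
    for (int i = 0; i < XOR_LEN; i++)
        src[i] = (uint8_t)(i * 101 + 3);
    xor32_dispatch(a + offset, src, offset);
    xor32_bytes(b + offset, src);
    return memcmp(a, b, sizeof a) == 0;
}
```

The point of the check is that every branch has to produce the same 32 bytes as the plain byte loop; if a modified kernel branch does not, that would be one plausible source of HW errors.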
member
Activity: 158
Merit: 10
What settings do you use?
hero member
Activity: 896
Merit: 1000
Did you change from the original kernel or after boben2's change? Can you upload a revised kernel?
member
Activity: 81
Merit: 1002
It was only the wind.
Thanks to Wolf0 for your .bin files.

Increase from 4.3 to 6.6 MH/s
X11, R9 280X

But the profit is the same as it was at 4.3 MH/s.
Maybe the diff rose like crazy.

Hopefully we get a special .bin that doesn't make a crazy diff ;)

Haha, that's called a bin that's not released. :P
newbie
Activity: 18
Merit: 0
This is worth 20 KH/s on my 280X: from 343 KH/s to 363 KH/s at 1020 MHz clock.
Now somebody needs to find 20 KH/s more for me... :)

Change the XORBytesInPlace call from
Code:
XORBytesInPlace(B + bufidx, input, BLAKE2S_OUT_SIZE);
to
Code:
XORBytesInPlace(B + bufidx, input, bufidx);
and change the function itself to perform some byte alignment checking:
Code:
//
// A bit of byte alignment checking goes a long way...
// mod is bufidx from the call site; every branch XORs a fixed 32 bytes.
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
  switch(mod % 4)
  {
  case 0:
    // Offset is a multiple of 4: four 8-byte (uint2) XORs.
    // Note: uint2 is nominally 8-byte aligned, so this relies on the
    // hardware tolerating 4-byte-aligned accesses.
    #pragma unroll 2
    for(int i = 0; i < 4; i += 2)
    {
      ((uint2 *)dst)[i]   ^= ((uint2 *)src)[i];
      ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
    }
    break;

  case 2:
    // Offset is even but not a multiple of 4: sixteen 2-byte (uchar2) XORs.
    #pragma unroll 8
    for(int i = 0; i < 16; i += 2)
    {
      ((uchar2 *)dst)[i]   ^= ((uchar2 *)src)[i];
      ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
    }
    break;

  default:
    // Odd offset: plain bytes, four per pass.
    #pragma unroll 8
    for(int i = 0; i < 32; i += 4)
    {
      ((uchar *)dst)[i]   ^= ((uchar *)src)[i];
      ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
      ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
      ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
    }
  }
}
sr. member
Activity: 539
Merit: 255
Here is a small neoscrypt kernel improvement for free, since I am mostly doing X11 anyway.
It gave me a 5.8% speedup on my reference R9 290 card (with Stilt bios),
from 290.2 to 307 Kh/s at 800/1500 core/mem freq on Ubuntu 12.04 with stock drivers.
I didn't try it on my R9 280X cards, so please post your results if you try this.

You will have to mod the kernel as per the code below.
The bottleneck in this kernel is the way it stores the 128 intermediate results of chacha and salsa in global memory.
By making the change below, you reduce stalls/latency by not making reads/writes to the same or adjacent memory banks.

Change:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx << 1] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[(idx << 1) + 1] = ((ulong16 *)X)[1];
}

void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx << 1];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[(idx << 1) + 1];
}

To:
void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}


CORRECTION:

That made an 8 kh/s increase on my 290s, from 341 to 349 kh/s. (dumb azz me, I forgot to delete the bin)
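To see what the index change in the quoted ScratchpadStore/ScratchpadMix actually does, here is a small host-side C sketch of the two layouts. The names are hypothetical, and the 128-entry count is taken from the "128 intermediate results" in the post. It only checks that each layout uses all 256 slots exactly once; it says nothing about speed.

```c
#include <string.h>

#define PAIRS 128          /* idx values 0..127 */
#define SLOTS (2 * PAIRS)  /* ulong16 slots in V */

/* Original layout: the two halves of entry idx sit next to each other. */
static int interleaved_slot(int idx, int half)
{
    return (idx << 1) + half;
}

/* Modified layout: first halves in slots 0..127, second halves in 128..255. */
static int split_slot(int idx, int half)
{
    return idx + half * PAIRS;
}

/* Returns 1 if the layout maps the 256 (idx, half) pairs onto 256 distinct slots. */
static int layout_is_bijection(int (*slot)(int, int))
{
    unsigned char seen[SLOTS];
    memset(seen, 0, sizeof seen);
    for (int idx = 0; idx < PAIRS; idx++) {
        for (int half = 0; half < 2; half++) {
            int s = slot(idx, half);
            if (s < 0 || s >= SLOTS || seen[s])
                return 0;
            seen[s] = 1;
        }
    }
    return 1;
}
```

Both layouts pass the check, so either one works as long as the store and the mix agree. An inlined SMix that writes V[i++] is using the interleaved layout and would need its reads and writes changed together.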
newbie
Activity: 51
Merit: 0
Not working well with Wolf0's Hawaii mod. Hash rate dropped from 339kh/s to 320kh/s.
member
Activity: 158
Merit: 10
Thanks, it increased from 317 to 324 on the 290X.
hero member
Activity: 896
Merit: 1000
Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem. 550 watts at the wall.
Orig neoscrypt kernel (Kh/s)
  301
  296
  287
My "improved" kernel
  295
  289
  276
Yikes! I got worse performance on the 280X!
Sorry guys. This "improvement", as it stands, seems to apply to the 290 only.


+1
On my 280X I get 308 with "improved" vs. 317 before.

The drop is about 5% compared to the old kernel on the 7970. Maybe this "improved kernel" works only for the 290, which has larger memory and more cores.
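For comparing the before/after numbers posted above on a common scale, here is a tiny C helper. The function name is made up for illustration and it does nothing beyond the usual percent-change formula.

```c
/* Percent change from an old hashrate to a new one; negative means slower. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

With the three 280X cards quoted above (301 to 295, 296 to 289, 287 to 276) this gives roughly -2.0%, -2.4% and -3.8%, and the 317 to 308 report works out to about -2.8%.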
sr. member
Activity: 539
Merit: 255
So wolf0 isn't the only one with a kernel mod that works better on the 290's than anything else..
hero member
Activity: 935
Merit: 1001
I don't always drink...
I concur.  After further testing with suggested settings I was getting HW errors.  Back to the drawing board...
full member
Activity: 279
Merit: 104
Hi again,
Now I tried the "improved" kernel on my own 280X rig.
3 cards, all running 1000/1500 core/mem. 550 watts at the wall.
Orig neoscrypt kernel (Kh/s)
  301
  296
  287
My "improved" kernel
  295
  289
  276
Yikes! I got worse performance on the 280X!
Sorry guys. This "improvement", as it stands, seems to apply to the 290 only.
newbie
Activity: 57
Merit: 0
Try to remove thread-concurrency from config, so sgminer can calculate it from the xintensity.

Edit: Oh, and worksize 128  or even 64 is probably faster.