Topic: [ANN][X11/X13] X11 (Darkcoin)/X13 (Marucoin) miner (based on sph-sgminer) (Read 351507 times)

member
Activity: 81
Merit: 1002
It was only the wind.
The latest x13 optimisations are not compatible with 6xxx and 5xxx Radeons.
The best way is to run one instance with the -d switch pointing to your 7970, and another instance using x13modold/marucoin-modold, again with -d but pointing to the 6950 card.
The best speed achieved by a 6970 is ~1.4 MH/s for x13.

I'm mining with a 280x using sgminer 4.2.2-298-g3bb4 with wolf's kernel and got 8.2 MH/s for a single card and 34 MH/s with four 280x cards.
Here is my .bat file for a single card:
sgminer.exe --kernel darkcoin-mod --api-listen -o stratum+tcp://cann.suprnova.cc:4442 -u xxxx -p xxxxx -w 64 -g 2 --thread-concurrency 8192 --intensity 21 --lookup-gap 2 --no-submit-stale --gpu-powertune 20 --gpu-fan 55 --temp-cutoff 95 --gpu-engine 1150 --gpu-memclock 1450
From the sgminer screen:
sgminer 4.2.2-298-g3bb4 - Started: [2015-02-06 01:21:49] - [0 days 00:24:53]
--------------------------------------------------------------------------------
(5s):8.017M (avg):6.683Mh/s | A:2  R:0  HW:0  WU:0.094/m
ST: 2  SS: 5  NB: 25  LW: 1557  GF: 0  RF: 0
Connected to cann.suprnova.cc (stratum) diff 0.022 as user xxxxxxxx
Block: 4bd89bcc...  Diff:37  Started: [01:46:42] .

Can you link that kernel, please?

He probably bought it, so he might not.
member
Activity: 81
Merit: 1002
It was only the wind.
As a Frenchman, I like revolutions, so whose head shall we cut?

X11 is like a banana republic: there is a revolution every now and then.
Really want to revolutionize things? Rewrite sgminer from scratch.

A piece of advice: stop mistaking a pissing contest for a revolution...

God, SGMiner DOES need rewriting...

What would be the advantages of rewriting SGMiner?

You've obviously never seen the code. One thing that holds back development is that no one wants to touch it. CGMiner was bad enough, and that is horrid - now we have more misfit pieces of code grafted onto it. It works, but that's about the only good thing you can say about that code.

Everybody uses it because, as far as I know, it's the best at handling OpenCL (so AMD cards) ;-)

No, everybody uses it because there's no reasonable alternative. It's not the best at OpenCL, it's the ONLY one - otherwise you must write it from scratch.
newbie
Activity: 8
Merit: 0
I see this as less of a problem. Let's say he has 1% of the overall network hashrate (180 khash/s): he could only get those spammy transactions into about 1 in 100 of the blocks found. This also assumes everyone has updated to the latest code.
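The estimate above can be checked with a little arithmetic (180 khash/s at 1% implies a network of roughly 18 Mhash/s). Here is a minimal sketch in C, assuming that block discovery is proportional to hashrate share and that blocks are found independently. The function names are made up for illustration. With a share p, the miner expects p*n of the next n blocks, and the chance of finding at least one of them is 1 - (1 - p)^n.

```c
/* Expected number of blocks found by a miner holding `share` (0..1)
 * of the network hashrate, out of `blocks` blocks in total.
 * Assumption: block discovery is proportional to hashrate share. */
double expected_blocks(double share, unsigned blocks)
{
    return share * (double)blocks;
}

/* Probability that the same miner finds at least one of the next
 * `blocks` blocks, treating each block as an independent trial. */
double prob_at_least_one(double share, unsigned blocks)
{
    double miss = 1.0;                 /* probability of finding none so far */
    for (unsigned i = 0; i < blocks; ++i)
        miss *= 1.0 - share;
    return 1.0 - miss;
}
```

For a 1% share, expected_blocks(0.01, 100) is about one block per hundred, while prob_at_least_one(0.01, 100) comes out near 0.63, so even a small miner does get a block in now and then.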
legendary
Activity: 2716
Merit: 1094
Black Belt Developer
Help! Why am I getting an extreme LW count? What does this mean? Is it mining for someone else? Please help.

This miner is obsolete; please use sgminer and the related thread:

https://bitcointalksearch.org/topic/ann-sgminer-v5-optimized-x11x13neoscryptlyra2reetc-kernel-switch-miner-632503
hero member
Activity: 687
Merit: 502
I am considering throwing a few old GPU rigs at x13 (15-20 R9 280/290x cards). Is there a stupid-simple miner I can run?
With the miner linked in this article I get about 13.5-14 MH/s with some overclocking while mining Quark at NiceHash with my 280x cards.
Either that or mining ETH should give you some pretty decent profits.

http://cryptomining-blog.com/4819-new-sgminer-with-optimized-quark-and-qubit-kernels/
member
Activity: 110
Merit: 10
Please forgive me, this may be a dumb question.
Can I use that sgminer to run my Gridseeds?
If so, what would the batch file look like?
hero member
Activity: 896
Merit: 1000
SGminer is simple and the most popular.
sr. member
Activity: 434
Merit: 250
I tried your neoscrypt kernel file, it does not compile on my GPU. Can you upload a bin file?

Which sgminer version do you use? Link?
legendary
Activity: 1092
Merit: 1004
What are currently the best hashrates one can get with a 7950 or 280x?

I don't know about the 7950, but the 280x gets about 6.6 MH/s with an overclock. This is on Windows 7 or 8; I don't know about Linux distros.

You need to use wolf0's old modded kernel and bins leaked by LovesToShare on November 30: http://www.filedropper.com/optmizedsgminerkernels

I see you posted on the other thread as well: https://bitcointalk.org/index.php?topic=854257.320

X11: 6.6 MH/s overclocked (wolf0 screenshot).
X13: I don't know, but an R9 290 gets 5.1 MH/s not overclocked (my own card).

There is a modded kernel for neoscrypt that gives an extra 4% on the 280x, again from wolf0.

Copy and paste this replacement into the neoscrypt kernel file (delete the old contents).

// NeoScrypt(128, 2, 1) with Salsa20/20 and ChaCha20/20

// Stupid AMD compiler ignores the unroll pragma in these two
#define SALSA_SMALL_UNROLL 3
#define CHACHA_SMALL_UNROLL 3

// If SMALL_BLAKE2S is defined, BLAKE2S_UNROLL is interpreted
// as the unroll factor; must divide cleanly into ten.
// Usually a bad idea.
//#define SMALL_BLAKE2S
//#define BLAKE2S_UNROLL 5

#define BLOCK_SIZE           64U
#define FASTKDF_BUFFER_SIZE 256U
#ifndef PASSWORD_LEN
#define PASSWORD_LEN         80U
#endif

#if !defined(cl_khr_byte_addressable_store)
#error "Device does not support unaligned stores"
#endif

// Swaps 128 bytes at a time without using temp vars
void SwapBytes128(void *restrict A, void *restrict B, uint len)
{
    #pragma unroll 2
    for(int i = 0; i < (len >> 7); ++i)
    {
        ((ulong16 *)A)[i] ^= ((ulong16 *)B)[i];
        ((ulong16 *)B)[i] ^= ((ulong16 *)A)[i];
        ((ulong16 *)A)[i] ^= ((ulong16 *)B)[i];
    }
}

void CopyBytes128(void *restrict dst, const void *restrict src, uint len)
{
    #pragma unroll 2
    for(int i = 0; i < len; ++i)
        ((ulong16 *)dst)[i] = ((ulong16 *)src)[i];
}

void CopyBytes(void *restrict dst, const void *restrict src, uint len)
{
    for(int i = 0; i < len; ++i)
        ((uchar *)dst)[i] = ((uchar *)src)[i];
}

//
// a bit of byte alignment checking goes a long ways...
//
void XORBytesInPlace(void *restrict dst, const void *restrict src, uint mod)
{
    switch(mod % 4)
    {
    case 0:
        #pragma unroll 2
        for(int i = 0; i < 4; i += 2)
        {
            ((uint2 *)dst)[i] ^= ((uint2 *)src)[i];
            ((uint2 *)dst)[i+1] ^= ((uint2 *)src)[i+1];
        }
        break;

    case 2:
        #pragma unroll 8
        for(int i = 0; i < 16; i += 2)
        {
            ((uchar2 *)dst)[i] ^= ((uchar2 *)src)[i];
            ((uchar2 *)dst)[i+1] ^= ((uchar2 *)src)[i+1];
        }
        break;

    default:
        #pragma unroll 8
        for(int i = 0; i < 31; i += 4)
        {
            ((uchar *)dst)[i] ^= ((uchar *)src)[i];
            ((uchar *)dst)[i+1] ^= ((uchar *)src)[i+1];
            ((uchar *)dst)[i+2] ^= ((uchar *)src)[i+2];
            ((uchar *)dst)[i+3] ^= ((uchar *)src)[i+3];
        }
    }
}

void XORBytes(void *restrict dst, const void *restrict src1, const void *restrict src2, uint len)
{
    #pragma unroll 1
    for(int i = 0; i < len; ++i)
        ((uchar *)dst)[i] = ((uchar *)src1)[i] ^ ((uchar *)src2)[i];
}

// Blake2S

#define BLAKE2S_BLOCK_SIZE    64U
#define BLAKE2S_OUT_SIZE      32U
#define BLAKE2S_KEY_SIZE      32U

static const __constant uint BLAKE2S_IV[8] =
{
    0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
    0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19
};

static const __constant uchar BLAKE2S_SIGMA[10][16] =
{
    {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 } ,
    { 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 } ,
    { 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 } ,
    {  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 } ,
    {  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 } ,
    {  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 } ,
    { 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 } ,
    { 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 } ,
    {  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 } ,
    { 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13 , 0 } ,
};

#define BLAKE_G(idx0, idx1, a, b, c, d, key)    do { \
    a += b + key[BLAKE2S_SIGMA[idx0][idx1]]; \
    d = rotate(d ^ a, 16U); \
    c += d; \
    b = rotate(b ^ c, 20U); \
    a += b + key[BLAKE2S_SIGMA[idx0][idx1 + 1]]; \
    d = rotate(d ^ a, 24U); \
    c += d; \
    b = rotate(b ^ c, 25U); \
} while(0)

void Blake2S(uint *restrict inout, const uint *restrict inkey)
{
    uint16 V;
    uint8 tmpblock;
    
    // Load first block (IV into V.lo) and constants (IV into V.hi)
    V.lo = V.hi = vload8(0U, BLAKE2S_IV);
    
    // XOR with initial constant
    V.s0 ^= 0x01012020;
    
    // Copy input block for later
    tmpblock = V.lo;
    
    // XOR length of message so far (including this block)
    // There are two uints for this field, but high uint is zero
    V.sc ^= BLAKE2S_BLOCK_SIZE;
    
    // Compress state, using the key as the key
    #ifdef SMALL_BLAKE2S
    #pragma unroll BLAKE2S_UNROLL
    #else
    #pragma unroll
    #endif
    for(int x = 0; x < 10; ++x)
    {
        BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inkey);
        BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inkey);
        BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inkey);
        BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inkey);
        BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inkey);
        BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inkey);
        BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inkey);
        BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inkey);
    }
    
    // XOR low part of state with the high part,
    // then with the original input block.
    V.lo ^= V.hi ^ tmpblock;
    
    // Load constants (IV into V.hi)
    V.hi = vload8(0U, BLAKE2S_IV);
    
    // Copy input block for later
    tmpblock = V.lo;
    
    // XOR length of message into block again
    V.sc ^= BLAKE2S_BLOCK_SIZE << 1;
    
    // Last block compression - XOR final constant into state
    V.se ^= 0xFFFFFFFFU;
    
    // Compress block, using the input as the key
    #ifdef SMALL_BLAKE2S
    #pragma unroll BLAKE2S_UNROLL
    #else
    #pragma unroll
    #endif
    for(int x = 0; x < 10; ++x)
    {
        BLAKE_G(x, 0x00, V.s0, V.s4, V.s8, V.sc, inout);
        BLAKE_G(x, 0x02, V.s1, V.s5, V.s9, V.sd, inout);
        BLAKE_G(x, 0x04, V.s2, V.s6, V.sa, V.se, inout);
        BLAKE_G(x, 0x06, V.s3, V.s7, V.sb, V.sf, inout);
        BLAKE_G(x, 0x08, V.s0, V.s5, V.sa, V.sf, inout);
        BLAKE_G(x, 0x0A, V.s1, V.s6, V.sb, V.sc, inout);
        BLAKE_G(x, 0x0C, V.s2, V.s7, V.s8, V.sd, inout);
        BLAKE_G(x, 0x0E, V.s3, V.s4, V.s9, V.se, inout);
    }
    
    // XOR low part of state with high part, then with input block
    V.lo ^= V.hi ^ tmpblock;
    
    // Store result in input/output buffer
    vstore8(V.lo, 0, inout);
}

/* FastKDF, a fast buffered key derivation function:
 * FASTKDF_BUFFER_SIZE must be a power of 2;
 * password_len, salt_len and output_len should not exceed FASTKDF_BUFFER_SIZE;
 * prf_output_size must be <= prf_key_size; */
void fastkdf(const uchar *restrict password, const uchar *restrict salt, const uint salt_len, uchar *restrict output, uint output_len)
{

    /*                    WARNING!
     * This algorithm uses byte-wise addressing for memory blocks.
     * Or in other words, trying to copy an unaligned memory region
     * will significantly slow down the algorithm, when copying uses
     * words or bigger entities. It even may corrupt the data, when
     * the device does not support it properly.
     * Therefore use byte copying, which will not be the fastest but at
     * least gets reliable results. */

    // BLOCK_SIZE            64U
    // FASTKDF_BUFFER_SIZE  256U
    // BLAKE2S_BLOCK_SIZE    64U
    // BLAKE2S_KEY_SIZE      32U
    // BLAKE2S_OUT_SIZE      32U
    uchar bufidx = 0;
    uint8 Abuffer[9], Bbuffer[9] = { (uint8)(0) };
    uchar *A = (uchar *)Abuffer, *B = (uchar *)Bbuffer;
        
    // Initialize the password buffer
    #pragma unroll 1
    for(int i = 0; i < (FASTKDF_BUFFER_SIZE >> 3); ++i) ((ulong *)A)[i] = ((ulong *)password)[i % 10];
    
    ((uint16 *)(A + FASTKDF_BUFFER_SIZE))[0] = ((uint16 *)password)[0];

    // Initialize the salt buffer
    if(salt_len == FASTKDF_BUFFER_SIZE)
    {
        ((ulong16 *)B)[0] = ((ulong16 *)B)[2] = ((ulong16 *)salt)[0];
        ((ulong16 *)B)[1] = ((ulong16 *)B)[3] = ((ulong16 *)salt)[1];
    }
    else
    {
        // salt_len is 80 bytes here
        #pragma unroll 1
        for(int i = 0; i < (FASTKDF_BUFFER_SIZE >> 3); ++i) ((ulong *)B)[i] = ((ulong *)salt)[i % 10];
                
        // Initialized the rest to zero earlier
        #pragma unroll 1
        for(int i = 0; i < 10; ++i) ((ulong *)(B + FASTKDF_BUFFER_SIZE))[i] = ((ulong *)salt)[i];
    }
    
    // The primary iteration
    #pragma unroll 1
    for(int i = 0; i < 32; ++i)
    {
        // Make the key buffer twice the size of the key so it fits a Blake2S block
        // This way, we don't need a temp buffer in the Blake2S function.
        uchar input[BLAKE2S_BLOCK_SIZE], key[BLAKE2S_BLOCK_SIZE] = { 0 };
        
        // Copy input and key to their buffers
        CopyBytes(input, A + bufidx, BLAKE2S_BLOCK_SIZE);
        CopyBytes(key, B + bufidx, BLAKE2S_KEY_SIZE);
        
        // PRF
        Blake2S((uint *)input, (uint *)key);

        // Calculate the next buffer pointer
        bufidx = 0;
        
        for(int x = 0; x < BLAKE2S_OUT_SIZE; ++x)
            bufidx += input[x];
       
        // bufidx is a uchar now - always mod 256
        //bufidx &= (FASTKDF_BUFFER_SIZE - 1);
        
        // Modify the salt buffer        
        XORBytesInPlace(B + bufidx, input, bufidx);
        
        if(bufidx < BLAKE2S_KEY_SIZE)
        {
            // Head modified, tail updated
            // this was made off the original code... wtf
            //CopyBytes(B + FASTKDF_BUFFER_SIZE + bufidx, B + bufidx, min(BLAKE2S_OUT_SIZE, BLAKE2S_KEY_SIZE - bufidx));
            CopyBytes(B + FASTKDF_BUFFER_SIZE + bufidx, B + bufidx, BLAKE2S_KEY_SIZE - bufidx);
        }
        else if((FASTKDF_BUFFER_SIZE - bufidx) < BLAKE2S_OUT_SIZE)
        {
            // Tail modified, head updated
            CopyBytes(B, B + FASTKDF_BUFFER_SIZE, BLAKE2S_OUT_SIZE - (FASTKDF_BUFFER_SIZE - bufidx));
        }
    }

    // Modify and copy into the output buffer
    
    // Damned compiler crashes
    // Fuck you, AMD
    
    //for(uint i = 0; i < output_len; ++i, ++bufidx)
    //    output[i] = B[bufidx] ^ A[i];
    
    uint left = FASTKDF_BUFFER_SIZE - bufidx;
    //uint left = (~bufidx) + 1
    
    if(left < output_len)
    {        
        XORBytes(output, B + bufidx, A, left);
        XORBytes(output + left, B, A + left, output_len - left);
    }
    else
    {
        XORBytes(output, B + bufidx, A, output_len);
    }    
}

#define SALSA_CORE(state)    do { \
    state.s4 ^= rotate(state.s0 + state.sc, 7U); state.s8 ^= rotate(state.s4 + state.s0, 9U); state.sc ^= rotate(state.s8 + state.s4, 13U); state.s0 ^= rotate(state.sc + state.s8, 18U); \
    state.s9 ^= rotate(state.s5 + state.s1, 7U); state.sd ^= rotate(state.s9 + state.s5, 9U); state.s1 ^= rotate(state.sd + state.s9, 13U); state.s5 ^= rotate(state.s1 + state.sd, 18U); \
    state.se ^= rotate(state.sa + state.s6, 7U); state.s2 ^= rotate(state.se + state.sa, 9U); state.s6 ^= rotate(state.s2 + state.se, 13U); state.sa ^= rotate(state.s6 + state.s2, 18U); \
    state.s3 ^= rotate(state.sf + state.sb, 7U); state.s7 ^= rotate(state.s3 + state.sf, 9U); state.sb ^= rotate(state.s7 + state.s3, 13U); state.sf ^= rotate(state.sb + state.s7, 18U); \
    state.s1 ^= rotate(state.s0 + state.s3, 7U); state.s2 ^= rotate(state.s1 + state.s0, 9U); state.s3 ^= rotate(state.s2 + state.s1, 13U); state.s0 ^= rotate(state.s3 + state.s2, 18U); \
    state.s6 ^= rotate(state.s5 + state.s4, 7U); state.s7 ^= rotate(state.s6 + state.s5, 9U); state.s4 ^= rotate(state.s7 + state.s6, 13U); state.s5 ^= rotate(state.s4 + state.s7, 18U); \
    state.sb ^= rotate(state.sa + state.s9, 7U); state.s8 ^= rotate(state.sb + state.sa, 9U); state.s9 ^= rotate(state.s8 + state.sb, 13U); state.sa ^= rotate(state.s9 + state.s8, 18U); \
    state.sc ^= rotate(state.sf + state.se, 7U); state.sd ^= rotate(state.sc + state.sf, 9U); state.se ^= rotate(state.sd + state.sc, 13U); state.sf ^= rotate(state.se + state.sd, 18U); \
} while(0)

uint16 salsa_small_scalar_rnd(uint16 X)
{
    uint16 st = X;
    
    #if SALSA_SMALL_UNROLL == 1
    
    for(int i = 0; i < 10; ++i)
    {
        SALSA_CORE(st);
    }
    
    #elif SALSA_SMALL_UNROLL == 2
    
    for(int i = 0; i < 5; ++i)
    {
        SALSA_CORE(st);
        SALSA_CORE(st);
    }
    
    #elif SALSA_SMALL_UNROLL == 3
    
    for(int i = 0; i < 4; ++i)
    {
        SALSA_CORE(st);
        if(i == 3) break;
        SALSA_CORE(st);
        SALSA_CORE(st);
    }
    
    #elif SALSA_SMALL_UNROLL == 4
    
    for(int i = 0; i < 3; ++i)
    {
        SALSA_CORE(st);
        SALSA_CORE(st);
        if(i == 2) break;
        SALSA_CORE(st);
        SALSA_CORE(st);
    }
    
    #else
    
    for(int i = 0; i < 2; ++i)
    {
        SALSA_CORE(st);
        SALSA_CORE(st);
        SALSA_CORE(st);
        SALSA_CORE(st);
        SALSA_CORE(st);
    }
    
    #endif
    
    return(X + st);
}

#define CHACHA_CORE_PARALLEL(state)    do { \
    state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
    state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(12U, 12U, 12U, 12U)); \
    state[0] += state[1]; state[3] = rotate(state[3] ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
    state[2] += state[3]; state[1] = rotate(state[1] ^ state[2], (uint4)(7U, 7U, 7U, 7U)); \
    \
    state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(16U, 16U, 16U, 16U)); \
    state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(12U, 12U, 12U, 12U)); \
    state[0] += state[1].yzwx; state[3].wxyz = rotate(state[3].wxyz ^ state[0], (uint4)(8U, 8U, 8U, 8U)); \
    state[2].zwxy += state[3].wxyz; state[1].yzwx = rotate(state[1].yzwx ^ state[2].zwxy, (uint4)(7U, 7U, 7U, 7U)); \
} while(0)

uint16 chacha_small_parallel_rnd(uint16 X)
{
    uint4 t, st[4];
    
    ((uint16 *)st)[0] = X;
    
    #if CHACHA_SMALL_UNROLL == 1
    
    for(int i = 0; i < 10; ++i)
    {
        CHACHA_CORE_PARALLEL(st);
    }
    
    #elif CHACHA_SMALL_UNROLL == 2
    
    for(int i = 0; i < 5; ++i)
    {
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
    }
    
    #elif CHACHA_SMALL_UNROLL == 3
    
    for(int i = 0; i < 4; ++i)
    {
        CHACHA_CORE_PARALLEL(st);
        if(i == 3) break;
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
    }
    
    #elif CHACHA_SMALL_UNROLL == 4
    
    for(int i = 0; i < 3; ++i)
    {
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
        if(i == 2) break;
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
    }
    
    #else
    
    for(int i = 0; i < 2; ++i)
    {
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
        CHACHA_CORE_PARALLEL(st);
    }
    
    #endif
    
    return(X + ((uint16 *)st)[0]);
}

void neoscrypt_blkmix(uint16 *XV, bool alg)
{

    /* NeoScrypt flow:                   Scrypt flow:
         Xa ^= Xd;  M(Xa'); Ya = Xa";      Xa ^= Xb;  M(Xa'); Ya = Xa";
         Xb ^= Xa"; M(Xb'); Yb = Xb";      Xb ^= Xa"; M(Xb'); Yb = Xb";
         Xc ^= Xb"; M(Xc'); Yc = Xc";      Xa" = Ya;
         Xd ^= Xc"; M(Xd'); Yd = Xd";      Xb" = Yb;
         Xa" = Ya; Xb" = Yc;
         Xc" = Yb; Xd" = Yd; */
    
    XV[0] ^= XV[3];
    
    if(!alg)
    {
        XV[0] = salsa_small_scalar_rnd(XV[0]); XV[1] ^= XV[0];
        XV[1] = salsa_small_scalar_rnd(XV[1]); XV[2] ^= XV[1];
        XV[2] = salsa_small_scalar_rnd(XV[2]); XV[3] ^= XV[2];
        XV[3] = salsa_small_scalar_rnd(XV[3]);
    }
    else
    {
        XV[0] = chacha_small_parallel_rnd(XV[0]); XV[1] ^= XV[0];
        XV[1] = chacha_small_parallel_rnd(XV[1]); XV[2] ^= XV[1];
        XV[2] = chacha_small_parallel_rnd(XV[2]); XV[3] ^= XV[2];
        XV[3] = chacha_small_parallel_rnd(XV[3]);
    }
    
    XV[1] ^= XV[2];
    XV[2] ^= XV[1];
    XV[1] ^= XV[2];
}

void ScratchpadStore(__global void *V, void *X, uchar idx)
{
   ((__global ulong16 *)V)[idx] = ((ulong16 *)X)[0];
   ((__global ulong16 *)V)[idx + 128] = ((ulong16 *)X)[1];
}
void ScratchpadMix(void *X, const __global void *V, uchar idx)
{
   ((ulong16 *)X)[0] ^= ((__global ulong16 *)V)[idx];
   ((ulong16 *)X)[1] ^= ((__global ulong16 *)V)[idx + 128];
}

void SMix(uint16 *X, __global uint16 *V, bool flag)
{
    #pragma unroll 1
    for(int i = 0; i < 128; ++i)
    {
        ScratchpadStore(V, X, i);
        neoscrypt_blkmix(X, flag);
    }
    
    #pragma unroll 1
    for(int i = 0; i < 128; ++i)
    {
        const uint idx = convert_uchar(((uint *)X)[48] & 0x7F);
        ScratchpadMix(X, V, idx);
        neoscrypt_blkmix(X, flag);
    }
}

__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
__kernel void search(__global const uchar* restrict input, __global uint* restrict output, __global uchar *padcache, const uint target)
{
#define CONSTANT_N 128
#define CONSTANT_r 2
    // X = CONSTANT_r * 2 * BLOCK_SIZE(64); Z is a copy of X for ChaCha
    uint16 X[4], Z[4];
    /* V = CONSTANT_N * CONSTANT_r * 2 * BLOCK_SIZE */
    __global ulong16 *V = (__global ulong16 *)(padcache + (0x8000 * (get_global_id(0) % MAX_GLOBAL_THREADS)));
    uchar outbuf[32];
    uchar data[PASSWORD_LEN];
    
    ((ulong8 *)data)[0] = ((__global const ulong8 *)input)[0];
    ((ulong *)data)[8] = ((__global const ulong *)input)[8];
    ((uint *)data)[18] = ((__global const uint *)input)[18];
    ((uint *)data)[19] = get_global_id(0);
    
    // X = KDF(password, salt)
    fastkdf(data, data, PASSWORD_LEN, (uchar *)X, 256);
    
    // Process ChaCha 1st, Salsa 2nd and XOR them - run that through PBKDF2
    CopyBytes128(Z, X, 2);
    
    // X = SMix(X); X & Z are swapped, repeat.    
    for(bool flag = false;; ++flag)
    {
        SMix(X, V, flag);
        if(flag) break;
        SwapBytes128(X, Z, 256);
    }
    
    // blkxor(X, Z)
    ((ulong16 *)X)[0] ^= ((ulong16 *)Z)[0];
    ((ulong16 *)X)[1] ^= ((ulong16 *)Z)[1];
    
    // output = KDF(password, X)
    fastkdf(data, (uchar *)X, FASTKDF_BUFFER_SIZE, outbuf, 32);
    if(((uint *)outbuf)[7] <= target) output[atomic_add(output + 0xFF, 1)] = get_global_id(0);
}
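
For readers unfamiliar with the trick used in SwapBytes128 above: three XORs exchange two values without a temporary. The following is a host-side sketch in plain C, not part of the kernel. It uses 64-bit words instead of OpenCL's ulong16, and the function name is made up for illustration. It only works when the two buffers do not overlap, which is what the restrict qualifiers in the kernel promise.

```c
#include <stddef.h>
#include <stdint.h>

/* Swap `words` 64-bit words between a and b using three XORs.
 * The buffers must not overlap: XOR-swapping a word with itself zeroes it. */
void xor_swap_words(uint64_t *restrict a, uint64_t *restrict b, size_t words)
{
    for (size_t i = 0; i < words; ++i) {
        a[i] ^= b[i];   /* a now holds a ^ b              */
        b[i] ^= a[i];   /* b = b ^ (a ^ b) = original a   */
        a[i] ^= b[i];   /* a = (a ^ b) ^ a = original b   */
    }
}
```

Calling xor_swap_words(x, y, n) on two separate arrays leaves x with the old contents of y and the other way round, which is how the kernel exchanges X and Z between the two SMix passes.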

Delete the old neoscrypt bin file; the newly created bin will be 4% faster.
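
A side note on the SALSA_SMALL_UNROLL == 3 and == 4 branches in the kernel, which look odd because ten rounds do not divide evenly by 3 or 4. The sketch below mirrors only their control flow in plain C, with a counter standing in for SALSA_CORE. The function names are hypothetical. Under that assumption, both variants still perform exactly ten rounds.

```c
/* Mirrors the loop shape of the SALSA_SMALL_UNROLL == 3 branch,
 * counting rounds instead of running SALSA_CORE. */
int count_rounds_unroll3(void)
{
    int rounds = 0;
    for (int i = 0; i < 4; ++i) {
        ++rounds;            /* first round of this pass          */
        if (i == 3) break;   /* the fourth pass stops after one   */
        ++rounds;
        ++rounds;
    }
    return rounds;
}

/* Same idea for the SALSA_SMALL_UNROLL == 4 branch. */
int count_rounds_unroll4(void)
{
    int rounds = 0;
    for (int i = 0; i < 3; ++i) {
        ++rounds;
        ++rounds;
        if (i == 2) break;   /* the third pass stops after two    */
        ++rounds;
        ++rounds;
    }
    return rounds;
}
```

The early break is what lets a loop body of three or four rounds add up to ten, so the unroll factor changes code size but not the number of rounds.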
That is all I found myself.

Please remember, you can buy better mods directly from wolf0. He has an X13 mod for sale that gives another 50% boost to the X13 hashrate. He is not selling his latest neoscrypt algo.
legendary
Activity: 1792
Merit: 1008
/dev/null
Has anyone succeeded in getting 2 GPUs (AMD 6950 / 7970) working with the X13 algo?
If so, please show me your config file!

I've tried the kernel x13modold and it doesn't work...

Any advice?

It does, even with 5xxx cards.
Tell us what isn't working and provide more details.
sr. member
Activity: 457
Merit: 273
Guys, why are you keeping this thread alive? All the excellent work that has been done by the OP author - lasybear and other developers has already been included in the "official" sgminer release (it's a community work, therefore I guess it can be called in any way, if you don't like "official" then it can be the "sgminer-dev" version). Information gets lost if it is kept in several separate threads, therefore I welcome you to discuss topics regarding sgminer and related stuff in this thread: https://bitcointalksearch.org/topic/ann-sgminer-v5-optimized-x11x13neoscryptlyra2reetc-kernel-switch-miner-632503

If you agree, we could consider this thread closed and continue the debate here: https://bitcointalk.org/index.php?topic=632503.new#new