Author

Topic: Phatk2 Mod (Already seeing improvement!) (Read 8087 times)

sr. member
Activity: 378
Merit: 250
March 11, 2012, 05:11:53 PM
#54
Maybe, I'll have to work on it.  Right now, it's been on the back burner while I get some medical things sorted out.  If you or anyone else wants to work on it, it's up for grabs.
legendary
Activity: 1344
Merit: 1004
Any news?

interested in this too. I would test the kernel, but really has to be in a format where I can just put the kernel files where the other kernels are and just use it. no copy pasting of code, no special configuring inside the kernel (aside from kernel-specific options like setting GOFFSET to true), because I wouldn't know what to do. This is probably the case for other people that are interested too: they see you worked on phatk2 and see positive results and want to use it, but have no clue how to implement it.

Can we please get a kernel.cl and __init__.py, preferably phoenix 2.0 beta compatible?
full member
Activity: 196
Merit: 100
Any news?
sr. member
Activity: 378
Merit: 250
February 22, 2012, 05:26:08 PM
#51
By the way, my latest changes to DiaKGCN have introduced the same problem and VECTORS8 is now unusable slow ^^. Well some changes require to revert them, right Wink?

Dia
Depends on what you're trying to do.  If it's possible to asynchronously bring the vectors out of global memory, it could be possible to avoid reversions.  But I think the reason why you were able to use VECTORS8 in the first place was because you were reusing your variables via += instead of creating new ones.
hero member
Activity: 772
Merit: 500
February 22, 2012, 10:32:34 AM
#50
By the way, my latest changes to DiaKGCN have introduced the same problem and VECTORS8 is now unusable slow ^^. Well some changes require to revert them, right Wink?

Dia
sr. member
Activity: 378
Merit: 250
February 21, 2012, 08:34:08 PM
#49
I'm really wanting to clean up the code and get rid of some of the multiple large vector variables that remain memory resident.  I'll be overwriting them as it progresses through so as to lower the memory footprint and increase available vector sizes without the memory spill.
sr. member
Activity: 378
Merit: 250
February 20, 2012, 03:31:39 PM
#48
I think you have got a little bug in your VEC8 code.

This:
Code:
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};

should be replaced with:
Code:
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA014000U, r ^ 0xC018000U, r ^ 0xE01C000U};

Dia
Dang it, I fixed that before, I guess I must have reverted it on another foul-up while playing with the code.  Thanks for pointing it out.  I wondered why my code suddenly didn't work with that option.  Either way, I'm still ending up with it spilling over into memory.
hero member
Activity: 772
Merit: 500
February 20, 2012, 12:53:30 PM
#47
I think you have got a little bug in your VEC8 code.

This:
Code:
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};

should be replaced with:
Code:
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA014000U, r ^ 0xC018000U, r ^ 0xE01C000U};

Dia
sr. member
Activity: 378
Merit: 250
February 19, 2012, 10:53:54 PM
#46
Well, big surprise--it didn't work.  I made a rookie mistake of using a linear function on a non-linear equation.
sr. member
Activity: 378
Merit: 250
February 18, 2012, 10:35:15 PM
#45
I'm working on another theory right now that could potentially multiply the output hashes without doing very much more work.  However, it's just a theory and it'll take some coding to verify.  I'll work on it either tonight or later tomorrow.  IF this works, you're gonna love it.   Cool
sr. member
Activity: 378
Merit: 250
February 14, 2012, 08:37:51 AM
#44
sr. member
Activity: 378
Merit: 250
February 14, 2012, 04:09:07 AM
#43
Here's the most recent changed phatk2 kernel.  I've decided to directly output any found valid hashes instead of having it write to and then read from nonce.  It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.

Stack Overflow can be fixed with a barrier. Try putting this before the output, it only adds one ALU instruction and will prevent any overflows. put as many as you want to narrow down your problem.

barrier(CLK_GLOBAL_MEM_FENCE);

Also can you please replace the useless "WORKSIZE" variable, and just use "get_local_size(0)"

The select() function works right when it's actually a vector type variable, otherwise it will just do "result = c ? b : a" instead of "result = c ? b : a"
The only problems I've run into with stack overflows were with VECTORS8 and GOFFSET=false implementation.  And I'm all for cutting out the crap and replacing with the original variable.  We'll just toss in a comment that get_local_size(0) is the same as WORKSIZE after replacing it.  It's better than define in my opinion.  I only use define when there's a fairly large string of complex math to be repeated.
Going back to stack overflow, I'm guessing that it's not able to keep up with multiple nonce being output in a row?  *sigh*  Of course.  Thanks for the tip.  Feel free to give any modified sections if you type them up.  We'll just toss them in, try them out and make them a part of the kernel if they cause for better output.
e21
member
Activity: 105
Merit: 10
February 13, 2012, 02:39:05 PM
#42
newbie
Activity: 46
Merit: 0
February 13, 2012, 08:59:57 AM
#41
Here's the most recent changed phatk2 kernel.  I've decided to directly output any found valid hashes instead of having it write to and then read from nonce.  It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.

Stack Overflow can be fixed with a barrier. Try putting this before the output, it only adds one ALU instruction and will prevent any overflows. put as many as you want to narrow down your problem.

barrier(CLK_GLOBAL_MEM_FENCE);

Also can you please replace the useless "WORKSIZE" variable, and just use "get_local_size(0)"

The select() function works right when it's actually a vector type variable, otherwise it will just do "result = c ? b : a" instead of "result = c ? b : a"
sr. member
Activity: 378
Merit: 250
February 13, 2012, 07:24:26 AM
#40
Code:
// This file is in the public domain

#ifdef VECTORS8
typedef uint8 u;
#elif defined VECTORS4
typedef uint4 u;
#elif defined VECTORS
typedef uint2 u;
#else
typedef uint u;
#endif

__constant uint K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = {
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#ifdef BITALIGN
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
#define rot(x, y) rotate(x, (uint)y)
#endif

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, use bitselect() for Ch.

#ifdef BFI_INT
// Well, slight problem... It turns out BFI_INT isn't actually exposed to
// OpenCL (or CAL IL for that matter) in any way. However, there is
// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
// amd_bytealign, takes the same inputs, and provides the same output.
// We can use that as a placeholder for BFI_INT and have the application
// patch it after compilation.

// This is the BFI_INT function
#define Ch(x, y, z) amd_bytealign(x,y,z)
// Ma can also be implemented in terms of BFI_INT...
#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
#define Ch(x, y, z) bitselect(z,y,x)
#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W[(n)] + s1(n) + ch(n)

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n); 

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search( const uint state0, const uint state1, const uint state2, const uint state3,
const uint state4, const uint state5, const uint state6, const uint state7,
const uint B1, const uint C1, const uint D1,
const uint F1, const uint G1, const uint H1,
#ifndef GOFFSET
const u base,
#endif
const uint W16, const uint W17,
const uint PreVal4, const uint PreVal0,
const uint PreW31, const uint PreW32,
const uint PreW19, const uint PreW20,
__global uint * output)
{

u W[124];
u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
u t1;

W[16] = W16;
W[17] = W17;

#ifdef VECTORS8
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 3) + (u)(0, 1, 2, 3, 4, 5, 6, 7);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};
#endif

#elif defined VECTORS4
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else

//Less dependencies to get both the local id and group id and then add them
W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#endif
#elif defined VECTORS
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 1) + (u)(0, 1,);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
W[18] = PreW20 + (u){r, r ^ 0x2004000U};
#endif
#else
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0));
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
W[18] = PreW20 + r;
#endif
#endif
//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

//Vals[0]=state0;
Vals[0] = PreVal0 + W[3];
Vals[1]=B1;
Vals[2]=C1;
Vals[3]=D1;
//Vals[4]=PreVal4;
Vals[4] = PreVal4 + W[3];
Vals[5]=F1;
Vals[6]=G1;
Vals[7]=H1;

sharoundC(4);
W[19] = PreW19 + W[3];
sharoundC(5);
W[20] = P4C(20) + P1(20);
sharoundC(6);
W[21] = P1(21);
sharoundC(7);
W[22] = P3C(22) + P1(22);
sharoundC(8);
W[23] = W[16] + P1(23);
sharoundC(9);
W[24] = W[17] + P1(24);
sharoundC(10);
W[25] = P3(25) + P1(25);
W[26] = P3(26) + P1(26);
sharoundC(11);
W[27] = P3(27) + P1(27);
W[28] = P3(28) + P1(28);
sharoundC(12);
W[29] = P3(29) + P1(29);
sharoundC(13);
W[30] = P3(30) + P2C(30) + P1(30);
W[31] = PreW31 + (P3(31) + P1(31));
sharoundC(14);
W[32] = PreW32 + (P3(32) + P1(32));
sharoundC(15);
sharound(16);
sharound(17);
sharound(18);
sharound(19);
sharound(20);
sharound(21);
sharound(22);
sharound(23);
sharound(24);
sharound(25);
sharound(26);
sharound(27);
sharound(28);
sharound(29);
sharound(30);
sharound(31);
sharound(32);
sharoundW(33);
sharoundW(34);
sharoundW(35);
sharoundW(36);
sharoundW(37);
sharoundW(38);
sharoundW(39);
sharoundW(40);
sharoundW(41);
sharoundW(42);
sharoundW(43);
sharoundW(44);
sharoundW(45);
sharoundW(46);
sharoundW(47);
sharoundW(48);
sharoundW(49);
sharoundW(50);
sharoundW(51);
sharoundW(52);
sharoundW(53);
sharoundW(54);
sharoundW(55);
sharoundW(56);
sharoundW(57);
sharoundW(58);
sharoundW(59);
sharoundW(60);
sharoundW(61);
sharoundW(62);
sharoundW(63);

W[64]=state0+Vals[0];
W[65]=state1+Vals[1];
W[66]=state2+Vals[2];
W[67]=state3+Vals[3];
W[68]=state4+Vals[4];
W[69]=state5+Vals[5];
W[70]=state6+Vals[6];
W[71]=state7+Vals[7];

const u Temp = (0xb0edbdd0U + K[0]) +  W[64];
Vals[0]=H[0];
Vals[1]=H[1];
Vals[2]=H[2];
Vals[3]=0xa54ff53aU + Temp;
Vals[4]=H[4];
Vals[5]=H[5];
Vals[6]=H[6];
Vals[7]=0x08909ae5U + Temp;

#define P124(n) P2(n) + P1(n) + P4(n)

W[80] = P2(80) + P4(80);
sharound(65);
W[81] = P1C(81) + P2(81) + P4(81);
sharound(66);
W[82] = P124(82);
sharound(67);
W[83] = P124(83);
sharound(68);
W[84] = P124(84);
sharound(69);
W[85] = P124(85);
sharound(70);
W[86] = P4(86) + P3C(86) + P2(86) + P1(86);
sharound(71);
W[87] = P4(87) + P3(87) + P2C(87) + P1(87);
sharoundC(72);
W[88] = P1(88) + P4C(88) + P3(88);
sharoundC(73);
W[89] = P3(89) + P1(89);
sharoundC(74);
W[90] = P3(90) + P1(90);
sharoundC(75);
W[91] = P3(91) + P1(91);
sharoundC(76);
W[92] = P3(92) + P1(92);
sharoundC(77);
W[93] = P3(93) + P1(93);
W[94] = P3(94) + P2C(94) + P1(94);
sharoundC(78);
W[95] = P4C(95) + P3(95) + P2(95) + P1(95);
sharoundC(79);
sharound(80);
sharound(81);
sharound(82);
sharound(83);
sharound(84);
sharound(85);
sharound(86);
sharound(87);
sharound(88);
sharound(89);
sharound(90);
sharound(91);
sharound(92);
sharound(93);
sharound(94);
sharound(95);
sharoundW(96);
sharoundW(97);
sharoundW(98);
sharoundW(99);
sharoundW(100);
sharoundW(101);
sharoundW(102);
sharoundW(103);
sharoundW(104);
sharoundW(105);
sharoundW(106);
sharoundW(107);
sharoundW(108);
sharoundW(109);
sharoundW(110);
sharoundW(111);
sharoundW(112);
sharoundW(113);
sharoundW(114);
sharoundW(115);
sharoundW(116);
sharoundW(117);
sharoundW(118);
sharoundW(119);
sharoundW(120);
sharoundW(121);
sharoundW(122);

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

// uint nonce = 0;
#ifdef VECTORS8
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
if (v.s2==g.s2)
{
output[WORKSIZE] = W[3].s2;
output[get_local_id(0)] = W[3].s2;
}
if (v.s3==g.s3)
{
output[WORKSIZE] = W[3].s3;
output[get_local_id(0)] = W[3].s3;
}
if (v.s4==g.s4)
{
output[WORKSIZE] = W[3].s4;
output[get_local_id(0)] = W[3].s4;
}
if (v.s5==g.s5)
{
output[WORKSIZE] = W[3].s5;
output[get_local_id(0)] = W[3].s5;
}
if (v.s6==g.s6)
{
output[WORKSIZE] = W[3].s6;
output[get_local_id(0)] = W[3].s6;
}
if (v.s7==g.s7)
{
output[WORKSIZE] = W[3].s7;
output[get_local_id(0)] = W[3].s7;
}
#elif defined VECTORS4
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
if (v.s2==g.s2)
{
output[WORKSIZE] = W[3].s2;
output[get_local_id(0)] = W[3].s2;
}
if (v.s3==g.s3)
{
output[WORKSIZE] = W[3].s3;
output[get_local_id(0)] = W[3].s3;
}

#elif defined VECTORS
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
#else
if (v==g)
{
output[WORKSIZE] = W[3];
output[get_local_id(0)] = W[3];
}
#endif
}
Here's the most recent changed phatk2 kernel.  I've decided to directly output any found valid hashes instead of having it write to and then read from nonce.  It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.
sr. member
Activity: 378
Merit: 250
February 12, 2012, 05:39:37 PM
#39
Nothing. Earlier termination would cost you more than going through all the checks.

You cannot use preprocessor directives for that because v and g are not known in compile time, so forget about #ifdef's, #else's, #define's and so on. I've seen such confusion from people that have been coding in interpreted languages mostly and recently switched to C.

Anyway. If I were to search for improvements in the kernel (assuming I changed vector width to Cool, perhaps the final checks is not the right place. If you have a look at the kernel, you'd notice that a lot of code has been "reordered" so that higher ALUPacking is achieved. For example sometimes several w[X] values are calculated in a row, sometimes it is done with each SHA256 round step. Another thing is order of operations in the macros, it is not random, I bet whoever coded it has profiled ALUPacking and chosen the best case. However, switching to uint8 would definitely break that. I believe you can get at least 1-2% performance improvement from tighter alupacking which is much more than what you'd get from saving several ALU ops in the final checks Smiley
I already accomplished this partially by changing around the P#(n) values as my most recent change.  I shaved off around 10 cycles.  But yeah, starting from the bottom and working my way up.  The best way seems to be to directly export the nonce to the miner without using the uint nonce.  This way, we don't worry about having more than one or checking redundant hashes at the end.  Once this problem is solved, I'll work on the bigger fish.
sr. member
Activity: 256
Merit: 250
February 12, 2012, 05:27:03 PM
#38
Nothing. Earlier termination would cost you more than going through all the checks.

You cannot use preprocessor directives for that because v and g are not known in compile time, so forget about #ifdef's, #else's, #define's and so on. I've seen such confusion from people that have been coding in interpreted languages mostly and recently switched to C.

Anyway. If I were to search for improvements in the kernel (assuming I changed vector width to Cool, perhaps the final checks is not the right place. If you have a look at the kernel, you'd notice that a lot of code has been "reordered" so that higher ALUPacking is achieved. For example sometimes several w[X] values are calculated in a row, sometimes it is done with each SHA256 round step. Another thing is order of operations in the macros, it is not random, I bet whoever coded it has profiled ALUPacking and chosen the best case. However, switching to uint8 would definitely break that. I believe you can get at least 1-2% performance improvement from tighter alupacking which is much more than what you'd get from saving several ALU ops in the final checks Smiley
sr. member
Activity: 378
Merit: 250
February 12, 2012, 08:49:34 AM
#37
No, you can't do that with predication.

"I would daisy-chain some if statements together if I thought they would work."
Stated.
So, again, what CAN we do to accomplish the same thing as what I was attempting to convey?  Perhaps replacing the () with Null?
But how do I get it to return after setting the nonce to a W[3] value?  THAT'S the problem I'm trying to solve.  I suppose a series of #elif statements would do it.  If none are satisfied, then no nonce would be written.  Then, I just check for an existing one and the rest is finished.  Now, what I COULD do is skip nonce all together and output the nonce directly to the miner.  This way, we don't have to worry about any more instructions, statements or variables.
We get to the result as quickly as possible and cut all the crap in between.  This solves the problem of multiple nonce being wasted and cycles being lost.  Anyone for this approach instead?   Cool  Again, inspired by Diapolo code.
sr. member
Activity: 256
Merit: 250
February 12, 2012, 08:32:10 AM
#36
No, you can't do that with predication.
sr. member
Activity: 378
Merit: 250
February 12, 2012, 05:40:35 AM
#35
prefetch is a noop on GPUs. It is useful in CPU kernels only to prefetch data in CPU cache (same as what _mm_prefetch() does).

Quote
if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0); #endif : ();
(v.s1==g.s1) ? uint nonce = (W[3].s1); #endif : ();
(v.s2==g.s2) ? uint nonce = (W[3].s2); #endif : ();
(v.s3==g.s3) ? uint nonce = (W[3].s3); #endif : ();
...
#endif

This is also not possible, it's an illegal construction that would fail the compilation. (v.s0==g.s0) is evaluated at run-time and the results are unknown to the preprocessor. If you need to terminate execution before write, you can just do that:

if (!nonce) return;

I am not sure it would make much of a difference though.
I knew this was an illegal instruction and even said this in the exact same post you quoted from.  I posted the above statement to ask if there was any way to accomplish the above task legally.  So, it seems that a series of if statements may be the only legal way to accomplish it.

I want it to check if v.s0==g.s0 and, if so, uint nonce = (W[3].s0) then stop checking the series of if statements as none of the rest will do anything but add unnecessary cycles.  If not, do nothing and continue to the next if statement.  At the end, if (exists(nonce))  {output...}.  I would daisy-chain some if statements together if I thought they would work.
if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0), #return : (v.s1==g.s1) ? uint nonce = (W[3].s1), #return : (v.s2==g.s2) ? uint nonce = (W[3].s2), #return : (v.s3==g.s3) ? uint nonce = (W[3].s3) : ();
...
#endif

But I need to figure out how to accomplish the above without having to put [uint nonce =] at the beginning of it.  So, again, what will work?  Just a bunch of branched if statements in a row?
sr. member
Activity: 256
Merit: 250
February 12, 2012, 05:22:54 AM
#34
prefetch is a noop on GPUs. It is useful in CPU kernels only to prefetch data in CPU cache (same as what _mm_prefetch() does).

Quote
if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0); #endif : ();
(v.s1==g.s1) ? uint nonce = (W[3].s1); #endif : ();
(v.s2==g.s2) ? uint nonce = (W[3].s2); #endif : ();
(v.s3==g.s3) ? uint nonce = (W[3].s3); #endif : ();
...
#endif

This is also not possible, it's an illegal construction that would fail the compilation. (v.s0==g.s0) is evaluated at run-time and the results are unknown to the preprocessor. If you need to terminate execution before write, you can just do that:

if (!nonce) return;

I am not sure it would make much of a difference though.

sr. member
Activity: 378
Merit: 250
February 12, 2012, 12:54:45 AM
#33
May I ask you d3m0n1q_733rz, what do you do for a living? I just know a little of NASM and CISCA and this post seems so fancy already, it intrigued me into try to help, but I can't understand almost anything Embarrassed

I'm presently disabled.  And I started programming in assembly as well.  I have a degree in Network Systems Administration and I would like to find something along those lines in work, but so far not able to do so.
OpenCL isn't TOO difficult to learn, but I have trouble with the syntax of some commands like prefetch and the like.  I'm thinking about tossing a prefetch or two into the code to see if it'll increase the speed by much.  In particular, just before sharoundC to prepare K and ConstW if it's not already.  And then another prefetch to call the parts of H when it's needed for Vals.  I don't know if this could shave off a few cycles or not, but I plan to find out.   Wink
newbie
Activity: 12
Merit: 0
February 11, 2012, 02:18:18 PM
#32
May I ask you d3m0n1q_733rz, what do you do for a living? I just know a little of NASM and CISCA and this post seems so fancy already, it intrigued me into try to help, but I can't understand almost anything Embarrassed
sr. member
Activity: 378
Merit: 250
February 11, 2012, 11:02:18 AM
#31
Very slightly faster for my GPU.  This had some tweaking done in the order of addition of P1 and P3.  The result is increase in GPRs, but a decent drop in cycles/ALUs.  Also, an increase in hashing speed!
Code:
// This file is in the public domain

#ifdef VECTORS8
typedef uint8 u;
#elif defined VECTORS4
typedef uint4 u;
#elif defined VECTORS
typedef uint2 u;
#else
typedef uint u;
#endif

__constant uint K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = {
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#ifdef BITALIGN
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
#define rot(x, y) rotate(x, (uint)y)
#endif

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, use bitselect() for Ch.

#ifdef BFI_INT
// Well, slight problem... It turns out BFI_INT isn't actually exposed to
// OpenCL (or CAL IL for that matter) in any way. However, there is
// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
// amd_bytealign, takes the same inputs, and provides the same output.
// We can use that as a placeholder for BFI_INT and have the application
// patch it after compilation.

// This is the BFI_INT function
#define Ch(x, y, z) amd_bytealign(x,y,z)
// Ma can also be implemented in terms of BFI_INT...
#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
#define Ch(x, y, z) bitselect(z,y,x)
#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W[(n)] + s1(n) + ch(n)

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n); 

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search( const uint state0, const uint state1, const uint state2, const uint state3,
const uint state4, const uint state5, const uint state6, const uint state7,
const uint B1, const uint C1, const uint D1,
const uint F1, const uint G1, const uint H1,
#ifndef GOFFSET
const u base,
#endif
const uint W16, const uint W17,
const uint PreVal4, const uint PreVal0,
const uint PreW31, const uint PreW32,
const uint PreW19, const uint PreW20,
__global uint * output)
{

u W[124];
u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
u t1;

W[16] = W16;
W[17] = W17;

#ifdef VECTORS8
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 3) + (u)(0, 1, 2, 3, 4, 5, 6, 7);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};
#endif

#elif defined VECTORS4
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else

//Less dependencies to get both the local id and group id and then add them
W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#endif
#elif defined VECTORS
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 1) + (u)(0, 1,);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
W[18] = PreW20 + (u){r, r ^ 0x2004000U};
#endif
#else
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0));
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
W[18] = PreW20 + r;
#endif
#endif
//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

//Vals[0]=state0;
Vals[0] = PreVal0 + W[3];
Vals[1]=B1;
Vals[2]=C1;
Vals[3]=D1;
//Vals[4]=PreVal4;
Vals[4] = PreVal4 + W[3];
Vals[5]=F1;
Vals[6]=G1;
Vals[7]=H1;

sharoundC(4);
W[19] = PreW19 + W[3];
sharoundC(5);
W[20] = P4C(20) + P1(20);
sharoundC(6);
W[21] = P1(21);
sharoundC(7);
W[22] = P3C(22) + P1(22);
sharoundC(8);
W[23] = W[16] + P1(23);
sharoundC(9);
W[24] = W[17] + P1(24);
sharoundC(10);
W[25] = P3(25) + P1(25);
W[26] = P3(26) + P1(26);
sharoundC(11);
W[27] = P3(27) + P1(27);
W[28] = P3(28) + P1(28);
sharoundC(12);
W[29] = P3(29) + P1(29);
sharoundC(13);
W[30] = P3(30) + P2C(30) + P1(30);
W[31] = PreW31 + (P3(31) + P1(31));
sharoundC(14);
W[32] = PreW32 + (P3(32) + P1(32));
sharoundC(15);
sharound(16);
sharound(17);
sharound(18);
sharound(19);
sharound(20);
sharound(21);
sharound(22);
sharound(23);
sharound(24);
sharound(25);
sharound(26);
sharound(27);
sharound(28);
sharound(29);
sharound(30);
sharound(31);
sharound(32);
sharoundW(33);
sharoundW(34);
sharoundW(35);
sharoundW(36);
sharoundW(37);
sharoundW(38);
sharoundW(39);
sharoundW(40);
sharoundW(41);
sharoundW(42);
sharoundW(43);
sharoundW(44);
sharoundW(45);
sharoundW(46);
sharoundW(47);
sharoundW(48);
sharoundW(49);
sharoundW(50);
sharoundW(51);
sharoundW(52);
sharoundW(53);
sharoundW(54);
sharoundW(55);
sharoundW(56);
sharoundW(57);
sharoundW(58);
sharoundW(59);
sharoundW(60);
sharoundW(61);
sharoundW(62);
sharoundW(63);

W[64]=state0+Vals[0];
W[65]=state1+Vals[1];
W[66]=state2+Vals[2];
W[67]=state3+Vals[3];
W[68]=state4+Vals[4];
W[69]=state5+Vals[5];
W[70]=state6+Vals[6];
W[71]=state7+Vals[7];

const u Temp = (0xb0edbdd0U + K[0]) +  W[64];
Vals[0]=H[0];
Vals[1]=H[1];
Vals[2]=H[2];
Vals[3]=0xa54ff53aU + Temp;
Vals[4]=H[4];
Vals[5]=H[5];
Vals[6]=H[6];
Vals[7]=0x08909ae5U + Temp;

#define P124(n) P2(n) + P1(n) + P4(n)

W[80] = P2(80) + P4(80);
sharound(65);
W[81] = P1C(81) + P2(81) + P4(81);
sharound(66);
W[82] = P124(82);
sharound(67);
W[83] = P124(83);
sharound(68);
W[84] = P124(84);
sharound(69);
W[85] = P124(85);
sharound(70);
W[86] = P4(86) + P3C(86) + P2(86) + P1(86);
sharound(71);
W[87] = P4(87) + P3(87) + P2C(87) + P1(87);
sharoundC(72);
W[88] = P1(88) + P4C(88) + P3(88);
sharoundC(73);
W[89] = P3(89) + P1(89);
sharoundC(74);
W[90] = P3(90) + P1(90);
sharoundC(75);
W[91] = P3(91) + P1(91);
sharoundC(76);
W[92] = P3(92) + P1(92);
sharoundC(77);
W[93] = P3(93) + P1(93);
W[94] = P3(94) + P2C(94) + P1(94);
sharoundC(78);
W[95] = P4C(95) + P3(95) + P2(95) + P1(95);
sharoundC(79);
sharound(80);
sharound(81);
sharound(82);
sharound(83);
sharound(84);
sharound(85);
sharound(86);
sharound(87);
sharound(88);
sharound(89);
sharound(90);
sharound(91);
sharound(92);
sharound(93);
sharound(94);
sharound(95);
sharoundW(96);
sharoundW(97);
sharoundW(98);
sharoundW(99);
sharoundW(100);
sharoundW(101);
sharoundW(102);
sharoundW(103);
sharoundW(104);
sharoundW(105);
sharoundW(106);
sharoundW(107);
sharoundW(108);
sharoundW(109);
sharoundW(110);
sharoundW(111);
sharoundW(112);
sharoundW(113);
sharoundW(114);
sharoundW(115);
sharoundW(116);
sharoundW(117);
sharoundW(118);
sharoundW(119);
sharoundW(120);
sharoundW(121);
sharoundW(122);

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

// uint nonce = 0;
#ifdef VECTORS8
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
nonce = (v.s2==g.s2) ? W[3].s2 : nonce;
nonce = (v.s3==g.s3) ? W[3].s3 : nonce;
nonce = (v.s4==g.s4) ? W[3].s4 : nonce;
nonce = (v.s5==g.s5) ? W[3].s5 : nonce;
nonce = (v.s6==g.s6) ? W[3].s6 : nonce;
nonce = (v.s7==g.s7) ? W[3].s7 : nonce;
#elif defined VECTORS4
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
nonce = (v.s2==g.s2) ? W[3].s2 : nonce;
nonce = (v.s3==g.s3) ? W[3].s3 : nonce;
#elif defined VECTORS
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
#else
uint nonce = (v==g) ? W[3] : 0
#endif
if(nonce!=0)
{
//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
output[WORKSIZE] = nonce;
output[get_local_id(0)] = nonce;
}
}
sr. member
Activity: 378
Merit: 250
February 11, 2012, 09:30:18 AM
#30
So this new kernel should be used with 2.6 or 2.1 sdk?
It's not a new kernel, it's a mod to the existing one.  I'm trying to add Vectors8 support and disabling of global offset for increased output for SDK 2.6 while maintaining the functionality for SDK 2.1.  This method has already been used with Diapolo's GCN miner and has seen some improvement (with his kernel) on VLIW-based cards.  You can see the test I've done with hashing rate in his thread.
The problem I'm running into is that Phatk2 seems to have a difficult time handling 8 vectors at once due to memory constraints which effectively cut the number of hashes in half.  So, I'm hoping to fix this little problem and introduce the ability to disable global offset in order to maximize output.  I also hope to fix the problem of outputting multiple found nonce each round in the case that more than one acceptable share is found while hashing.  But I'm going to need people's help with this project.  It'll be the people's kernel so to speak.  ^_^

Please post a link if you can. I know Diapolo's thread but don't know exactly which page etc.

Thanks !
https://bitcointalk.org/index.php?topic=61406.20

DiakGCN results on ATI HD5450

VECTORS4 WORKSIZE=128 with GOFFSET=false 14.45 Mhash/s
VECTORS4 WORKSIZE=128 without GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 with GOFFSET=false 14.46 Mhash/s
VECTORS8 WORKSIZE=128 without GOFFSET=false 14.47 Mhash/s

VECTORS4 WORKSIZE=64 with GOFFSET=false 14.49 Mhash/s
VECTORS4 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=64 with GOFFSET=false 14.55 Mhash/s
VECTORS8 WORKSIZE=64 without GOFFSET=false 14.50 Mhash/s

VECTORS4 WORKSIZE=32 with GOFFSET=false 14.46 Mhash/s
VECTORS4 WORKSIZE=32 without GOFFSET=false 14.47 Mhash/s
VECTORS8 WORKSIZE=32 with GOFFSET=false 14.50 Mhash/s
VECTORS8 WORKSIZE=32 without GOFFSET=false 14.48 Mhash/s
hero member
Activity: 518
Merit: 500
February 11, 2012, 09:09:56 AM
#29
So this new kernel should be used with 2.6 or 2.1 sdk?
It's not a new kernel, it's a mod to the existing one.  I'm trying to add Vectors8 support and disabling of global offset for increased output for SDK 2.6 while maintaining the functionality for SDK 2.1.  This method has already been used with Diapolo's GCN miner and has seen some improvement (with his kernel) on VLIW-based cards.  You can see the test I've done with hashing rate in his thread.
The problem I'm running into is that Phatk2 seems to have a difficult time handling 8 vectors at once due to memory constraints which effectively cut the number of hashes in half.  So, I'm hoping to fix this little problem and introduce the ability to disable global offset in order to maximize output.  I also hope to fix the problem of outputting multiple found nonce each round in the case that more than one acceptable share is found while hashing.  But I'm going to need people's help with this project.  It'll be the people's kernel so to speak.  ^_^

Please post a link if you can. I know Diapolo's thread but don't know exactly which page etc.

Thanks !
sr. member
Activity: 378
Merit: 250
February 11, 2012, 08:57:47 AM
#28
Hey, anyone know how to end an if statement in the middle of a series of statements once a condition has been filled without having to check a variable?  Here's an example of what I WANT to work.

if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0); #endif : ();
(v.s1==g.s1) ? uint nonce = (W[3].s1); #endif : ();
(v.s2==g.s2) ? uint nonce = (W[3].s2); #endif : ();
(v.s3==g.s3) ? uint nonce = (W[3].s3); #endif : ();
...
#endif

I thought about trying while, but that meant I would be required to use a write to make nonce exist.  That's another instruction I didn't need to include.  In short, I'm avoiding any writes that are not required and, once nonce is found, end the if statements immediately since no other checks will be useful.

Now, I know the above code doesn't work, but could someone tell me how to write it so that it does?  I'll also need to know if the "if (exists(nonce))" statement will work.
sr. member
Activity: 378
Merit: 250
February 11, 2012, 04:27:04 AM
#27
So this new kernel should be used with 2.6 or 2.1 sdk?
It's not a new kernel, it's a mod to the existing one.  I'm trying to add Vectors8 support and disabling of global offset for increased output for SDK 2.6 while maintaining the functionality for SDK 2.1.  This method has already been used with Diapolo's GCN miner and has seen some improvement (with his kernel) on VLIW-based cards.  You can see the test I've done with hashing rate in his thread.
The problem I'm running into is that Phatk2 seems to have a difficult time handling 8 vectors at once due to memory constraints which effectively cut the number of hashes in half.  So, I'm hoping to fix this little problem and introduce the ability to disable global offset in order to maximize output.  I also hope to fix the problem of outputting multiple found nonce each round in the case that more than one acceptable share is found while hashing.  But I'm going to need people's help with this project.  It'll be the people's kernel so to speak.  ^_^
legendary
Activity: 1344
Merit: 1004
February 11, 2012, 03:00:05 AM
#26
So this new kernel should be used with 2.6 or 2.1 sdk?

interested in this too. i'd prefer an improvement to my miners using 2.1 sdk.
full member
Activity: 193
Merit: 100
February 11, 2012, 01:51:26 AM
#25
So this new kernel should be used with 2.6 or 2.1 sdk?
sr. member
Activity: 378
Merit: 250
February 11, 2012, 01:24:06 AM
#24
Actually, I wouldn't mind if DiabloD3 were to help on this project.  The more minds, the better the outcome.
rjk
sr. member
Activity: 448
Merit: 250
1ngldh
February 11, 2012, 12:25:37 AM
#23
Once this is overcome, it should be the fastest kernel available.
Until DiabloD3 steps out of his lair with more voodoo magic. Grin
sr. member
Activity: 378
Merit: 250
February 11, 2012, 12:23:51 AM
#22
Well, we hope to make it that.  But I just rechecked my results since I was working on it in the middle of the night, and it seems it's not that different from the original.  Hoping to fix that once I get GOFFSET working properly with VECTORS8.  But we keep running into the same register spill problem.  Once this is overcome, it should be the fastest kernel available.
legendary
Activity: 1344
Merit: 1004
February 10, 2012, 09:39:59 PM
#21
So I see the thread title got changed to "seeing improvement" meaining its faster than current phatk2. Could you host the files somewhere so I can plug them into phoenix2 and see how much faster? There are so many code changes in this thread I have no clue what to copy paste, so I'd appreciate it if you had it uploaded somewhere Smiley
hero member
Activity: 518
Merit: 500
February 10, 2012, 07:06:57 PM
#20
So how much increase can we expect on a 5870 ?

Is this really the fastest kernel for a 5870 ?

Thanks !
sr. member
Activity: 378
Merit: 250
February 10, 2012, 07:06:25 PM
#19
I am kinda surprised that predication worked better than select(), usually it's just the opposite. Perhaps if you can send me both ISA dumps I can see what can be done to further improve that.

For the second part:

Quote
  u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
   u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) ?  It would certainly reduce calculations a bit.  The only problem I see is Vals[1] and Vals[2] is inside of the parenthesis.  Now, I'm not familiar with the comma symbolization here, but if the parenthesis can be put on the inside next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.

I don't think it's worth trying.


P.S I don't think ALU ops is a good performance metric. Of course that's important, but there are other factors. GPR usage and number of clauses is also very important, so you have to profile the overall result. I've seen many times situations where you have two kernels, one has a bit less ALU ops, other has just one more clause and the second one behaves much worse. Similarily, the situation with GPR usage. I am currently working on a RAR password cracking kernel and that poses some fucking paradoxes. For example I have several kernels, one keeping everything in __private memory with large GPR usage, another one that shifts some to __local memory and a third one that keeps a small lookup table in __global memory. Paradox is that the first one is the slowest, GPR usage is ~90, performance is disgusting. The one that keeps part of the data in __local memory behaves much better, 36 GPRs used, much better occupancy, but performance still not what I expected. The kernel that uses an intermediate __global memory buffer is currently the fastest one, mostly because of the cached global memory with SDK 2.6. It's twice faster than the second one and times faster than the first one. I would never expect that.



I've already tested it via the kernel analyzer and it seems to be the best way to accomplish the task so far.  The analyzer tells cycles, ALUs, GPRs, output, etc.
sr. member
Activity: 256
Merit: 250
February 10, 2012, 09:36:24 AM
#18
I am kinda surprised that predication worked better than select(), usually it's just the opposite. Perhaps if you can send me both ISA dumps I can see what can be done to further improve that.

For the second part:

Quote
  u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
   u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) ?  It would certainly reduce calculations a bit.  The only problem I see is Vals[1] and Vals[2] is inside of the parenthesis.  Now, I'm not familiar with the comma symbolization here, but if the parenthesis can be put on the inside next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.

I don't think it's worth trying.


P.S I don't think ALU ops is a good performance metric. Of course that's important, but there are other factors. GPR usage and number of clauses is also very important, so you have to profile the overall result. I've seen many times situations where you have two kernels, one has a bit less ALU ops, other has just one more clause and the second one behaves much worse. Similarily, the situation with GPR usage. I am currently working on a RAR password cracking kernel and that poses some fucking paradoxes. For example I have several kernels, one keeping everything in __private memory with large GPR usage, another one that shifts some to __local memory and a third one that keeps a small lookup table in __global memory. Paradox is that the first one is the slowest, GPR usage is ~90, performance is disgusting. The one that keeps part of the data in __local memory behaves much better, 36 GPRs used, much better occupancy, but performance still not what I expected. The kernel that uses an intermediate __global memory buffer is currently the fastest one, mostly because of the cached global memory with SDK 2.6. It's twice faster than the second one and times faster than the first one. I would never expect that.


sr. member
Activity: 378
Merit: 250
February 10, 2012, 09:17:01 AM
#17
Next part!

   u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
   u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) ?  It would certainly reduce calculations a bit.  The only problem I see is Vals[1] and Vals[2] is inside of the parenthesis.  Now, I'm not familiar with the comma symbolization here, but if the parenthesis can be put on the inside next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.
I suppose not since Ch depends on that first value.  >_>  And then S1 rotates it.  Darn, not so easy.
sr. member
Activity: 378
Merit: 250
February 10, 2012, 08:49:50 AM
#16
Next part!

   u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
   u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

Can we simplify these since they both contain (Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) ?  It would certainly reduce calculations a bit.  The only problem I see is Vals[1] and Vals[2] is inside of the parenthesis.  Now, I'm not familiar with the comma symbolization here, but if the parenthesis can be put on the inside next to the ch(123), it's as easy as dividing by ((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123)) to remove it and make the math simpler for the GPU.
sr. member
Activity: 378
Merit: 250
February 10, 2012, 08:15:55 AM
#15
Code:
// This file is in the public domain

#ifdef VECTORS8
typedef uint8 u;
#elif defined VECTORS4
typedef uint4 u;
#elif defined VECTORS
typedef uint2 u;
#else
typedef uint u;
#endif

__constant uint K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = {
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#ifdef BITALIGN
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
#define rot(x, y) rotate(x, (uint)y)
#endif

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, use bitselect() for Ch.

#ifdef BFI_INT
// Well, slight problem... It turns out BFI_INT isn't actually exposed to
// OpenCL (or CAL IL for that matter) in any way. However, there is
// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
// amd_bytealign, takes the same inputs, and provides the same output.
// We can use that as a placeholder for BFI_INT and have the application
// patch it after compilation.

// This is the BFI_INT function
#define Ch(x, y, z) amd_bytealign(x,y,z)
// Ma can also be implemented in terms of BFI_INT...
#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
#define Ch(x, y, z) bitselect(z,y,x)
#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W[(n)] + s1(n) + ch(n)

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search( const uint state0, const uint state1, const uint state2, const uint state3,
const uint state4, const uint state5, const uint state6, const uint state7,
const uint B1, const uint C1, const uint D1,
const uint F1, const uint G1, const uint H1,
#ifndef GOFFSET
const u base,
#endif
const uint W16, const uint W17,
const uint PreVal4, const uint PreVal0,
const uint PreW31, const uint PreW32,
const uint PreW19, const uint PreW20,
__global uint * output)
{

u W[124];
u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
u t1;

W[16] = W16;
W[17] = W17;

#ifdef VECTORS8
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 3) + (u)(0, 1, 2, 3, 4, 5, 6, 7);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};
#endif

#elif defined VECTORS4
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else

//Less dependencies to get both the local id and group id and then add them
W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#endif
#elif defined VECTORS
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 1) + (u)(0, 1,);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
W[18] = PreW20 + (u){r, r ^ 0x2004000U};
#endif
#else
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0));
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
W[18] = PreW20 + r;
#endif
#endif
//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

//Vals[0]=state0;
Vals[0] = PreVal0 + W[3];
Vals[1]=B1;
Vals[2]=C1;
Vals[3]=D1;
//Vals[4]=PreVal4;
Vals[4] = PreVal4 + W[3];
Vals[5]=F1;
Vals[6]=G1;
Vals[7]=H1;

sharoundC(4);
W[19] = PreW19 + W[3];
sharoundC(5);
W[20] = P4C(20) + P1(20);
sharoundC(6);
W[21] = P1(21);
sharoundC(7);
W[22] = P3C(22) + P1(22);
sharoundC(8);
W[23] = W[16] + P1(23);
sharoundC(9);
W[24] = W[17] + P1(24);
sharoundC(10);
W[25] = P1(25) + P3(25);
W[26] = P1(26) + P3(26);
sharoundC(11);
W[27] = P1(27) + P3(27);
W[28] = P1(28) + P3(28);
sharoundC(12);
W[29] = P1(29) + P3(29);
sharoundC(13);
W[30] = P1(30) + P2C(30) + P3(30);
W[31] = PreW31 + (P1(31) + P3(31));
sharoundC(14);
W[32] = PreW32 + (P1(32) + P3(32));
sharoundC(15);
sharound(16);
sharound(17);
sharound(18);
sharound(19);
sharound(20);
sharound(21);
sharound(22);
sharound(23);
sharound(24);
sharound(25);
sharound(26);
sharound(27);
sharound(28);
sharound(29);
sharound(30);
sharound(31);
sharound(32);
sharoundW(33);
sharoundW(34);
sharoundW(35);
sharoundW(36);
sharoundW(37);
sharoundW(38);
sharoundW(39);
sharoundW(40);
sharoundW(41);
sharoundW(42);
sharoundW(43);
sharoundW(44);
sharoundW(45);
sharoundW(46);
sharoundW(47);
sharoundW(48);
sharoundW(49);
sharoundW(50);
sharoundW(51);
sharoundW(52);
sharoundW(53);
sharoundW(54);
sharoundW(55);
sharoundW(56);
sharoundW(57);
sharoundW(58);
sharoundW(59);
sharoundW(60);
sharoundW(61);
sharoundW(62);
sharoundW(63);

W[64]=state0+Vals[0];
W[65]=state1+Vals[1];
W[66]=state2+Vals[2];
W[67]=state3+Vals[3];
W[68]=state4+Vals[4];
W[69]=state5+Vals[5];
W[70]=state6+Vals[6];
W[71]=state7+Vals[7];

const u Temp = (0xb0edbdd0U + K[0]) +  W[64];
Vals[0]=H[0];
Vals[1]=H[1];
Vals[2]=H[2];
Vals[3]=0xa54ff53aU + Temp;
Vals[4]=H[4];
Vals[5]=H[5];
Vals[6]=H[6];
Vals[7]=0x08909ae5U + Temp;

#define P124(n) P2(n) + P1(n) + P4(n)

W[80] = P2(80) + P4(80);
sharound(65);
W[81] = P1C(81) + P2(81) + P4(81);
sharound(66);
W[82] = P124(82);
sharound(67);
W[83] = P124(83);
sharound(68);
W[84] = P124(84);
sharound(69);
W[85] = P124(85);
sharound(70);
W[86] = P4(86) + P3C(86) + P2(86) + P1(86);
sharound(71);
W[87] = P4(87) + P3(87) + P2C(87) + P1(87);
sharoundC(72);
W[88] = P1(88) + P4C(88) + P3(88);
sharoundC(73);
W[89] = P3(89) + P1(89);
sharoundC(74);
W[90] = P3(90) + P1(90);
sharoundC(75);
W[91] = P3(91) + P1(91);
sharoundC(76);
W[92] = P3(92) + P1(92);
sharoundC(77);
W[93] = P1(93) + P3(93);
W[94] = P3(94) + P2C(94) + P1(94);
sharoundC(78);
W[95] = P4C(95) + P3(95) + P2(95) + P1(95);
sharoundC(79);
sharound(80);
sharound(81);
sharound(82);
sharound(83);
sharound(84);
sharound(85);
sharound(86);
sharound(87);
sharound(88);
sharound(89);
sharound(90);
sharound(91);
sharound(92);
sharound(93);
sharound(94);
sharound(95);
sharoundW(96);
sharoundW(97);
sharoundW(98);
sharoundW(99);
sharoundW(100);
sharoundW(101);
sharoundW(102);
sharoundW(103);
sharoundW(104);
sharoundW(105);
sharoundW(106);
sharoundW(107);
sharoundW(108);
sharoundW(109);
sharoundW(110);
sharoundW(111);
sharoundW(112);
sharoundW(113);
sharoundW(114);
sharoundW(115);
sharoundW(116);
sharoundW(117);
sharoundW(118);
sharoundW(119);
sharoundW(120);
sharoundW(121);
sharoundW(122);

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

// uint nonce = 0;
#ifdef VECTORS8
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
nonce = (v.s2==g.s2) ? W[3].s2 : nonce;
nonce = (v.s3==g.s3) ? W[3].s3 : nonce;
nonce = (v.s4==g.s4) ? W[3].s4 : nonce;
nonce = (v.s5==g.s5) ? W[3].s5 : nonce;
nonce = (v.s6==g.s6) ? W[3].s6 : nonce;
nonce = (v.s7==g.s7) ? W[3].s7 : nonce;
#elif defined VECTORS4
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
nonce = (v.s2==g.s2) ? W[3].s2 : nonce;
nonce = (v.s3==g.s3) ? W[3].s3 : nonce;
#elif defined VECTORS
uint nonce = (v.s0==g.s0) ? W[3].s0 : 0;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
#else
uint nonce = (v==g) ? W[3] : 0
#endif
if(nonce>0)
{
//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
output[WORKSIZE] = nonce;
output[get_local_id(0)] = nonce;
}
}
Fastest code so far.
Now, is there anything that can be done for multiple found nonce values?  Outputting each one separately?
sr. member
Activity: 378
Merit: 250
February 10, 2012, 08:06:08 AM
#14
The problem I'm seeing, though, is that select causes the most inefficient code.  I'll give this other code a try and see what it comes up with.  But the if statements appear to be best since it checks against what's already in the buffer.  What it looks like is happening here with your new code is that it's causing writes to nonce regardless of whether or not the otherwise if statement is true.  I think that's what is causing more cycles.
-_-  So far, it looks like the if statements are most effective due to the lack of writes to and from vectors and nonce.

I just tested your new method.  It DROPPED 10 ALUs!  Congrats!  The code is now faster.
sr. member
Activity: 256
Merit: 250
February 10, 2012, 05:48:57 AM
#13
Yes, this is without any branching (similar to alternative 2) from my previous post except that I had W[3] wrong.

Basically the best would be to profile both and choose the faster one. With branching and without divergence, you have an additional clause (with divergence the penalty is worse as both "paths" would be serialized). However, without branching you introduce 7 dependent additions (can't pack them in two VLIW bundles as the result of the next addition depends on the previous one). I am not sure which would be faster.

BTW for the scalar case, you don't need that:

Code:
  #else
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v);

as direct comparison might be faster, especially with predication. E.g:

Code:
nonce = (v==g) ? W[3] : 0;

Unfortunately, this is not useful in the vector case. Of course you could try:

Code:
nonce = (v.s0==g.s0) ? W[3].s0 : nonce;
nonce = (v.s1==g.s1) ? W[3].s1 : nonce;
...

But that would generate much more inefficient code than that generated by using select().



Quote
So, will having partial matches in a vector cause for any problems?

The only problem is when you have more than one matching component pairs (v.sX and g.sX). For example v.s0==g.s0 and v.s3==g.s3. The version with branches would eventually have one of the two nonces written correctly in the output buffer (namely W[3].s3), the version with select() would have the wrong nonce written in the output buffer (W[3].s0+W[3].s3).
sr. member
Activity: 378
Merit: 250
February 10, 2012, 05:11:36 AM
#12
I am compiling that with clcc and it builds successfully. It should work when you have VECTORS8 defined because you have:

Code:
#ifdef VECTORS8
typedef uint8 u;

and eq is defined to be of type u.

Quote
So it takes that match and makes it nonce.  However, it continues through the if statements as though it were looking for another match.  So, either A) nonce needs to be increased in size to hold multiple equivalent vectors or B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yes, that's a valid point, but having nonce as a vector means you should be also increasing the output buffer (vector width) times. This in turn means you'd need (vector width) times larger device-host transfers. People with underclocked GPU memory and PCIe extenders won't be very happy about that Smiley

Quote
B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yep, that's the purpose of replacing branches with select()

Quote
And I think you may be misusing select there.  You see, we need to pull apart v and g into it's separate parts before figuring out which parts are equal.  That is, unless we can xor v and g, pull apart v and then write any vector = 0 from the equivalent g vector.

No, you don't need to do that. The result of (v==g) is a vector where each component is 0 if the corresponding v and g components are equal. E.g you have:

v = (uint8)(5,5,5,5,5,5,5,5);
g = (uint8)(1,2,3,4,5,6,7,8);

(v==g) would be (0,0,0,0,1,0,0,0)

This is still not useful as nonce is a scalar value. Then also (I noticed that later and corrected it) nonce should equal the matching vector element from W[3], not v or g.

Thus, this is the most straightforward solution:

eq = select(W[3],(u)0,(v==g))

What's the idea? eq is a vector, same width as W[3], v and g.  


Let W[3] contain (0x10,0x20,0x30,0x40,0x50,0x60,0x70,0x80)

eq would contain (0,0,0,0,0x50,0,0,0)

since we need a scalar nonce, we just sum all the elements of eq and get 0+0+0+0+0x50+0+0+0 = 0x50.


Of course this would break if we have more than one match between v and g components and in that case the nonce would be wrong. The probability for this is low but it could happen. This is the worst case. Overall though, I think performance improvement due to branches elimination is worth the increased percentage of wrong shares. Also, a quick check on host could prevent the miner from submitting the wrong share. And as I said, this should occur rarely.




How about this instead?
#ifdef VECTORS8
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v.s0+v.s1+v.s2+v.s3+v.s4+v.s5+v.s6+v.s7);
#elif defined VECTORS4
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v.s0+v.s1+v.s2+v.s3);
#elif defined VECTORS
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v.s0+v.s1);
   #else
            v = select(W[3],(u)0,(v==g));
            uint nonce = (v);
#endif

So, will having partial matches in a vector cause for any problems?
Also, using select is less efficient than using if statements.  Any other method?  And, as I suggested earlier, you might want the kernel analyzer as it will tell you the expected amount of cycles and the like.  I'm not telling you that you have to, but it really does come in handy.
sr. member
Activity: 256
Merit: 250
February 10, 2012, 03:59:44 AM
#11
I am compiling that with clcc and it builds successfully. It should work when you have VECTORS8 defined because you have:

Code:
#ifdef VECTORS8
typedef uint8 u;

and eq is defined to be of type u.

Quote
So it takes that match and makes it nonce.  However, it continues through the if statements as though it were looking for another match.  So, either A) nonce needs to be increased in size to hold multiple equivalent vectors or B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yes, that's a valid point, but having nonce as a vector means you should be also increasing the output buffer (vector width) times. This in turn means you'd need (vector width) times larger device-host transfers. People with underclocked GPU memory and PCIe extenders won't be very happy about that Smiley

Quote
B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.

Yep, that's the purpose of replacing branches with select()

Quote
And I think you may be misusing select there.  You see, we need to pull apart v and g into it's separate parts before figuring out which parts are equal.  That is, unless we can xor v and g, pull apart v and then write any vector = 0 from the equivalent g vector.

No, you don't need to do that. The result of (v==g) is a vector where each component is 0 if the corresponding v and g components are equal. E.g you have:

v = (uint8)(5,5,5,5,5,5,5,5);
g = (uint8)(1,2,3,4,5,6,7,8);

(v==g) would be (0,0,0,0,1,0,0,0)

This is still not useful as nonce is a scalar value. Then also (I noticed that later and corrected it) nonce should equal the matching vector element from W[3], not v or g.

Thus, this is the most straightforward solution:

eq = select(W[3],(u)0,(v==g))

What's the idea? eq is a vector, same width as W[3], v and g. 


Let W[3] contain (0x10,0x20,0x30,0x40,0x50,0x60,0x70,0x80)

eq would contain (0,0,0,0,0x50,0,0,0)

since we need a scalar nonce, we just sum all the elements of eq and get 0+0+0+0+0x50+0+0+0 = 0x50.


Of course this would break if we have more than one match between v and g components and in that case the nonce would be wrong. The probability for this is low but it could happen. This is the worst case. Overall though, I think performance improvement due to branches elimination is worth the increased percentage of wrong shares. Also, a quick check on host could prevent the miner from submitting the wrong share. And as I said, this should occur rarely.



sr. member
Activity: 378
Merit: 250
February 09, 2012, 10:49:21 PM
#10
OK, try this way:

Code:
#ifdef VECTORS8
        if (any(v==g))
        {
              u eq = select(W[3],(u)0,(v==g));
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4
That still won't work because eq must be of type uint8 to hold the data from v==g.  Why not download the AMD APP KernelAnalyzer and copy-paste this code into it with the compile options of:  -DOUTPUT_SIZE=0x100 -DOUTPUT_MASK=0xFF -DBFI_INT -DBITALIGN -DWORKSIZE=64 -DVECTORS8
That way, you can see where the errors are firsthand.  Now, the problem that I've found with the original code in general is that it makes the assumption that only one pair of vectors are going to match no matter what.  So it takes that match and makes it nonce.  However, it continues through the if statements as though it were looking for another match.  So, either A) nonce needs to be increased in size to hold multiple equivalent vectors or B) the if statements needs to be stopped once a suitable nonce is found otherwise it will only serve to overwrite the first.
I could really use Phateus here to help me rewrite this portion to hold to these constraints.
And I think you may be misusing select there.  You see, we need to pull apart v and g into it's separate parts before figuring out which parts are equal.  That is, unless we can xor v and g, pull apart v and then write any vector = 0 from the equivalent g vector.  However, that doesn't answer what to do in case of multiple nonce.
sr. member
Activity: 256
Merit: 250
February 09, 2012, 08:32:53 AM
#9
OK, try this way:

Code:
#ifdef VECTORS8
        if (any(v==g))
        {
              u eq = select(W[3],(u)0,(v==g));
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4
sr. member
Activity: 378
Merit: 250
February 09, 2012, 07:10:33 AM
#8
I'd advise you to change this:

Code:
#ifdef VECTORS8
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
if (v.s2 == g.s2)
{
nonce = W[3].s2;
}
if (v.s3 == g.s3)
{
nonce = W[3].s3;
}
if (v.s4 == g.s4)
{
nonce = W[3].s4;
}
if (v.s5 == g.s5)
{
nonce = W[3].s5;
}
if (v.s6 == g.s6)
{
nonce = W[3].s6;
}
if (v.s7 == g.s7)
{
nonce = W[3].s7;
}
#elif defined VECTORS4

To this:

Code:
#ifdef VECTORS8
        uint8 eq=(v==g);
        if (any(eq))
        {
              eq = select(g,(uint8)0,eq);
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4

A branchless version would be:

Code:
#ifdef VECTORS8
        uint8 eq;
        eq = select(g,(uint8)0,(v==g));
        nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
#elif defined VECTORS4

It incurs a penalty of several more ALU ops which might be acceptable or might not as compared to the one-branch version and this needs to be profiled.

Anyway, having 8 branches is a bad idea, even without divergence, this introduces at least 8 clauses and clause latency is ~40 cycles on VLIW hardware. Should be better on GCN though.
There's something wrong with the alternatives.  It won't allow uint8 to be used with eq.
Another thing I would like to add is that there is a problem with efficiency of the nonce code.  If nonce is a uint, there's only enough room for a single vector from W[3].  So, if we're running through the entire v==g array and happen upon v.s0==g.s0 right off the bat, it's still going through the rest of the if statements to see if it needs to replace that w[3] value in nonce yet again.  My take on it is to ether stop there or make nonce large enough to hold any multiple nonce values and then introduce them into the next round as the full-sized vectors.  -_-  Too much editing of the kernel to add in efficiency.  I want to leave this up to someone more experienced with programming it if someone would be so willing.

List of things to do:
Increase efficiency of nonce at end.
Find register spill in VECTORS8 processing and put a stop to it.
Verify GOFFSET is working properly.
Figure out why the heck moving around the P1 + P2 + P4 etc. increases ALUs when the preprocessor should be able to optimize the simple addition problems for speed.
Achieve at least a 1/15th speed increase by code optimization.

It's going to take a few days in the least to get the nonce working properly with the GOFFSET option.  Since we've avoided using certain variables, it makes it difficult to weave it in.  The same applies since we're skipping straight to W[3].  I don't want the new code to take up more ALUs though.  It's a tough situation. 
sr. member
Activity: 378
Merit: 250
February 09, 2012, 06:33:08 AM
#7
oh, and I was just going to test this kernel too to see what the hell goffset did for hashes

Yeah interested as well. Can we expect like a massive performance increase from 440 mhash/s for a 5870 to 700 mhash/s Grin ?

Guess not but let us know !
I doubt it'll be a major increase, but it should be mild to modest if I can get everything working properly.  You can test GOFFSET now, but I can't promise that it'll do what it's supposed to without the init file.  Though, the option is there and it does accept the shares that it outputs.
hero member
Activity: 518
Merit: 500
February 09, 2012, 06:19:01 AM
#6
oh, and I was just going to test this kernel too to see what the hell goffset did for hashes

Yeah interested as well. Can we expect like a massive performance increase from 440 mhash/s for a 5870 to 700 mhash/s Grin ?

Guess not but let us know !
sr. member
Activity: 378
Merit: 250
February 09, 2012, 06:17:27 AM
#5
I'll try changing this for VECTORS8, 4 and 2 as there should be similar effects if it does as you say.  I'm also working on getting that darn GOFFSET put into the init file.  Not fun.
sr. member
Activity: 256
Merit: 250
February 09, 2012, 04:17:42 AM
#4
I'd advise you to change this:

Code:
#ifdef VECTORS8
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
if (v.s2 == g.s2)
{
nonce = W[3].s2;
}
if (v.s3 == g.s3)
{
nonce = W[3].s3;
}
if (v.s4 == g.s4)
{
nonce = W[3].s4;
}
if (v.s5 == g.s5)
{
nonce = W[3].s5;
}
if (v.s6 == g.s6)
{
nonce = W[3].s6;
}
if (v.s7 == g.s7)
{
nonce = W[3].s7;
}
#elif defined VECTORS4

To this:

Code:
#ifdef VECTORS8
        uint8 eq=(v==g);
        if (any(eq))
        {
              eq = select(g,(uint8)0,eq);
              nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
        }
#elif defined VECTORS4

A branchless version would be:

Code:
#ifdef VECTORS8
        uint8 eq;
        eq = select(g,(uint8)0,(v==g));
        nonce = (eq.s0+eq.s1+eq.s2+eq.s3+eq.s4+eq.s5+eq.s6+eq.s7);
#elif defined VECTORS4

It incurs a penalty of several more ALU ops which might be acceptable or might not as compared to the one-branch version and this needs to be profiled.

Anyway, having 8 branches is a bad idea, even without divergence, this introduces at least 8 clauses and clause latency is ~40 cycles on VLIW hardware. Should be better on GCN though.
legendary
Activity: 1344
Merit: 1004
February 08, 2012, 09:44:08 PM
#3
oh, and I was just going to test this kernel too to see what the hell goffset did for hashes
sr. member
Activity: 378
Merit: 250
February 08, 2012, 09:43:05 PM
#2
Gah...I need to toss in an init file...crud.  I'll mess with it later.
sr. member
Activity: 378
Merit: 250
February 08, 2012, 08:43:39 PM
#1
Hey everyone!  I've decided to add Diapolo's GOFFSET option to Phatk2.  In some cases, it works better, in others, it doesn't.  Enable it to find out via GOFFSET!
This kernel is for Phoenix 2.0.0.

Currently, I test kernels on an ATI HD5450 graphics card.  If anyone wants to help further my OpenCL expertise, I happily welcome donations to 12HWUSguWXRCQKfkPeJygVR1ex5wbg3hAq.  They'll be put towards a new GPU or two (VLIW and GCN) and programming guides so I can compile optimum code.  I already have the ideas, now I just need to get them out in the open.  I'll also be helping Diapolo test his kernel and donate to help him out.  (A day's worth of hashing each week should be enough)

Code:
// This file is in the public domain

#ifdef VECTORS8
typedef uint8 u;
#elif defined VECTORS4
typedef uint4 u;
#elif defined VECTORS
typedef uint2 u;
#else
typedef uint u;
#endif

__constant uint K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = {
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#ifdef BITALIGN
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
#define rot(x, y) rotate(x, (uint)y)
#endif

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, use bitselect() for Ch.

#ifdef BFI_INT
// Well, slight problem... It turns out BFI_INT isn't actually exposed to
// OpenCL (or CAL IL for that matter) in any way. However, there is
// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
// amd_bytealign, takes the same inputs, and provides the same output.
// We can use that as a placeholder for BFI_INT and have the application
// patch it after compilation.

// This is the BFI_INT function
#define Ch(x, y, z) amd_bytealign(x,y,z)
// Ma can also be implemented in terms of BFI_INT...
#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
#define Ch(x, y, z) bitselect(z,y,x)
#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W[(n)] + s1(n) + ch(n)

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] +  W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n)  maj(n) + s0(n)

#define rotC(x,n) (x<> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))


//Partial W calculations (used for the begining where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n)  W[n-7]
#define P4(n)  W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x)  ConstW[x-7]
#define P4C(x)  ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n);  Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);  

//SHA round without W calc
#define sharound(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n)  Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search( const uint state0, const uint state1, const uint state2, const uint state3,
const uint state4, const uint state5, const uint state6, const uint state7,
const uint B1, const uint C1, const uint D1,
const uint F1, const uint G1, const uint H1,
#ifndef GOFFSET
const u base,
#endif
const uint W16, const uint W17,
const uint PreVal4, const uint PreVal0,
const uint PreW31, const uint PreW32,
const uint PreW19, const uint PreW20,
__global uint * output)
{

u W[124];
u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
u t1;

//Vals[0]=state0;
Vals[1]=B1;
Vals[2]=C1;
Vals[3]=D1;
//Vals[4]=PreVal4;
Vals[5]=F1;
Vals[6]=G1;
Vals[7]=H1;

W[16] = W16;
W[17] = W17;

#ifdef VECTORS8
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 3) + (u)(0, 1, 2, 3, 4, 5, 6, 7);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};
#endif

#elif defined VECTORS4
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else

//Less dependencies to get both the local id and group id and then add them
W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#endif
#elif defined VECTORS
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 1) + (u)(0, 1,);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
W[18] = PreW20 + (u){r, r ^ 0x2004000U};
#endif
#else
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0));
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
W[18] = PreW20 + r;
#endif
#endif
//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

Vals[4] = PreVal4 + W[3];
Vals[0] = PreVal0 + W[3];

sharoundC(4);
W[19] = PreW19 + W[3];
sharoundC(5);
W[20] = P4C(20) + P1(20);
sharoundC(6);
W[21] = P1(21);
sharoundC(7);
W[22] = P3C(22) + P1(22);
sharoundC(8);
W[23] = W[16] + P1(23);
sharoundC(9);
W[24] = W[17] + P1(24);
sharoundC(10);
W[25] = P1(25) + P3(25);
W[26] = P1(26) + P3(26);
sharoundC(11);
W[27] = P1(27) + P3(27);
W[28] = P1(28) + P3(28);
sharoundC(12);
W[29] = P1(29) + P3(29);
sharoundC(13);
W[30] = P1(30) + P2C(30) + P3(30);
W[31] = PreW31 + (P1(31) + P3(31));
sharoundC(14);
W[32] = PreW32 + (P1(32) + P3(32));
sharoundC(15);
sharound(16);
sharound(17);
sharound(18);
sharound(19);
sharound(20);
sharound(21);
sharound(22);
sharound(23);
sharound(24);
sharound(25);
sharound(26);
sharound(27);
sharound(28);
sharound(29);
sharound(30);
sharound(31);
sharound(32);
sharoundW(33);
sharoundW(34);
sharoundW(35);
sharoundW(36);
sharoundW(37);
sharoundW(38);
sharoundW(39);
sharoundW(40);
sharoundW(41);
sharoundW(42);
sharoundW(43);
sharoundW(44);
sharoundW(45);
sharoundW(46);
sharoundW(47);
sharoundW(48);
sharoundW(49);
sharoundW(50);
sharoundW(51);
sharoundW(52);
sharoundW(53);
sharoundW(54);
sharoundW(55);
sharoundW(56);
sharoundW(57);
sharoundW(58);
sharoundW(59);
sharoundW(60);
sharoundW(61);
sharoundW(62);
sharoundW(63);

W[64]=state0+Vals[0];
W[65]=state1+Vals[1];
W[66]=state2+Vals[2];
W[67]=state3+Vals[3];
W[68]=state4+Vals[4];
W[69]=state5+Vals[5];
W[70]=state6+Vals[6];
W[71]=state7+Vals[7];

const u Temp = (0xb0edbdd0U + K[0]) +  W[64];
Vals[0]=H[0];
Vals[1]=H[1];
Vals[2]=H[2];
Vals[3]=0xa54ff53aU + Temp;
Vals[4]=H[4];
Vals[5]=H[5];
Vals[6]=H[6];
Vals[7]=0x08909ae5U + Temp;

#define P124(n) P2(n) + P1(n) + P4(n)

W[80] = P2(80) + P4(80);
sharound(65);
W[81] = P1C(81) + P2(81) + P4(81);
sharound(66);
W[82] = P124(82);
sharound(67);
W[83] = P124(83);
sharound(68);
W[84] = P124(84);
sharound(69);
W[85] = P124(85);
sharound(70);
W[86] = P4(86) + P3C(86) + P2(86) + P1(86);
sharound(71);
W[87] = P4(87) + P3(87) + P2C(87) + P1(87);
sharoundC(72);
W[88] = P1(88) + P4C(88) + P3(88);
sharoundC(73);
W[89] = P3(89) + P1(89);
sharoundC(74);
W[90] = P3(90) + P1(90);
sharoundC(75);
W[91] = P3(91) + P1(91);
sharoundC(76);
W[92] = P3(92) + P1(92);
sharoundC(77);
W[93] = P1(93) + P3(93);
W[94] = P3(94) + P2C(94) + P1(94);
sharoundC(78);
W[95] = P4C(95) + P3(95) + P2(95) + P1(95);
sharoundC(79);
sharound(80);
sharound(81);
sharound(82);
sharound(83);
sharound(84);
sharound(85);
sharound(86);
sharound(87);
sharound(88);
sharound(89);
sharound(90);
sharound(91);
sharound(92);
sharound(93);
sharound(94);
sharound(95);
sharoundW(96);
sharoundW(97);
sharoundW(98);
sharoundW(99);
sharoundW(100);
sharoundW(101);
sharoundW(102);
sharoundW(103);
sharoundW(104);
sharoundW(105);
sharoundW(106);
sharoundW(107);
sharoundW(108);
sharoundW(109);
sharoundW(110);
sharoundW(111);
sharoundW(112);
sharoundW(113);
sharoundW(114);
sharoundW(115);
sharoundW(116);
sharoundW(117);
sharoundW(118);
sharoundW(119);
sharoundW(120);
sharoundW(121);
sharoundW(122);

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

uint nonce = 0;
#ifdef VECTORS8
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
if (v.s2 == g.s2)
{
nonce = W[3].s2;
}
if (v.s3 == g.s3)
{
nonce = W[3].s3;
}
if (v.s4 == g.s4)
{
nonce = W[3].s4;
}
if (v.s5 == g.s5)
{
nonce = W[3].s5;
}
if (v.s6 == g.s6)
{
nonce = W[3].s6;
}
if (v.s7 == g.s7)
{
nonce = W[3].s7;
}
#elif defined VECTORS4
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
if (v.s2 == g.s2)
{
nonce = W[3].s2;
}
if (v.s3 == g.s3)
{
nonce = W[3].s3;
}
#elif defined VECTORS
if (v.s0 == g.s0)
{
nonce = W[3].s0;
}
if (v.s1 == g.s1)
{
nonce = W[3].s1;
}
#else
if (v == g)
{
nonce = W[3];
}
#endif
if(nonce)
{
//Faster to shift the nonce by 2 due to 4-DWORD addressing and does not add more collisions
output[WORKSIZE] = nonce;
output[get_local_id(0)] = nonce;
}
}
As you may be able to tell, I've started adding VECTORS8 to the code as well, but I am having difficulty keeping it from spilling over into memory.  If someone could assist me with this, I would appreciate it.
Jump to: