Phatk2 Mod (Already seeing improvement!)

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Maybe, I'll have to work on it. Right now, it's been on the back burner while I get some medical things sorted out. If you or anyone else wants to work on it, it's up for grabs.

ssateneth

legendary

Activity: 1344

Merit: 1004

Quote from: neo_rage on March 10, 2012, 10:57:56 PM

Any news?

Interested in this too. I would test the kernel, but it really has to be in a format where I can just put the kernel files where the other kernels are and use it: no copy-pasting of code and no special configuring inside the kernel (aside from kernel-specific options like setting GOFFSET to true), because I wouldn't know what to do. This is probably the case for other interested people too: they see you worked on phatk2, see positive results and want to use it, but have no clue how to implement it.

Can we please get a kernel.cl and __init__.py, preferably phoenix 2.0 beta compatible?

neo_rage

full member

Activity: 196

Merit: 100

Any news?

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: Diapolo on February 22, 2012, 10:32:34 AM

By the way, my latest changes to DiaKGCN have introduced the same problem, and VECTORS8 is now unusably slow ^^. Well, some changes have to be reverted, right? ;)

Dia

Depends on what you're trying to do. If it's possible to asynchronously bring the vectors out of global memory, it could be possible to avoid reversions. But I think the reason why you were able to use VECTORS8 in the first place was because you were reusing your variables via += instead of creating new ones.

Diapolo

hero member

Activity: 772

Merit: 500

By the way, my latest changes to DiaKGCN have introduced the same problem, and VECTORS8 is now unusably slow ^^. Well, some changes have to be reverted, right? ;)

Dia

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

I'm really wanting to clean up the code and get rid of some of the multiple large vector variables that remain memory resident. I'll be overwriting them as it progresses through so as to lower the memory footprint and increase available vector sizes without the memory spill.
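To make the "overwrite as it progresses" idea concrete, here is a minimal host-side C sketch (not the actual kernel change, and not tested on a GPU). It assumes the standard SHA-256 message schedule and compares a fully resident W array against a 16-word rolling window that is updated in place with +=. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotr(uint32_t x, unsigned n) { return (x >> n) | (x << (32 - n)); }
/* Standard SHA-256 small sigma functions (P2 and P1 in the kernel's macros). */
static uint32_t sig0(uint32_t x) { return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3); }
static uint32_t sig1(uint32_t x) { return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10); }

/* Reference: all 64 schedule words stay resident. */
static void schedule_full(const uint32_t block[16], uint32_t w[64]) {
    for (int i = 0; i < 16; i++) w[i] = block[i];
    for (int i = 16; i < 64; i++)
        w[i] = sig1(w[i - 2]) + w[i - 7] + sig0(w[i - 15]) + w[i - 16];
}

/* Rolling variant: only 16 words are live. Slot (i & 15) still holds
   W[i-16] when round i starts, so "+=" turns it into W[i] in place. */
static uint32_t schedule_rolling(const uint32_t block[16], int t) {
    uint32_t w[16];
    for (int i = 0; i < 16; i++) w[i] = block[i];
    for (int i = 16; i <= t; i++)
        w[i & 15] += sig1(w[(i - 2) & 15]) + w[(i - 7) & 15] + sig0(w[(i - 15) & 15]);
    return w[t & 15];
}
```

Both functions produce the same W[t] for every t from 0 to 63. Whether the smaller footprint actually stops the spill with VECTORS8 depends on the compiler and on how the rounds are interleaved, so treat this as an illustration of the indexing only.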

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: Diapolo on February 20, 2012, 12:53:30 PM

I think you have got a little bug in your VEC8 code.

This:

Code:

W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};

should be replaced with:

Code:

W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA014000U, r ^ 0xC018000U, r ^ 0xE01C000U};

Dia

Dang it, I fixed that before, I guess I must have reverted it on another foul-up while playing with the code. Thanks for pointing it out. I wondered why my code suddenly didn't work with that option. Either way, I'm still ending up with it spilling over into memory.

Diapolo

hero member

Activity: 772

Merit: 500

I think you have got a little bug in your VEC8 code.

This:

Code:

W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA050000U, r ^ 0xC090000U, r ^ 0xE0D0000U};

should be replaced with:

Code:

W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA014000U, r ^ 0xC018000U, r ^ 0xE01C000U};

Dia
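For anyone wondering where those constants come from, here is a small host-side C sketch under one stated assumption: the lane-0 nonce has its low 3 bits clear, so lane k's nonce is just the lane-0 nonce with k XORed in. The expression for r is linear over XOR, so lane k's value is r ^ ((k << 25) ^ (k << 14)); the (k >> 3) term is zero for k < 8. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t rotl(uint32_t x, unsigned n) { return (x << n) | (x >> (32 - n)); }

/* Same expression the kernel uses for r. */
static uint32_t sigma0(uint32_t x) { return rotl(x, 25) ^ rotl(x, 14) ^ (x >> 3); }

/* XOR mask for lane k (0..7): rot(k,25) ^ rot(k,14); k >> 3 contributes nothing. */
static uint32_t lane_mask(uint32_t k) { return (k << 25) ^ (k << 14); }

/* Checks that the per-lane shortcut matches a full recomputation.
   Assumes base has its low 3 bits clear. */
static void check_lanes(uint32_t base) {
    uint32_t r = sigma0(base);
    for (uint32_t k = 0; k < 8; k++)
        assert(sigma0(base + k) == (r ^ lane_mask(k)));
}
```

With this, lane_mask(5), lane_mask(6) and lane_mask(7) come out as 0xA014000, 0xC018000 and 0xE01C000, which matches the corrected line rather than the 0xA050000 / 0xC090000 / 0xE0D0000 values.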

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Well, big surprise--it didn't work. I made a rookie mistake of using a linear function on a non-linear equation.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

I'm working on another theory right now that could potentially multiply the output hashes without doing very much more work. However, it's just a theory and it'll take some coding to verify. I'll work on it either tonight or later tomorrow. IF this works, you're gonna love it.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: blandead on February 13, 2012, 08:59:57 AM

Quote from: d3m0n1q_733rz on February 13, 2012, 07:24:26 AM

Here's the most recent changed phatk2 kernel. I've decided to directly output any found valid hashes instead of having it write to and then read from nonce. It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.

Stack Overflow can be fixed with a barrier. Try putting this before the output, it only adds one ALU instruction and will prevent any overflows. put as many as you want to narrow down your problem.

barrier(CLK_GLOBAL_MEM_FENCE);

Also can you please replace the useless "WORKSIZE" variable, and just use "get_local_size(0)"

The select() function works right when it's actually a vector type variable, otherwise it will just do "result = c ? b : a" instead of "result[i] = (MSB of c[i] is set) ? b[i] : a[i]"

The only problems I've run into with stack overflows were with the VECTORS8 and GOFFSET=false implementation. And I'm all for cutting out the crap and replacing it with the original variable. We'll just toss in a comment that get_local_size(0) is the same as WORKSIZE after replacing it. It's better than a define in my opinion. I only use define when there's a fairly large string of complex math to be repeated.
Going back to the stack overflow, I'm guessing that it's not able to keep up with multiple nonces being output in a row? *sigh* Of course. Thanks for the tip. Feel free to post any modified sections if you type them up. We'll just toss them in, try them out and make them a part of the kernel if they give better output.

blandead

newbie

Activity: 46

Merit: 0

Quote from: d3m0n1q_733rz on February 13, 2012, 07:24:26 AM

Here's the most recent changed phatk2 kernel. I've decided to directly output any found valid hashes instead of having it write to and then read from nonce. It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.

Stack Overflow can be fixed with a barrier. Try putting this before the output, it only adds one ALU instruction and will prevent any overflows. put as many as you want to narrow down your problem.

barrier(CLK_GLOBAL_MEM_FENCE);

Also can you please replace the useless "WORKSIZE" variable, and just use "get_local_size(0)"

The select() function works right when it's actually a vector type variable, otherwise it will just do "result = c ? b : a" instead of "result[i] = (MSB of c[i] is set) ? b[i] : a[i]"
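The scalar/vector difference in select() is easy to trip over, so here is a rough C emulation of the two rules as the OpenCL specification describes them. The helper names are hypothetical, and this only models a single 32-bit lane on the host; it is not OpenCL code.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar select(a, b, c): any non-zero c picks b. */
static uint32_t select_scalar(uint32_t a, uint32_t b, uint32_t c) {
    return c ? b : a;
}

/* One lane of vector select(a, b, c): only the most significant
   bit of c decides, so c == 1 still picks a. */
static uint32_t select_lane(uint32_t a, uint32_t b, uint32_t c) {
    return (c & 0x80000000u) ? b : a;
}
```

This is why a vector comparison result (all bits set for true) works as the third argument in the vector form, while a plain 1 would not: select_scalar(10, 20, 1) gives 20, but select_lane(10, 20, 1) gives 10.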

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Code:

// This file is in the public domain

#ifdef VECTORS8
typedef uint8 u;
#elif defined VECTORS4
typedef uint4 u;
#elif defined VECTORS
typedef uint2 u;
#else
typedef uint u;
#endif

__constant uint K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

__constant uint ConstW[128] = {
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x80000000U, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000280U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,

0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x80000000U, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000100U,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000
};

__constant uint H[8] = {
0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

#ifdef BITALIGN
#pragma OPENCL EXTENSION cl_amd_media_ops : enable
#define rot(x, y) amd_bitalign(x, x, (uint)(32 - y))
#else
#define rot(x, y) rotate(x, (uint)y)
#endif

// Some AMD devices have the BFI_INT opcode, which behaves exactly like the
// SHA-256 Ch function, but provides it in exactly one instruction. If
// detected, use it for Ch. Otherwise, use bitselect() for Ch.

#ifdef BFI_INT
// Well, slight problem... It turns out BFI_INT isn't actually exposed to
// OpenCL (or CAL IL for that matter) in any way. However, there is
// a similar instruction, BYTE_ALIGN_INT, which is exposed to OpenCL via
// amd_bytealign, takes the same inputs, and provides the same output.
// We can use that as a placeholder for BFI_INT and have the application
// patch it after compilation.

// This is the BFI_INT function
#define Ch(x, y, z) amd_bytealign(x,y,z)
// Ma can also be implemented in terms of BFI_INT...
#define Ma(z, x, y) amd_bytealign(z^x,y,x)
#else
#define Ch(x, y, z) bitselect(z,y,x)
#define Ma(x, y, z) bitselect(x,y,(z^x))
#endif

//Various intermediate calculations for each SHA round
#define s0(n) (S0(Vals[(0 + 128 - (n)) % 8]))
#define S0(n) (rot(n, 30u)^rot(n, 19u)^rot(n,10u))

#define s1(n) (S1(Vals[(4 + 128 - (n)) % 8]))
#define S1(n) (rot(n, 26u)^rot(n, 21u)^rot(n, 7u))

#define ch(n) Ch(Vals[(4 + 128 - (n)) % 8],Vals[(5 + 128 - (n)) % 8],Vals[(6 + 128 - (n)) % 8])
#define maj(n) Ma(Vals[(1 + 128 - (n)) % 8],Vals[(2 + 128 - (n)) % 8],Vals[(0 + 128 - (n)) % 8])

//t1 calc when W is already calculated
#define t1(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] + W[(n)] + s1(n) + ch(n)

//t1 calc which calculates W
#define t1W(n) K[(n) % 64] + Vals[(7 + 128 - (n)) % 8] + W(n) + s1(n) + ch(n)

//Used for constant W Values (the compiler optimizes out zeros)
#define t1C(n) (K[(n) % 64]+ ConstW[(n)]) + Vals[(7 + 128 - (n)) % 8] + s1(n) + ch(n)

//t2 Calc
#define t2(n) maj(n) + s0(n)

#define rotC(x,n) (x<<n | x >> (32-n))

//W calculation used for SHA round
#define W(n) (W[n] = P4(n) + P3(n) + P2(n) + P1(n))

//Partial W calculations (used for the beginning where only some values are nonzero)
#define P1(n) ((rot(W[(n)-2],15u)^rot(W[(n)-2],13u)^((W[(n)-2])>>10U)))
#define P2(n) ((rot(W[(n)-15],25u)^rot(W[(n)-15],14u)^((W[(n)-15])>>3U)))
#define p1(x) ((rot(x,15u)^rot(x,13u)^((x)>>10U)))
#define p2(x) ((rot(x,25u)^rot(x,14u)^((x)>>3U)))
#define P3(n) W[n-7]
#define P4(n) W[n-16]

//Partial Calcs for constant W values
#define P1C(n) ((rotC(ConstW[(n)-2],15)^rotC(ConstW[(n)-2],13)^((ConstW[(n)-2])>>10U)))
#define P2C(n) ((rotC(ConstW[(n)-15],25)^rotC(ConstW[(n)-15],14)^((ConstW[(n)-15])>>3U)))
#define P3C(x) ConstW[x-7]
#define P4C(x) ConstW[x-16]

//SHA round with built in W calc
#define sharoundW(n) Barrier1(n); Vals[(3 + 128 - (n)) % 8] += t1W(n); Vals[(7 + 128 - (n)) % 8] = t1W(n) + t2(n);

//SHA round without W calc
#define sharound(n) Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1(n); Vals[(7 + 128 - (n)) % 8] = t1(n) + t2(n);

//SHA round for constant W values
#define sharoundC(n) Barrier2(n); Vals[(3 + 128 - (n)) % 8] += t1C(n); Vals[(7 + 128 - (n)) % 8] = t1C(n) + t2(n);

//The compiler is stupid... I put this in there only to stop the compiler from (de)optimizing the order
#define Barrier1(n) t1 = t1C((n+1))
#define Barrier2(n) t1 = t1C((n))

__kernel
//removed this to allow detection of invalid work size
//__attribute__((reqd_work_group_size(WORKSIZE, 1, 1)))
void search( const uint state0, const uint state1, const uint state2, const uint state3,
const uint state4, const uint state5, const uint state6, const uint state7,
const uint B1, const uint C1, const uint D1,
const uint F1, const uint G1, const uint H1,
#ifndef GOFFSET
const u base,
#endif
const uint W16, const uint W17,
const uint PreVal4, const uint PreVal0,
const uint PreW31, const uint PreW32,
const uint PreW19, const uint PreW20,
__global uint * output)
{

u W[124];
u Vals[8];

//Dummy Variable to prevent compiler from reordering between rounds
u t1;

W[16] = W16;
W[17] = W17;

#ifdef VECTORS8
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 3) + (u)(0, 1, 2, 3, 4, 5, 6, 7);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 8u + (uint)(get_group_id(0)) * (WORKSIZE * 8u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 3 LSB differ between the nonces, lane k's value is r ^ ((k << 25) ^ (k << 14)), so we can flip those bits in W18 rather than recomputing from W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U, r ^ 0x8010000U, r ^ 0xA014000U, r ^ 0xC018000U, r ^ 0xE01C000U};
#endif

#elif defined VECTORS4
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 2) + (u)(0, 1, 2, 3);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else

//Less dependencies to get both the local id and group id and then add them
W[3] = base + (uint)(get_local_id(0)) * 4u + (uint)(get_group_id(0)) * (WORKSIZE * 4u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
//Since only the 2 LSB is opposite between the nonces, we can save an instruction by flipping the 4 bits in W18 rather than the 1 bit in W3
W[18] = PreW20 + (u){r, r ^ 0x2004000U, r ^ 0x4008000U, r ^ 0x600C000U};
#endif
#elif defined VECTORS
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0) << 1) + (u)(0, 1);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
#else
W[3] = base + (uint)(get_local_id(0)) * 2u + (uint)(get_group_id(0)) * (WORKSIZE * 2u);
uint r = rot(W[3].s0,25u)^rot(W[3].s0,14u)^((W[3].s0)>>3U);
W[18] = PreW20 + (u){r, r ^ 0x2004000U};
#endif
#else
#ifdef GOFFSET
W[3] = ((uint)get_global_id(0));
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
#else
W[3] = base + get_local_id(0) + get_group_id(0) * (WORKSIZE);
u r = rot(W[3],25u)^rot(W[3],14u)^((W[3])>>3U);
W[18] = PreW20 + r;
#endif
#endif
//the order of the W calcs and Rounds is like this because the compiler needs help finding how to order the instructions

//Vals[0]=state0;
Vals[0] = PreVal0 + W[3];
Vals[1]=B1;
Vals[2]=C1;
Vals[3]=D1;
//Vals[4]=PreVal4;
Vals[4] = PreVal4 + W[3];
Vals[5]=F1;
Vals[6]=G1;
Vals[7]=H1;

sharoundC(4);
W[19] = PreW19 + W[3];
sharoundC(5);
W[20] = P4C(20) + P1(20);
sharoundC(6);
W[21] = P1(21);
sharoundC(7);
W[22] = P3C(22) + P1(22);
sharoundC(8);
W[23] = W[16] + P1(23);
sharoundC(9);
W[24] = W[17] + P1(24);
sharoundC(10);
W[25] = P3(25) + P1(25);
W[26] = P3(26) + P1(26);
sharoundC(11);
W[27] = P3(27) + P1(27);
W[28] = P3(28) + P1(28);
sharoundC(12);
W[29] = P3(29) + P1(29);
sharoundC(13);
W[30] = P3(30) + P2C(30) + P1(30);
W[31] = PreW31 + (P3(31) + P1(31));
sharoundC(14);
W[32] = PreW32 + (P3(32) + P1(32));
sharoundC(15);
sharound(16);
sharound(17);
sharound(18);
sharound(19);
sharound(20);
sharound(21);
sharound(22);
sharound(23);
sharound(24);
sharound(25);
sharound(26);
sharound(27);
sharound(28);
sharound(29);
sharound(30);
sharound(31);
sharound(32);
sharoundW(33);
sharoundW(34);
sharoundW(35);
sharoundW(36);
sharoundW(37);
sharoundW(38);
sharoundW(39);
sharoundW(40);
sharoundW(41);
sharoundW(42);
sharoundW(43);
sharoundW(44);
sharoundW(45);
sharoundW(46);
sharoundW(47);
sharoundW(48);
sharoundW(49);
sharoundW(50);
sharoundW(51);
sharoundW(52);
sharoundW(53);
sharoundW(54);
sharoundW(55);
sharoundW(56);
sharoundW(57);
sharoundW(58);
sharoundW(59);
sharoundW(60);
sharoundW(61);
sharoundW(62);
sharoundW(63);

W[64]=state0+Vals[0];
W[65]=state1+Vals[1];
W[66]=state2+Vals[2];
W[67]=state3+Vals[3];
W[68]=state4+Vals[4];
W[69]=state5+Vals[5];
W[70]=state6+Vals[6];
W[71]=state7+Vals[7];

const u Temp = (0xb0edbdd0U + K[0]) + W[64];
Vals[0]=H[0];
Vals[1]=H[1];
Vals[2]=H[2];
Vals[3]=0xa54ff53aU + Temp;
Vals[4]=H[4];
Vals[5]=H[5];
Vals[6]=H[6];
Vals[7]=0x08909ae5U + Temp;

#define P124(n) P2(n) + P1(n) + P4(n)

W[80] = P2(80) + P4(80);
sharound(65);
W[81] = P1C(81) + P2(81) + P4(81);
sharound(66);
W[82] = P124(82);
sharound(67);
W[83] = P124(83);
sharound(68);
W[84] = P124(84);
sharound(69);
W[85] = P124(85);
sharound(70);
W[86] = P4(86) + P3C(86) + P2(86) + P1(86);
sharound(71);
W[87] = P4(87) + P3(87) + P2C(87) + P1(87);
sharoundC(72);
W[88] = P1(88) + P4C(88) + P3(88);
sharoundC(73);
W[89] = P3(89) + P1(89);
sharoundC(74);
W[90] = P3(90) + P1(90);
sharoundC(75);
W[91] = P3(91) + P1(91);
sharoundC(76);
W[92] = P3(92) + P1(92);
sharoundC(77);
W[93] = P3(93) + P1(93);
W[94] = P3(94) + P2C(94) + P1(94);
sharoundC(78);
W[95] = P4C(95) + P3(95) + P2(95) + P1(95);
sharoundC(79);
sharound(80);
sharound(81);
sharound(82);
sharound(83);
sharound(84);
sharound(85);
sharound(86);
sharound(87);
sharound(88);
sharound(89);
sharound(90);
sharound(91);
sharound(92);
sharound(93);
sharound(94);
sharound(95);
sharoundW(96);
sharoundW(97);
sharoundW(98);
sharoundW(99);
sharoundW(100);
sharoundW(101);
sharoundW(102);
sharoundW(103);
sharoundW(104);
sharoundW(105);
sharoundW(106);
sharoundW(107);
sharoundW(108);
sharoundW(109);
sharoundW(110);
sharoundW(111);
sharoundW(112);
sharoundW(113);
sharoundW(114);
sharoundW(115);
sharoundW(116);
sharoundW(117);
sharoundW(118);
sharoundW(119);
sharoundW(120);
sharoundW(121);
sharoundW(122);

u v = W[117] + W[108] + Vals[3] + Vals[7] + P2(124) + P1(124) + Ch((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123),Vals[1],Vals[2]);
u g = -(K[60] + H[7]) - S1((Vals[0] + Vals[4]) + (K[59] + W(123)) + s1(123)+ ch(123));

// uint nonce = 0;
#ifdef VECTORS8
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
if (v.s2==g.s2)
{
output[WORKSIZE] = W[3].s2;
output[get_local_id(0)] = W[3].s2;
}
if (v.s3==g.s3)
{
output[WORKSIZE] = W[3].s3;
output[get_local_id(0)] = W[3].s3;
}
if (v.s4==g.s4)
{
output[WORKSIZE] = W[3].s4;
output[get_local_id(0)] = W[3].s4;
}
if (v.s5==g.s5)
{
output[WORKSIZE] = W[3].s5;
output[get_local_id(0)] = W[3].s5;
}
if (v.s6==g.s6)
{
output[WORKSIZE] = W[3].s6;
output[get_local_id(0)] = W[3].s6;
}
if (v.s7==g.s7)
{
output[WORKSIZE] = W[3].s7;
output[get_local_id(0)] = W[3].s7;
}
#elif defined VECTORS4
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
if (v.s2==g.s2)
{
output[WORKSIZE] = W[3].s2;
output[get_local_id(0)] = W[3].s2;
}
if (v.s3==g.s3)
{
output[WORKSIZE] = W[3].s3;
output[get_local_id(0)] = W[3].s3;
}

#elif defined VECTORS
if (v.s0==g.s0)
{
output[WORKSIZE] = W[3].s0;
output[get_local_id(0)] = W[3].s0;
}
if (v.s1==g.s1)
{
output[WORKSIZE] = W[3].s1;
output[get_local_id(0)] = W[3].s1;
}
#else
if (v==g)
{
output[WORKSIZE] = W[3];
output[get_local_id(0)] = W[3];
}
#endif
}

Here's the most recent changed phatk2 kernel. I've decided to directly output any found valid hashes instead of having it write to and then read from nonce. It works and it should give more accepted hashes (in theory) without dropping any and having the overhead of another variable.
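As a side note on the Ch and Ma macros in the kernel above, here is a small host-side C check that the bitselect forms agree with the textbook SHA-256 Ch and Maj. It assumes the OpenCL definition of bitselect(a, b, c), which takes each bit from b where c is 1 and from a where c is 0. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates OpenCL bitselect(a, b, c) on one 32-bit word. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c) {
    return (a & ~c) | (b & c);
}

/* The kernel's non-BFI_INT definitions. */
static uint32_t ch_kernel(uint32_t x, uint32_t y, uint32_t z) { return bitselect32(z, y, x); }
static uint32_t ma_kernel(uint32_t x, uint32_t y, uint32_t z) { return bitselect32(x, y, z ^ x); }

/* Textbook SHA-256 definitions for comparison. */
static uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z)  { return (x & y) ^ (~x & z); }
static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); }
```

The inputs 0xF0F0F0F0, 0xCCCCCCCC and 0xAAAAAAAA cover all eight bit combinations at once, so comparing the kernel and reference forms on those three words is enough to confirm the identity bit by bit.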

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: gat3way on February 12, 2012, 05:27:03 PM

Nothing. Earlier termination would cost you more than going through all the checks.

You cannot use preprocessor directives for that because v and g are not known at compile time, so forget about #ifdef's, #else's, #define's and so on. I've seen such confusion from people who have been coding mostly in interpreted languages and recently switched to C.

Anyway. If I were to search for improvements in the kernel (assuming I changed vector width to 8), perhaps the final checks are not the right place. If you have a look at the kernel, you'd notice that a lot of code has been "reordered" so that higher ALUPacking is achieved. For example, sometimes several w[X] values are calculated in a row, sometimes it is done with each SHA256 round step. Another thing is the order of operations in the macros: it is not random, I bet whoever coded it has profiled ALUPacking and chosen the best case. However, switching to uint8 would definitely break that. I believe you can get at least 1-2% performance improvement from tighter ALUPacking, which is much more than what you'd get from saving several ALU ops in the final checks.

I already accomplished this partially by changing around the P#(n) values as my most recent change. I shaved off around 10 cycles. But yeah, starting from the bottom and working my way up. The best way seems to be to directly export the nonce to the miner without using the uint nonce. This way, we don't worry about having more than one or checking redundant hashes at the end. Once this problem is solved, I'll work on the bigger fish.

gat3way

sr. member

Activity: 256

Merit: 250

Nothing. Earlier termination would cost you more than going through all the checks.

You cannot use preprocessor directives for that because v and g are not known at compile time, so forget about #ifdef's, #else's, #define's and so on. I've seen such confusion from people who have been coding mostly in interpreted languages and recently switched to C.

Anyway. If I were to search for improvements in the kernel (assuming I changed vector width to 8), perhaps the final checks are not the right place. If you have a look at the kernel, you'd notice that a lot of code has been "reordered" so that higher ALUPacking is achieved. For example, sometimes several w[X] values are calculated in a row, sometimes it is done with each SHA256 round step. Another thing is the order of operations in the macros: it is not random, I bet whoever coded it has profiled ALUPacking and chosen the best case. However, switching to uint8 would definitely break that. I believe you can get at least 1-2% performance improvement from tighter ALUPacking, which is much more than what you'd get from saving several ALU ops in the final checks.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: gat3way on February 12, 2012, 08:32:10 AM

No, you can't do that with predication.

"I would daisy-chain some if statements together if I thought they would work."
Stated.
So, again, what CAN we do to accomplish the same thing as what I was attempting to convey? Perhaps replacing the () with Null?
But how do I get it to return after setting the nonce to a W[3] value? THAT'S the problem I'm trying to solve. I suppose a series of #elif statements would do it. If none are satisfied, then no nonce would be written. Then, I just check for an existing one and the rest is finished. Now, what I COULD do is skip nonce all together and output the nonce directly to the miner. This way, we don't have to worry about any more instructions, statements or variables.
We get to the result as quickly as possible and cut all the crap in between. This solves the problem of multiple nonces being wasted and cycles being lost. Anyone for this approach instead?

Again, inspired by Diapolo code.

gat3way

sr. member

Activity: 256

Merit: 250

No, you can't do that with predication.

d3m0n1q_733rz

sr. member

Activity: 378

Merit: 250

Quote from: gat3way on February 12, 2012, 05:22:54 AM

prefetch is a noop on GPUs. It is useful in CPU kernels only to prefetch data in CPU cache (same as what _mm_prefetch() does).

Quote

if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0); #endif : ();
(v.s1==g.s1) ? uint nonce = (W[3].s1); #endif : ();
(v.s2==g.s2) ? uint nonce = (W[3].s2); #endif : ();
(v.s3==g.s3) ? uint nonce = (W[3].s3); #endif : ();
...
#endif

This is also not possible, it's an illegal construction that would fail the compilation. (v.s0==g.s0) is evaluated at run-time and the results are unknown to the preprocessor. If you need to terminate execution before write, you can just do that:

if (!nonce) return;

I am not sure it would make much of a difference though.

I knew this was an illegal instruction and even said this in the exact same post you quoted from. I posted the above statement to ask if there was any way to accomplish the above task legally. So, it seems that a series of if statements may be the only legal way to accomplish it.

I want it to check if v.s0==g.s0 and, if so, uint nonce = (W[3].s0) then stop checking the series of if statements as none of the rest will do anything but add unnecessary cycles. If not, do nothing and continue to the next if statement. At the end, if (exists(nonce)) {output...}. I would daisy-chain some if statements together if I thought they would work.
if defined VECTORS4
(v.s0==g.s0) ? uint nonce = (W[3].s0), #return : (v.s1==g.s1) ? uint nonce = (W[3].s1), #return : (v.s2==g.s2) ? uint nonce = (W[3].s2), #return : (v.s3==g.s3) ? uint nonce = (W[3].s3) : ();
...
#endif

But I need to figure out how to accomplish the above without having to put [uint nonce =] at the beginning of it. So, again, what will work? Just a bunch of branched if statements in a row?
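For reference, the "stop at the first match" intent can be written legally as a plain else-if chain, with nonce declared once up front. The sketch below is host-side C with arrays standing in for the vector lanes (.s0 to .s3); the function name and signature are hypothetical. It only shows the control flow and says nothing about speed; as noted elsewhere in the thread, early termination on a GPU may well cost more than running every check.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 and stores the first matching lane's nonce in *nonce,
   or returns 0 and leaves *nonce untouched if no lane matches.
   Later lanes are not examined once one has matched. */
static int first_match4(const uint32_t v[4], const uint32_t g[4],
                        const uint32_t w3[4], uint32_t *nonce) {
    if (v[0] == g[0])      *nonce = w3[0];
    else if (v[1] == g[1]) *nonce = w3[1];
    else if (v[2] == g[2]) *nonce = w3[2];
    else if (v[3] == g[3]) *nonce = w3[3];
    else return 0;
    return 1;
}
```

The caller would then write to the output buffer only when the function returns 1. Note that if two lanes match, this version keeps only the first one.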
