[ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 230.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Why not compute two hashes in parallel?

#define SBOX_pipelined(a, b, c, d,a1,b1,c1,d1) { \
uint32_t t,t1; \
t = (a); \
t1= (a1); \
(a) &= (c); \
(a1) &= (c1); \
(a) ^= (d); \
(a1) ^= (d1); \
(c) ^= (b); \
(c1) ^= (b1); \
(c) ^= (a); \
(c1) ^= (a1); \
(d) |= t; \
(d1) |= t1; \
(d) ^= (b); \
(d1) ^= (b1); \
t ^= (c); \
t1 ^= (c1); \
Etc....

Wouldn't that remove the stalls?

Neo.op

sr. member

Activity: 1092

Merit: 254

Quote from: djm34 on July 22, 2014, 12:45:35 PM

Quote from: Neo.op on July 22, 2014, 12:22:03 PM

Quote from: djm34 on July 21, 2014, 01:53:00 PM

ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL

Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so. :'(

Edit: It always gets stuck at z24x13_fugue512_.... What algo is that exactly?

lol, you haven't compiled whirlpool yet, the wait isn't over... :D

(upgrade to CUDA 6.0 or 6.5, it will be faster)

Yep, still stuck, going along slowly. Btw, any chance of a whirlpool-only ccminer, like the fresh one?

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 22, 2014, 12:12:26 PM

In hamsi, the SBOX macro seems to have read-after-write register dependency stalls.

Register Dependency
Read-after-write register dependency
Instruction’s result can be read ~24 cycles later

#define SBOX(a, b, c, d) { \
   uint32_t t; \
   t = (a); \
   (a) &= (c); /* stall */ \
   (a) ^= (d); /* stall */ \
   (c) ^= (b); \
   (c) ^= (a); /* stall */ \
   (d) |= t; \
   (d) ^= (b); /* stall */ \
   t ^= (c); \
   (b) = (d); \
   (d) |= t; /* stall */ \
   (d) ^= (a); /* stall */ \
   (a) &= (b); /* stall */ \
   t ^= (a); /* stall */ \
   (b) ^= (d); \
   (b) ^= t; /* stall */ \
   (a) = (c); \
   (c) = (b); /* stall */ \
   (b) = (d); /* stall */ \
   (d) = SPH_T32(~t); /* stall */ \
   }

I wrote one, it got removed though...

{
uint32_t t;
t = a;
asm("and.b32 %0,%0,%1;" : "+r"(a) : "r"(c));
asm("xor.b32 %0,%0,%1;" : "+r"(a) : "r"(d));
asm("xor.b32 %0,%0,%1;" : "+r"(c) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(c) : "r"(a));
asm( "or.b32 %0,%0,%1;" : "+r"(d) : "r"(t));
asm("xor.b32 %0,%0,%1;" : "+r"(d) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(t) : "r"(c));
b=d;
asm( "or.b32 %0,%0,%1;" : "+r"(d) : "r"(t));
asm("xor.b32 %0,%0,%1;" : "+r"(d) : "r"(a));
asm("and.b32 %0,%0,%1;" : "+r"(a) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(t) : "r"(a));
asm("xor.b32 %0,%0,%1;" : "+r"(b) : "r"(d));
asm("xor.b32 %0,%0,%1;" : "+r"(b) : "r"(t));
a=c;
c=b;
b=d;
asm("not.b32 %0,%1;" : "=r"(d) : "r"(t));
//asm("xor.b32 %0,%0,0xFFFFFFFF;" : "+r"(d));
}

I assumed that using several asm statements is the same as using one statement with several lines
(given that I would need to declare an additional temp variable).

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: Neo.op on July 22, 2014, 12:22:03 PM

Quote from: djm34 on July 21, 2014, 01:53:00 PM

ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL

Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so. :'(

Edit: It always gets stuck at z24x13_fugue512_.... What algo is that exactly?

lol, you haven't compiled whirlpool yet, the wait isn't over... :D

(upgrade to CUDA 6.0 or 6.5, it will be faster)

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 22, 2014, 11:56:20 AM

Quote from: djm34 on July 22, 2014, 11:37:12 AM

djm34:
if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...

Compile a new version with my changes, disassemble and you will see.

Did you try to tweak the register count? Fixed to 80 for all compute versions seems a bit strange.

Fatbin ptx code:
================
arch = sm_35
code version = [3,2]
producer = cuda
host = windows
compile_size = 32bit
compressed
identifier = C:/CCMiner/nvminer/x13/cuda_shabal512.cu
ptxasOptions = -v -abi=no -v -maxrregcount=80

You don't really need to disassemble; PTX files are written during compilation for each kernel...

Regarding the maxrregcount: well, it is a maximum, so if a kernel doesn't require that many registers it will use fewer...
(and Christian did it :D)
I played with it a bit; the problem is that the big kernels don't like it that much.
With the current maxrregcount, the occupancy is around 50% for most of the kernels, and decreasing it doesn't really increase the performance.
Actually, it works only for whirlpool, where using 64 registers gives somewhat better performance (but I get lots of spilled bytes...).

Regarding Shabal, it uses 66 registers for an occupancy of 37.5%. I guess I could decrease the register count a bit, but as I said, the computing time spent on shabal (I just checked) is only 1.4% on gtx750ti and 1.% on 780ti (luffa: 3.1% and 3.9%). Actually, shabal is already one of the fastest algorithms of the bunch...

Neo.op

sr. member

Activity: 1092

Merit: 254

Quote from: djm34 on July 21, 2014, 01:53:00 PM

ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL

Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so. :'(

Edit: It always gets stuck at z24x13_fugue512_.... What algo is that exactly?

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

In hamsi, the SBOX macro seems to have read-after-write register dependency stalls.

Register Dependency
Read-after-write register dependency
Instruction’s result can be read ~24 cycles later

#define SBOX(a, b, c, d) { \
   uint32_t t; \
   t = (a); \
   (a) &= (c); /* stall */ \
   (a) ^= (d); /* stall */ \
   (c) ^= (b); \
   (c) ^= (a); /* stall */ \
   (d) |= t; \
   (d) ^= (b); /* stall */ \
   t ^= (c); \
   (b) = (d); \
   (d) |= t; /* stall */ \
   (d) ^= (a); /* stall */ \
   (a) &= (b); /* stall */ \
   t ^= (a); /* stall */ \
   (b) ^= (d); \
   (b) ^= t; /* stall */ \
   (a) = (c); \
   (c) = (b); /* stall */ \
   (b) = (d); /* stall */ \
   (d) = SPH_T32(~t); /* stall */ \
   }

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: djm34 on July 22, 2014, 11:37:12 AM

djm34:
if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...

Compile a new version with my changes, disassemble and you will see.

Did you try to tweak the register count? Fixed to 80 for all compute versions seems a bit strange.

Fatbin ptx code:
================
arch = sm_35
code version = [3,2]
producer = cuda
host = windows
compile_size = 32bit
compressed
identifier = C:/CCMiner/nvminer/x13/cuda_shabal512.cu
ptxasOptions = -v -abi=no -v -maxrregcount=80

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 22, 2014, 11:34:11 AM

cuda_x11_luffa512.cu:

#define MIXWORD(a0,a4)\
   a4 ^= a0;\
   a0 = (a0<<2) | (a0>>(30));\
   a0 ^= a4;\
   a4 = (a4<<14) | (a4>>(18));\
   a4 ^= a0;\
   a0 = (a0<<10) | (a0>>(22));\
   a0 ^= a4;\
   a4 = (a4<<1) | (a4>>(31));

-------->

   a4 ^= a0;\
   a0 = SPH_ROTL32(a0, 2);\
   a0 ^= a4;\
   a4 = SPH_ROTL32(a4, 14);\
   a4 ^= a0;\
   a0 = SPH_ROTL32(a0, 10);\
   a0 ^= a4;\
   a4 = SPH_ROTL32(a4, 1);



cuda_x11_cubehash512.cu:

#define ROTATEUPWARDS7(a) (((a) << 7) | ((a) >> 25))
#define ROTATEUPWARDS11(a) (((a) << 11) | ((a) >> 21))

------->

#define ROTATEUPWARDS7(a) (SPH_ROTL32(a, 7))
#define ROTATEUPWARDS11(a) (SPH_ROTL32(a, 11))

etc..

Those I tried... no difference (unless they are in a big loop, it doesn't make much difference).

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 22, 2014, 11:12:03 AM

djm34:

ccminer / x13 / cuda_shabal512.cu:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) { \
      xa0 = T32((xa0 \
         ^ (((xa1 << 15) | (xa1 >> 17)) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(((xb0 << 1) | (xb0 >> 31)) ^ xa0)); \
   }

Rewrite to:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) { \
      xa0 = T32((xa0 \
         ^ (SPH_ROTL32(xa1, 15) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(SPH_ROTL32(xb0, 1) ^ xa0)); \
   }

I have disassembled the latest binary; the funnel shift is still not used in many of the algorithms (x11/x13 etc.). I am too lazy to make a build, and I don't have a Maxwell card here to test.

(cuda_shabal512.cu: compute 3_5)

.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

I tried on some algos, but it doesn't make much difference.
Mostly because ccminer doesn't spend a lot of time on these algos.
Shabal represents less than 5% of the overall time.

Things which need improvement:
on 750ti: echo, groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%, 12.5%, 12.1%, 7% respectively), whirlpool only 6.9%

(numbers are from an unreleased Xxx algo :D)

So, sure, one can certainly gain a little on luffa and shabal, but it won't improve the overall performance of the algo
(I tried on luffa, actually...).
The card isn't bottlenecked by computing time (I mean calculation) but by reading those giant lookup tables (actually, I am wondering if it wouldn't be faster to replace them with the original calculation...).

if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

cuda_x11_luffa512.cu:

#define MIXWORD(a0,a4)\
a4 ^= a0;\
a0 = (a0<<2) | (a0>>(30));\
a0 ^= a4;\
a4 = (a4<<14) | (a4>>(18));\
a4 ^= a0;\
a0 = (a0<<10) | (a0>>(22));\
a0 ^= a4;\
a4 = (a4<<1) | (a4>>(31));

-------->

a4 ^= a0;\
a0 = SPH_ROTL32(a0, 2);\
a0 ^= a4;\
a4 = SPH_ROTL32(a4, 14);\
a4 ^= a0;\
a0 = SPH_ROTL32(a0, 10);\
a0 ^= a4;\
a4 = SPH_ROTL32(a4, 1);

cuda_x11_cubehash512.cu:

#define ROTATEUPWARDS7(a) (((a) << 7) | ((a) >> 25))
#define ROTATEUPWARDS11(a) (((a) << 11) | ((a) >> 21))

------->

#define ROTATEUPWARDS7(a) (SPH_ROTL32(a, 7))
#define ROTATEUPWARDS11(a) (SPH_ROTL32(a, 11))

etc..

opus.224

full member

Activity: 247

Merit: 100

Quote from: exemplaar on July 22, 2014, 08:04:15 AM

Quote from: ltcnim on July 22, 2014, 07:27:22 AM

Quote from: ltcnim on July 21, 2014, 06:41:33 PM

Does anyone mine JPC @ dwarfpool and get a lot of "booo's" ? Since three days or so (since the fork), I only get like 88-90% good shares shown in the console (ubuntu 14.04 x64). But the admin of dwarfpool checked my shares, and they are up to 100% accepted. Why does it show only ~90% good shares on my side, but on the pool everything I send is accepted? Never had that problem with JPC before.

Hi.
Mining there with 3x750ti, no booo's at all in my case. Using official ccminer v1.2, getting 5.2-5.5 mhs per card.

Same here! More than 10% of booos. Normally it was 0.5-1.0%.
Running 6x750Ti, winXP_32.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

cuda_x11_luffa512.cu:

#define TWEAK(a0,a1,a2,a3,j)\
a0 = (a0<<(j))|(a0>>(32-j));\
a1 = (a1<<(j))|(a1>>(32-j));\
a2 = (a2<<(j))|(a2>>(32-j));\
a3 = (a3<<(j))|(a3>>(32-j));

------->

#define TWEAK(a0,a1,a2,a3,j)\
a0 = SPH_ROTL32(a0, j);\
a1 = SPH_ROTL32(a1, j);\
a2 = SPH_ROTL32(a2, j);\
a3 = SPH_ROTL32(a3, j);

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

djm34:

ccminer / x13 / cuda_shabal512.cu:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) { \
      xa0 = T32((xa0 \
         ^ (((xa1 << 15) | (xa1 >> 17)) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(((xb0 << 1) | (xb0 >> 31)) ^ xa0)); \
   }

Rewrite to:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm) { \
      xa0 = T32((xa0 \
         ^ (SPH_ROTL32(xa1, 15) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(SPH_ROTL32(xb0, 1) ^ xa0)); \
   }

I have disassembled the latest binary; the funnel shift is still not used in many of the algorithms (x11/x13 etc.). I am too lazy to make a build, and I don't have a Maxwell card here to test.

(cuda_shabal512.cu: compute 3_5)

.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

can be reduced to one instruction.

tarzanbigcity

sr. member

Activity: 602

Merit: 250

Any plans to work the split miner layout into the next few versions of ccminer or nvminer?

cayars

full member

Activity: 168

Merit: 100

Quote from: antonio8 on July 22, 2014, 09:58:47 AM

Quote from: cayars on July 22, 2014, 09:18:39 AM

I uploaded a new copy. Give it another shot.

Let me know the outcome.

Carlo

Is this with whirlcoin also, like djm34's?

Just curious before downloading.

Yes and the description also gives hash rates for each algo.

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: thefrog on July 22, 2014, 08:19:04 AM

Quote from: tarzanbigcity on July 21, 2014, 10:56:42 AM

Quote from: thefrog on July 21, 2014, 04:30:52 AM

Quote from: tarzanbigcity on July 19, 2014, 05:45:57 PM

I just wanted to say thanks for all the solid work, guys, on the new versions of ccminer, nvminer and that profit calculator. I just completed my AMD to nvidia conversion and wanted to share my progress. Many headaches, but it's now rocking and rolling. 900+ pages and going strong!

https://i.imgur.com/WyZVcYi.jpg

Nice,
I am also replacing my 280x with 750ti (40 to 65), will save like 1/2 of the power costs.

Here is a recommendation. Don't pick up any used cards on ebay. I got sent some cards that were problematic and they all came off ebay. Luckily I was able to swap them out for new ones due to Amazon's lenient returns policy.

Thanks for the recommendation.
I only use new cards with warranty. I had a series of 280x from Sapphire where the capacitors just blew after 2 weeks. It was no issue getting them refunded.

lol, buying used cards, you probably get cards which were used 24/7 non-stop by miners :D

antonio8

legendary

Activity: 1400

Merit: 1000

Quote from: cayars on July 22, 2014, 09:18:39 AM

I uploaded a new copy. Give it another shot.

Let me know the outcome.

Carlo

Is this with whirlcoin also, like djm34's?

Just curious before downloading.

crz

member

Activity: 116

Merit: 10

Quote from: cayars on July 22, 2014, 09:18:39 AM

I uploaded a new copy. Give it another shot.

Let me know the outcome.

Carlo

Working perfect, thanks!

cayars

full member

Activity: 168

Merit: 100

I uploaded a new copy. Give it another shot.

Let me know the outcome.

Carlo

Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 230. (Read 3426989 times)