Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 230. (Read 3426989 times)

sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Why not compute 2 hashes in parallel?

#define SBOX_pipelined(a, b, c, d,a1,b1,c1,d1) { \
        uint32_t t,t1; \
        t = (a); \
        t1= (a1); \
        (a) &= (c); \
        (a1) &= (c1); \
        (a) ^= (d); \
        (a1) ^= (d1); \
        (c) ^= (b); \
        (c1) ^= (b1); \
        (c) ^= (a); \
        (c1) ^= (a1); \
        (d) |= t; \
        (d1) |= t1; \
        (d) ^= (b); \
        (d1) ^= (b1); \
        t ^= (c); \
        t1 ^= (c1); \
 Etc....

Wouldn't it remove the stalls?
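
A minimal sketch of the interleaving idea in plain C, assuming the full SBOX body quoted elsewhere in this thread and assuming that SPH_T32 simply masks to 32 bits as in sphlib. The names sbox_one, sbox_two and sbox_selfcheck are made up for illustration. Every statement of the first lane is followed by the same statement of the second lane, so neighbouring statements never depend on each other. Whether this removes stalls on a real GPU depends on what the compiler and the warp scheduler already do, and it roughly doubles the live registers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: SPH_T32 just truncates to 32 bits, as in sphlib. */
#define SPH_T32(x) ((uint32_t)((x) & 0xFFFFFFFFu))

/* Reference: the single-lane SBOX, written as a function. */
static void sbox_one(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    uint32_t t = *a;
    *a &= *c;  *a ^= *d;
    *c ^= *b;  *c ^= *a;
    *d |= t;   *d ^= *b;
    t ^= *c;   *b = *d;
    *d |= t;   *d ^= *a;
    *a &= *b;  t ^= *a;
    *b ^= *d;  *b ^= t;
    *a = *c;   *c = *b;
    *b = *d;   *d = SPH_T32(~t);
}

/* Hypothetical two-lane version: s0 and s1 are independent states and
   their statements are interleaved one by one. */
static void sbox_two(uint32_t s0[4], uint32_t s1[4])
{
    uint32_t a = s0[0], b = s0[1], c = s0[2], d = s0[3];
    uint32_t a1 = s1[0], b1 = s1[1], c1 = s1[2], d1 = s1[3];
    uint32_t t = a, t1 = a1;
    a &= c;   a1 &= c1;
    a ^= d;   a1 ^= d1;
    c ^= b;   c1 ^= b1;
    c ^= a;   c1 ^= a1;
    d |= t;   d1 |= t1;
    d ^= b;   d1 ^= b1;
    t ^= c;   t1 ^= c1;
    b = d;    b1 = d1;
    d |= t;   d1 |= t1;
    d ^= a;   d1 ^= a1;
    a &= b;   a1 &= b1;
    t ^= a;   t1 ^= a1;
    b ^= d;   b1 ^= d1;
    b ^= t;   b1 ^= t1;
    a = c;    a1 = c1;
    c = b;    c1 = b1;
    b = d;    b1 = d1;
    d = SPH_T32(~t);  d1 = SPH_T32(~t1);
    s0[0] = a;  s0[1] = b;  s0[2] = c;  s0[3] = d;
    s1[0] = a1; s1[1] = b1; s1[2] = c1; s1[3] = d1;
}

/* Checks that the interleaved version matches two single-lane calls. */
static void sbox_selfcheck(void)
{
    uint32_t p[4] = {0x01234567u, 0x89abcdefu, 0xdeadbeefu, 0x0badf00du};
    uint32_t q[4] = {0xffffffffu, 0x00000000u, 0x55555555u, 0xaaaaaaaau};
    uint32_t r[4], s[4];
    for (int i = 0; i < 4; i++) { r[i] = p[i]; s[i] = q[i]; }
    sbox_one(&p[0], &p[1], &p[2], &p[3]);
    sbox_one(&q[0], &q[1], &q[2], &q[3]);
    sbox_two(r, s);
    for (int i = 0; i < 4; i++) assert(p[i] == r[i] && q[i] == s[i]);
}
```

The check only shows that interleaving keeps the results identical; any speed difference would have to be measured on the card.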
     
sr. member
Activity: 1092
Merit: 254
ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL
Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so.  Cry

Edit: It always gets stuck at z24x13_fugue512_.... what algo is that exactly?
lol, you haven't compiled whirlpool yet, the wait isn't over...  Grin
(upgrade to CUDA 6.0 or 6.5, it will be faster)
Yep, still stuck, going along slowly. Btw, any chance of a whirlpool-only ccminer like the fresh one?
legendary
Activity: 1400
Merit: 1050
In the hamsi and the SBOX macro there seems to be Read-after-write register dependency stalls.


Register Dependency
Read-after-write register dependency
Instruction’s result can be read ~24 cycles later

#define SBOX(a, b, c, d) { \
        uint32_t t; \
        t = (a); \
        (a) &= (c);           /* stall */ \
        (a) ^= (d);           /* stall */ \
        (c) ^= (b); \
        (c) ^= (a);           /* stall */ \
        (d) |= t; \
        (d) ^= (b);           /* stall */ \
        t ^= (c); \
        (b) = (d); \
        (d) |= t;             /* stall */ \
        (d) ^= (a);           /* stall */ \
        (a) &= (b);           /* stall */ \
        t ^= (a);             /* stall */ \
        (b) ^= (d); \
        (b) ^= t;             /* stall */ \
        (a) = (c); \
        (c) = (b);            /* stall */ \
        (b) = (d);            /* stall */ \
        (d) = SPH_T32(~t);    /* stall */ \
    }

I wrote one, it got removed though...

{
uint32_t t;
t = a;
asm("and.b32 %0,%0,%1;" : "+r"(a) : "r"(c));
asm("xor.b32 %0,%0,%1;" : "+r"(a) : "r"(d));
asm("xor.b32 %0,%0,%1;" : "+r"(c) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(c) : "r"(a));
asm( "or.b32 %0,%0,%1;" : "+r"(d) : "r"(t));
asm("xor.b32 %0,%0,%1;" : "+r"(d) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(t) : "r"(c));
b=d;
asm( "or.b32 %0,%0,%1;" : "+r"(d) : "r"(t));
asm("xor.b32 %0,%0,%1;" : "+r"(d) : "r"(a));
asm("and.b32 %0,%0,%1;" : "+r"(a) : "r"(b));
asm("xor.b32 %0,%0,%1;" : "+r"(t) : "r"(a));
asm("xor.b32 %0,%0,%1;" : "+r"(b) : "r"(d));
asm("xor.b32 %0,%0,%1;" : "+r"(b) : "r"(t));
a=c;
c=b;
b=d;
asm("not.b32 %0,%1;" : "=r"(d) : "r"(t));
//asm("xor.b32 %0,%0,0xFFFFFFFF;" : "+r"(d));
}

I assumed that using several asm statements is the same as using one statement with several lines
(given that I would need to declare additional temp variables).
legendary
Activity: 1400
Merit: 1050
ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL
Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so.  Cry

Edit: It always gets stuck at z24x13_fugue512_.... what algo is that exactly?
lol, you haven't compiled whirlpool yet, the wait isn't over...  Grin
(upgrade to CUDA 6.0 or 6.5, it will be faster)
legendary
Activity: 1400
Merit: 1050
djm34:
if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...

Compile a new version with my changes, disassemble and you will see.



Did you try to tweak the register count? Fixed to 80 for all compute versions seems a bit strange.

Fatbin ptx code:
================
arch = sm_35
code version = [3,2]
producer = cuda
host = windows
compile_size = 32bit
compressed
identifier = C:/CCMiner/nvminer/x13/cuda_shabal512.cu
ptxasOptions = -v -abi=no -v -maxrregcount=80
You don't really need to disassemble, ptx files are written during compilation for each kernel...

Regarding the maxregcount, well, it is a maximum: if a kernel doesn't require that many registers, it will use fewer...
(and Christian did it  Grin)
I played a bit with it; the problem is that the big kernels don't like it that much.
With the current maxregcount, the occupancy is around 50% for most of the kernels, and decreasing it doesn't really increase the performance.
Actually it works only for whirlpool, where using 64 registers gives somewhat better performance (but I get lots of spilled bytes...)

Regarding Shabal, it uses 66 registers for an occupancy of 37.5%. I guess I could decrease the register count a bit, but as I said, the computing time spent on shabal, I just checked, is only 1.4% on gtx750ti and 1.% on 780ti (luffa 3.1% and 3.9%); actually shabal is already one of the fastest algorithms of the bunch...
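
The link between register count and occupancy can be sketched with a small calculation. This is only a rough model under stated assumptions: the per-SM limits below are the sm_35 (Kepler) figures as I recall them from NVIDIA's occupancy calculator, registers are treated as the only limiting resource, and the 256-thread block size is a guess that is not stated in the thread. The function names are made up for illustration.

```c
#include <assert.h>

/* Assumed per-SM limits for an sm_35 part; check them for your own card. */
enum {
    REGS_PER_SM       = 65536,
    MAX_WARPS_PER_SM  = 64,
    MAX_BLOCKS_PER_SM = 16,
    WARP_SIZE         = 32,
    REG_ALLOC_UNIT    = 256, /* registers are handed out per warp in units of 256 */
    WARP_ALLOC_GRAN   = 4    /* warps are counted in multiples of 4 */
};

/* Resident warps per SM when registers are the only limiting resource. */
static int active_warps(int regs_per_thread, int threads_per_block)
{
    assert(regs_per_thread > 0 && threads_per_block > 0);
    int warps_per_block = (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;

    /* Round the per-warp register need up to the allocation unit. */
    int regs_per_warp = regs_per_thread * WARP_SIZE;
    regs_per_warp = (regs_per_warp + REG_ALLOC_UNIT - 1) / REG_ALLOC_UNIT * REG_ALLOC_UNIT;

    /* Warps that fit in the register file, rounded down to the granularity. */
    int warps_by_regs = REGS_PER_SM / regs_per_warp;
    warps_by_regs -= warps_by_regs % WARP_ALLOC_GRAN;

    /* Only whole blocks can be resident. */
    int blocks = warps_by_regs / warps_per_block;
    int blocks_by_warps = MAX_WARPS_PER_SM / warps_per_block;
    if (blocks > blocks_by_warps) blocks = blocks_by_warps;
    if (blocks > MAX_BLOCKS_PER_SM) blocks = MAX_BLOCKS_PER_SM;
    return blocks * warps_per_block;
}

/* Occupancy in tenths of a percent, so 375 stands for 37.5%. */
static int occupancy_permille(int regs_per_thread, int threads_per_block)
{
    return active_warps(regs_per_thread, threads_per_block) * 1000 / MAX_WARPS_PER_SM;
}
```

Under these assumptions, 66 registers with 256-thread blocks gives 24 of 64 warps, which is the 37.5% quoted for Shabal, and 64 registers gives 32 warps, i.e. 50%. A kernel that really used all 80 registers would also land at 37.5%.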


sr. member
Activity: 1092
Merit: 254
ccminer with whirlcoin support: https://github.com/djm34/ccminer

donation FrsvZzCqxkhQxfua31BggEeTdTXUcpy2JL
Wondering how long this takes to compile? If I remember correctly, last time around things took 2 hrs or so.  Cry

Edit: It always gets stuck at z24x13_fugue512_.... what algo is that exactly?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
In the hamsi and the SBOX macro there seems to be Read-after-write register dependency stalls.


Register Dependency
Read-after-write register dependency
Instruction’s result can be read ~24 cycles later

#define SBOX(a, b, c, d) { \
        uint32_t t; \
        t = (a); \
        (a) &= (c);           /* stall */ \
        (a) ^= (d);           /* stall */ \
        (c) ^= (b); \
        (c) ^= (a);           /* stall */ \
        (d) |= t; \
        (d) ^= (b);           /* stall */ \
        t ^= (c); \
        (b) = (d); \
        (d) |= t;             /* stall */ \
        (d) ^= (a);           /* stall */ \
        (a) &= (b);           /* stall */ \
        t ^= (a);             /* stall */ \
        (b) ^= (d); \
        (b) ^= t;             /* stall */ \
        (a) = (c); \
        (c) = (b);            /* stall */ \
        (b) = (d);            /* stall */ \
        (d) = SPH_T32(~t);    /* stall */ \
    }
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
djm34:
if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...

Compile a new version with my changes, disassemble and you will see.



Did you try to tweak the register count? Fixed to 80 for all compute versions seems a bit strange.

Fatbin ptx code:
================
arch = sm_35
code version = [3,2]
producer = cuda
host = windows
compile_size = 32bit
compressed
identifier = C:/CCMiner/nvminer/x13/cuda_shabal512.cu
ptxasOptions = -v -abi=no -v -maxrregcount=80
legendary
Activity: 1400
Merit: 1050
cuda_x11_luffa512.cu:

#define MIXWORD(a0,a4)\
    a4 ^= a0;\
    a0  = (a0<<2) | (a0>>(30));\
    a0 ^= a4;\
    a4  = (a4<<14) | (a4>>(18));\
    a4 ^= a0;\
    a0  = (a0<<10) | (a0>>(22));\
    a0 ^= a4;\
    a4  = (a4<<1) | (a4>>(31));  

-------->

#define MIXWORD(a0,a4)\
    a4 ^= a0;\
    a0  = SPH_ROTL32(a0, 2);\
    a0 ^= a4;\
    a4  = SPH_ROTL32(a4, 14);\
    a4 ^= a0;\
    a0  = SPH_ROTL32(a0, 10);\
    a0 ^= a4;\
    a4  = SPH_ROTL32(a4, 1);

  

cuda_x11_cubehash512.cu:
  
#define ROTATEUPWARDS7(a) (((a) << 7) | ((a) >> 25))
#define ROTATEUPWARDS11(a) (((a) << 11) | ((a) >> 21))

------->

#define ROTATEUPWARDS7(a) (SPH_ROTL32(a, 7))
#define ROTATEUPWARDS11(a) (SPH_ROTL32(a, 11))

etc..
Those I tried... no difference (unless they are in a big loop, it doesn't matter much).
legendary
Activity: 1400
Merit: 1050
djm34:

ccminer / x13 / cuda_shabal512.cu:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm)    { \
      xa0 = T32((xa0 \
         ^ (((xa1 << 15) | (xa1 >> 17)) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(((xb0 << 1) | (xb0 >> 31)) ^ xa0)); \
   }
   
Rewrite to:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm)    { \
      xa0 = T32((xa0 \
         ^ (SPH_ROTL32(xa1, 15) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(SPH_ROTL32(xb0, 1) ^ xa0)); \
   }


I have disassembled the latest binary; the funnel shift is still not used in many of the algorithms (x11/x13 etc.). I am too lazy to make a build, and I don't have a Maxwell card here to test.

(cuda_shabal512.cu: compute 3_5)

.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

I tried on some algos, but it doesn't make much difference,
mostly because ccminer doesn't spend a lot of time on these algos.
Shabal represents something less than 5% of the overall time.

Things which need improvement:
on 750ti: echo, groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%, 12.5%, 12.1%, 7% resp.), whirlpool only 6.9%

(numbers are from an unreleased Xxx algo.  Grin)

So, sure, one can certainly gain a little on luffa and shabal, but it won't improve the overall performance of the algo.
(I tried on luffa actually...).
The card isn't bottlenecked by computing time (I mean calculation) but by reading those giant lookup tables (actually I am wondering if it wouldn't be faster to replace them with the original calculation...)
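
The point about time shares can be made concrete with Amdahl's law. The sketch below is only an illustration: the function name is made up, and the speedup factor of 2 used in the note afterwards is hypothetical, not a measured result.

```c
#include <assert.h>

/* Overall speedup when a fraction p of the run time is made s times faster
   (Amdahl's law). p is in [0, 1], s is greater than 0. */
static double overall_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With shabal at 5% of the time, doubling its speed gives overall_speedup(0.05, 2.0), about 1.026, so under 3% overall. Doing the same for hamsi at 15.9% on the 780ti gives about 1.086.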

if you can tell me how to make
.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

in one line, I will be happy though...


sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
cuda_x11_luffa512.cu:

#define MIXWORD(a0,a4)\
    a4 ^= a0;\
    a0  = (a0<<2) | (a0>>(30));\
    a0 ^= a4;\
    a4  = (a4<<14) | (a4>>(18));\
    a4 ^= a0;\
    a0  = (a0<<10) | (a0>>(22));\
    a0 ^= a4;\
    a4  = (a4<<1) | (a4>>(31));   

-------->

#define MIXWORD(a0,a4)\
    a4 ^= a0;\
    a0  = SPH_ROTL32(a0, 2);\
    a0 ^= a4;\
    a4  = SPH_ROTL32(a4, 14);\
    a4 ^= a0;\
    a0  = SPH_ROTL32(a0, 10);\
    a0 ^= a4;\
    a4  = SPH_ROTL32(a4, 1);

   

cuda_x11_cubehash512.cu:
   
#define ROTATEUPWARDS7(a) (((a) << 7) | ((a) >> 25))
#define ROTATEUPWARDS11(a) (((a) << 11) | ((a) >> 21))

------->

#define ROTATEUPWARDS7(a) (SPH_ROTL32(a, 7))
#define ROTATEUPWARDS11(a) (SPH_ROTL32(a, 11))

etc..
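
A quick host-side way to check a rewrite like the MIXWORD one before building the miner, sketched in plain C. ROTL32 here is a stand-in for SPH_ROTL32 and assumes it is a plain 32-bit rotate-left; the function names are made up for illustration. Note that the last step rotates a4, matching the shift-based original.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for SPH_ROTL32 (assumption: plain 32-bit rotate-left, 0 < n < 32). */
#define ROTL32(x, n) ((uint32_t)(((x) << (n)) | ((x) >> (32 - (n)))))

/* MIXWORD with explicit shift/or pairs, as in the luffa source quoted above. */
static void mixword_shift(uint32_t *a0, uint32_t *a4)
{
    *a4 ^= *a0;
    *a0  = (*a0 << 2)  | (*a0 >> 30);
    *a0 ^= *a4;
    *a4  = (*a4 << 14) | (*a4 >> 18);
    *a4 ^= *a0;
    *a0  = (*a0 << 10) | (*a0 >> 22);
    *a0 ^= *a4;
    *a4  = (*a4 << 1)  | (*a4 >> 31);
}

/* The same steps written with the rotate helper. */
static void mixword_rotl(uint32_t *a0, uint32_t *a4)
{
    *a4 ^= *a0;
    *a0  = ROTL32(*a0, 2);
    *a0 ^= *a4;
    *a4  = ROTL32(*a4, 14);
    *a4 ^= *a0;
    *a0  = ROTL32(*a0, 10);
    *a0 ^= *a4;
    *a4  = ROTL32(*a4, 1);
}

/* Asserts that both variants produce the same pair for the given inputs. */
static void mixword_check(uint32_t a0, uint32_t a4)
{
    uint32_t p0 = a0, p4 = a4, q0 = a0, q4 = a4;
    mixword_shift(&p0, &p4);
    mixword_rotl(&q0, &q4);
    assert(p0 == q0 && p4 == q4);
}
```

This only confirms that the two forms compute the same values; whether the compiler turns the rotate into a single instruction still has to be checked in the generated ptx.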
full member
Activity: 247
Merit: 100
Does anyone mine JPC @ dwarfpool and get a lot of "booo's"? For three days or so (since the fork), I only get about 88-90% good shares shown in the console (ubuntu 14.04 x64). But the admin of dwarfpool checked my shares, and they are up to 100% accepted. Why does it show only ~90% good shares on my side, when on the pool everything I send is accepted? Never had that problem with JPC before.

Hi.
Mining there with 3x750ti, no booo's at all in my case. Using official ccminer v1.2, getting 5.2-5.5 mhs per card.

Same here! More than 10% of booos. Normally it was 0.5-1.0%.
Running 6x750Ti, winXP_32.
 
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
cuda_x11_luffa512.cu:

#define TWEAK(a0,a1,a2,a3,j)\
    a0 = (a0<<(j))|(a0>>(32-j));\
    a1 = (a1<<(j))|(a1>>(32-j));\
    a2 = (a2<<(j))|(a2>>(32-j));\
    a3 = (a3<<(j))|(a3>>(32-j));
   
------->

#define TWEAK(a0,a1,a2,a3,j)\
    a0 = SPH_ROTL32(a0, j);\
    a1 = SPH_ROTL32(a1, j);\
    a2 = SPH_ROTL32(a2, j);\
    a3 = SPH_ROTL32(a3, j);
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
djm34:

ccminer / x13 / cuda_shabal512.cu:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm)    { \
      xa0 = T32((xa0 \
         ^ (((xa1 << 15) | (xa1 >> 17)) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(((xb0 << 1) | (xb0 >> 31)) ^ xa0)); \
   }
   
Rewrite to:

#define PERM_ELT(xa0, xa1, xb0, xb1, xb2, xb3, xc, xm)    { \
      xa0 = T32((xa0 \
         ^ (SPH_ROTL32(xa1, 15) * 5U) \
         ^ xc) * 3U) \
         ^ xb1 ^ (xb2 & ~xb3) ^ xm; \
      xb0 = T32(~(SPH_ROTL32(xb0, 1) ^ xa0)); \
   }


I have disassembled the latest binary; the funnel shift is still not used in many of the algorithms (x11/x13 etc.). I am too lazy to make a build, and I don't have a Maxwell card here to test.

(cuda_shabal512.cu: compute 3_5)

.reg .b32 %rhs;
shl.b32 %lhs, %r321, 17;
shr.b32 %rhs, %r321, 15;
add.u32 %r322, %lhs, %rhs;

can be reduced to one instruction.
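
The shl/shr/add triple above is a rotate-left by 17 written out the long way. A small host-side sketch in plain C shows the equivalence; the function names are made up for illustration. As far as I know, the single-instruction forms are shf.l.wrap.b32 in PTX and the __funnelshift_l(x, x, n) intrinsic in CUDA device code on hardware that has the funnel shifter. They are not used below because they are device-only.

```c
#include <assert.h>
#include <stdint.h>

/* Portable rotate-left; valid for 0 < n < 32. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    assert(n > 0 && n < 32);
    return (x << n) | (x >> (32 - n));
}

/* The three-instruction sequence from the disassembly: shl 17, shr 15, add. */
static uint32_t shl_shr_add(uint32_t x)
{
    uint32_t lhs = x << 17;
    uint32_t rhs = x >> 15;
    /* lhs and rhs have no set bits in common, so add behaves like or. */
    return lhs + rhs;
}
```

Because the two shifted halves never overlap, the add never carries, which is why the compiler is free to emit add instead of or here.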
sr. member
Activity: 602
Merit: 250
Any plans to work the split miner layout into the next few versions of ccminer or nvminer?
full member
Activity: 168
Merit: 100
I uploaded a new copy.  Give it another shot.

Let me know the outcome.

Carlo

Does this also support whirlcoin, like djm34's?

Just curious before downloading.

Yes and the description also gives hash rates for each algo.
legendary
Activity: 1400
Merit: 1050
I just wanted to say thanks, guys, for all the solid work on the new versions of ccminer, nvminer and that profit calculator. I just completed my AMD to nvidia conversion and wanted to share my progress. Many headaches, but it's now rocking and rolling. 900+ pages and going strong!

https://i.imgur.com/WyZVcYi.jpg




Nice,
I am also replacing my 280x with 750ti (40 to 65), will save like 1/2 of the power costs.


Here is a recommendation: don't pick up any used cards on ebay. I got sent some cards that were problematic, and they all came off ebay. Luckily I was able to swap them out for new ones thanks to Amazon's lenient returns policy.

Thanks for the recommendation.
I only use new cards with warranty, had a series of 280x from Sapphire where Capacitors just blew after 2 weeks. Was no issue getting them refunded.
Smiley

lol, buying used cards, you probably get cards which were used 24/7 non-stop by miners  Grin
legendary
Activity: 1400
Merit: 1000
I uploaded a new copy.  Give it another shot.

Let me know the outcome.

Carlo

Does this also support whirlcoin, like djm34's?

Just curious before downloading.
crz
member
Activity: 116
Merit: 10
I uploaded a new copy.  Give it another shot.

Let me know the outcome.

Carlo

Working perfectly, thanks!
full member
Activity: 168
Merit: 100
I uploaded a new copy.  Give it another shot.

Let me know the outcome.

Carlo