Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 228. (Read 3426989 times)

full member
Activity: 137
Merit: 100
tsiv,

Wouldn't you want the block size to be a multiple of 32? I.e. 32, 64, 96, 128.

Ye, full warps do sound tasty. We're starting to get there too. The launch config isn't exactly about threads per block anymore, the kernels are starting to use more than one thread per hash and the launch config is actually hashes per block and blocks per grid. For example the kernels I modified earlier are now running eight threads per hash, so they're actually already at full warp size at four hashes per block. The latest experimental build takes the slowest kernel that is running only a single thread per hash on the latest committed source and spreads it out between four threads per hash. Again, full warp at eight hashes per block while four hashes per block remains kinda iffy.
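To make the arithmetic concrete, here is a minimal sketch in plain C. It is not code from ccminer, and the function names are hypothetical. It only shows how hashes per block and threads per hash combine into threads per block, and when that comes out as whole 32-thread warps.

```c
#include <assert.h>

/* Warp width on NVIDIA GPUs. */
enum { WARP_SIZE = 32 };

/* Threads actually launched per block when each hash is spread over
 * several cooperating threads (hypothetical helper). */
int threads_per_block(int hashes_per_block, int threads_per_hash)
{
    assert(hashes_per_block > 0 && threads_per_hash > 0);
    return hashes_per_block * threads_per_hash;
}

/* Non-zero when the block is made of whole warps only. */
int fills_whole_warps(int hashes_per_block, int threads_per_hash)
{
    return threads_per_block(hashes_per_block, threads_per_hash) % WARP_SIZE == 0;
}
```

With eight threads per hash, four hashes per block gives 32 threads, one full warp. With four threads per hash, four hashes per block gives only 16 threads, which is the "kinda iffy" half-warp case.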
full member
Activity: 168
Merit: 100
x17 added to my github repository.
https://github.com/djm34/ccminer

windows binaries here: https://mega.co.nz/#!EEEElQ7Z!J77zXN1d6pTgHgGIhsJ1BzUkuE8IPyqS4_QyP7lm3Wk
(compiled with CUDA 6.5)

ccminer -a x17

donation: XjPqpkCPoYJJYdQRrVByU7ySpVyeqJmSGU



CCMiner algos:
anime (C&C)
cryptonight (tsiv)
dmd-gr (Bombadil)
fresh (djm34)
fugue256 (C&C)
groestl (C&C)
heavy (C&C-based off reorder's cgminer code)
jackpot (C&C)
mjollnir (C&C-based off reorder's cgminer code)
myr-gr (C&C)
nist5 (C&C)
quark (C&C)
qubit (djm34)
Whirlcoin (djm34)
x11 (C&C)
x13 (C&C)
x14 (djm34)
x15 (djm34)
x17 (djm34)

1 Bombadil
1 tsiv
6 djm34
11 C&C

djm34 is on a massive roll!
legendary
Activity: 1400
Merit: 1050
The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions between 32-bit and 64-bit are rather slow...
(won't happen this week)

But the reward could be significant. Smiley


From  your previous comment:

Things which need improvement:
on 750ti: echo , groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%; 12.5%; 12.1%; 7% resp.) whirlpool only 6.9%

Keep up the good work.
Yes, but there is also a new algo coming too...  Grin
legendary
Activity: 1400
Merit: 1050
x17 added to my github repository.
https://github.com/djm34/ccminer

windows binaries here: https://mega.co.nz/#!EEEElQ7Z!J77zXN1d6pTgHgGIhsJ1BzUkuE8IPyqS4_QyP7lm3Wk
(compiled with CUDA 6.5)

ccminer -a x17

donation: XjPqpkCPoYJJYdQRrVByU7ySpVyeqJmSGU

sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions between 32-bit and 64-bit are rather slow...
(won't happen this week)

But the reward could be significant. Smiley


From  your previous comment:

Things which need improvement:
on 750ti: echo , groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%; 12.5%; 12.1%; 7% resp.) whirlpool only 6.9%

Keep up the good work.
full member
Activity: 168
Merit: 100
Yea, lately that is true. Smiley

I think djm34 has as many algos in ccminer as Christian does now, if not more.

Carlo

EDIT:
CCMiner algos:
anime (C&C)
cryptonight (tsiv)
dmd-gr (Bombadil)
fresh (djm34)
fugue256 (C&C)
groestl (C&C)
heavy (C&C-based off reorder's cgminer code)
jackpot (C&C)
mjollnir (C&C-based off reorder's cgminer code)
myr-gr (C&C)
nist5 (C&C)
quark (C&C)
qubit (djm34)
Whirlcoin (djm34)
x11 (C&C)
x13 (C&C)
x14 (djm34)
x15 (djm34)

1 Bombadil
1 tsiv
5 djm34
11 C&C

Soon:
boolberry - C&C???
ppl - djm34???
hero member
Activity: 644
Merit: 500
Hey Christian,

You taking a siesta?  Grin

Christian is our Satoshi Nakamoto, if you know what I mean Cheesy
full member
Activity: 168
Merit: 100
Hey Christian,

You taking a siesta?  Grin
full member
Activity: 168
Merit: 100
tsiv,

Wouldn't you want the block size to be a multiple of 32? I.e. 32, 64, 96, 128.
full member
Activity: 137
Merit: 100
At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti would be willing to take it for a spin, I'd appreciate it. I've added the SMX/SMM/Whateverthingmabobs count to the miner thread start-up info; you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Improved hashrate by about 70H/s on a 780ti. Up from 320 to about 390 (using 8x60). Also doesn't seem to hang and bring the system to its knees when using all GFX cards.

Seems to be in line with the ~18% improvements I saw when benchmarking only the AES part of the kernel. Have you tried other configs? 390 is still pretty low for a 780 Ti, I think people were getting best results with 4x120 on the 780 Ti.
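The rule of thumb quoted above (block count a multiple of the SMX count, thread count a power of 2) can be written as a small check. This is only a sketch in plain C with hypothetical names, not anything from the miner. It assumes a launch string like 8x60 means 8 threads by 60 blocks, and the SM counts in the note below are assumptions that the thread itself does not state.

```c
#include <assert.h>

/* Non-zero when n is a positive power of two. */
int is_power_of_two(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

/* Checks a launch config against the rule of thumb:
 * threads is a power of 2 and blocks is a multiple of the SM count.
 * (Hypothetical helper; sm_count is whatever the start-up info reports.) */
int follows_rule_of_thumb(int threads, int blocks, int sm_count)
{
    assert(sm_count > 0);
    return is_power_of_two(threads) && blocks > 0 && blocks % sm_count == 0;
}
```

Assuming 15 SMX on a 780 Ti, both 8x60 and 4x120 pass the check, so the rule alone does not say which one is faster; that still has to be measured. Assuming 5 SMM on a 750 Ti, 8x60 and 32x15 both pass as well.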
legendary
Activity: 1400
Merit: 1050
Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
      SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
      SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
      SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
      SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
      SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
      SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
      SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \


------>

   sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.

But it does when you convert the data structure to 64-bit. Put hamsi_s00 in the upper 32 bits of the 64-bit register and hamsi_s01 in the lower 32 bits. Then you will process twice the data with the same assembly instructions that you had previously (but in 64-bit).


uint64_t t;
t = a;
// "l" binds a 64-bit register in CUDA inline PTX; "r" is for 32-bit operands only
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(c));
asm("xor.b64 %0,%0,%1;" : "+l"(a) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(a));
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(c));
b=d;
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(a));
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(a));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(t));
a=c;
c=b;
b=d;
asm("not.b64 %0,%1;" : "=l"(d) : "l"(t));....


x13 / cuda_x13_hamsi512.cu /

#define ROUND_BIG(rc, alpha) { should be rewritten to operate on 64bit integers.


The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions between 32-bit and 64-bit are rather slow...
(won't happen this week)
newbie
Activity: 43
Merit: 0
At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti would be willing to take it for a spin, I'd appreciate it. I've added the SMX/SMM/Whateverthingmabobs count to the miner thread start-up info; you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Improved hashrate by about 70H/s on a 780ti. Up from 320 to about 390 (using 8x60). Also doesn't seem to hang and bring the system to its knees when using all GFX cards.
hero member
Activity: 644
Merit: 500
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single thread per hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations; 15 blocks of 32 threads works out equally well as the original 8x60 magic bullet for 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti would be willing to take it for a spin, I'd appreciate it. I've added the SMX/SMM/Whateverthingmabobs count to the miner thread start-up info; you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Wolf0 also started on modding your ccminer-mod Cheesy https://bitcointalksearch.org/topic/working-improved-cryptonight-cuda-miner-based-on-tsivs-work-701910
sr. member
Activity: 330
Merit: 252
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive.

Thanks for trying it!
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
      SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
      SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
      SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
      SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
      SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
      SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
      SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \


------>

   sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.

But it does when you convert the data structure to 64-bit. Put hamsi_s00 in the upper 32 bits of the 64-bit register and hamsi_s01 in the lower 32 bits. Then you will process twice the data with the same assembly instructions that you had previously (but in 64-bit).


uint64_t t;
t = a;
// "l" binds a 64-bit register in CUDA inline PTX; "r" is for 32-bit operands only
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(c));
asm("xor.b64 %0,%0,%1;" : "+l"(a) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(a));
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(c));
b=d;
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(a));
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(a));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(t));
a=c;
c=b;
b=d;
asm("not.b64 %0,%1;" : "=l"(d) : "l"(t));....


x13 / cuda_x13_hamsi512.cu /

#define ROUND_BIG(rc, alpha) { should be rewritten to operate on 64bit integers.
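
The packing idea can be checked on the host before touching the kernel. Below is a minimal sketch in plain C, not code from ccminer. The S-box step order is transcribed from the snippet above, and all function names (sbox32, sbox64, pack, packed_matches_scalar) are hypothetical. Because the S-box uses only bitwise AND/OR/XOR/NOT, each bit position is independent, so one 64-bit pass should equal two separate 32-bit passes.

```c
#include <assert.h>
#include <stdint.h>

/* S-box on four 32-bit words, same step order as the snippet above. */
static void sbox32(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    uint32_t t = *a;
    *a &= *c;  *a ^= *d;
    *c ^= *b;  *c ^= *a;
    *d |= t;   *d ^= *b;
    t ^= *c;
    *b = *d;
    *d |= t;   *d ^= *a;
    *a &= *b;
    t ^= *a;
    *b ^= *d;  *b ^= t;
    *a = *c;   *c = *b;   *b = *d;
    *d = ~t;
}

/* Identical steps on 64-bit words holding two packed 32-bit lanes. */
static void sbox64(uint64_t *a, uint64_t *b, uint64_t *c, uint64_t *d)
{
    uint64_t t = *a;
    *a &= *c;  *a ^= *d;
    *c ^= *b;  *c ^= *a;
    *d |= t;   *d ^= *b;
    t ^= *c;
    *b = *d;
    *d |= t;   *d ^= *a;
    *a &= *b;
    t ^= *a;
    *b ^= *d;  *b ^= t;
    *a = *c;   *c = *b;   *b = *d;
    *d = ~t;
}

/* First word goes in the upper 32 bits, second in the lower 32 bits. */
static uint64_t pack(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 when one packed 64-bit S-box gives the same result as
 * running the 32-bit S-box on x[0..3] and y[0..3] separately. */
int packed_matches_scalar(const uint32_t x[4], const uint32_t y[4])
{
    uint32_t xs[4], ys[4];
    uint64_t p[4];
    int i;

    for (i = 0; i < 4; i++) {
        xs[i] = x[i];
        ys[i] = y[i];
        p[i] = pack(x[i], y[i]);
    }
    sbox32(&xs[0], &xs[1], &xs[2], &xs[3]);
    sbox32(&ys[0], &ys[1], &ys[2], &ys[3]);
    sbox64(&p[0], &p[1], &p[2], &p[3]);

    for (i = 0; i < 4; i++) {
        if (p[i] != pack(xs[i], ys[i]))
            return 0;
    }
    return 1;
}
```

This only shows that the packed form computes the same values. It says nothing about speed, and it does not cover the cost of packing and unpacking between the 32-bit and 64-bit layouts in the rest of the algorithm.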

full member
Activity: 137
Merit: 100
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single thread per hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations; 15 blocks of 32 threads works out equally well as the original 8x60 magic bullet for 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti would be willing to take it for a spin, I'd appreciate it. I've added the SMX/SMM/Whateverthingmabobs count to the miner thread start-up info; you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip
sr. member
Activity: 251
Merit: 250
Just thought I'd say that DeepCoin is still hella mineable...very ninja type launch on Qubit algo...very under the radar
legendary
Activity: 1400
Merit: 1050
Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
      SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
      SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
      SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
      SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
      SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
      SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
      SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \


------>

   sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.
legendary
Activity: 1400
Merit: 1050
What is the password to download the x15 file - 07/15/2014?

DA4AF09FE5377715856BA0B10A29C95867053ECBF4105DBDD8957DA78B4127E49E4717DD667CEEF B

Don't understand why nobody remembers it...  Grin

not and this is not
damn it, I can't remember either  Grin

 Embarrassed
I never put any password anywhere... (not sure what you downloaded actually...)
sr. member
Activity: 311
Merit: 250
What is the password to download the x15 file - 07/15/2014?

DA4AF09FE5377715856BA0B10A29C95867053ECBF4105DBDD8957DA78B4127E49E4717DD667CEEF B

Don't understand why nobody remembers it...  Grin

not and this is not
damn it, I can't remember either  Grin

 Embarrassed