[ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 228.

tsiv

full member

Activity: 137

Merit: 100

Quote from: cayars on July 23, 2014, 07:37:45 AM

tsiv,

Wouldn't you want to have the block size be a multiple of 32? I.e. 32, 64, 96, 128

Yeah, full warps do sound tasty. We're starting to get there too. The launch config isn't exactly about threads per block anymore: the kernels are starting to use more than one thread per hash, so the launch config is actually hashes per block and blocks per grid. For example, the kernels I modified earlier are now running eight threads per hash, so they're already at full warp size at four hashes per block. The latest experimental build takes the slowest kernel, which runs only a single thread per hash in the latest committed source, and spreads it across four threads per hash. Again, that's a full warp at eight hashes per block, while four hashes per block remains kinda iffy.
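To make the arithmetic concrete, here is a minimal host-side sketch in plain C. It is not taken from ccminer: the helper names and the WARP_SIZE constant are assumptions for illustration, with 32 threads per warp being the figure behind the "multiple of 32" suggestion.

```c
#include <stdbool.h>

/* Assumption: 32 threads per warp, the figure implied by "multiple of 32". */
#define WARP_SIZE 32u

/* Hypothetical helper: threads launched per block when every hash is
 * spread over several cooperating threads. */
static unsigned threads_per_block(unsigned hashes_per_block,
                                  unsigned threads_per_hash)
{
    return hashes_per_block * threads_per_hash;
}

/* Hypothetical helper: true when the block consists of whole warps only. */
static bool fills_whole_warps(unsigned hashes_per_block,
                              unsigned threads_per_hash)
{
    unsigned threads = threads_per_block(hashes_per_block, threads_per_hash);
    return threads != 0 && threads % WARP_SIZE == 0;
}
```

With eight threads per hash, fills_whole_warps(4, 8) holds because the block has 32 threads. With four threads per hash, 8 hashes per block fills a warp, while 4 hashes per block gives only 16 threads, i.e. half a warp.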

cayars

full member

Activity: 168

Merit: 100

Quote from: djm34 on July 23, 2014, 08:40:40 AM

x17 added to my github repository.
https://github.com/djm34/ccminer

windows binaries here: https://mega.co.nz/#!EEEElQ7Z!J77zXN1d6pTgHgGIhsJ1BzUkuE8IPyqS4_QyP7lm3Wk
(compiled with CUDA 6.5)

ccminer -a x17

donation: XjPqpkCPoYJJYdQRrVByU7ySpVyeqJmSGU

CCMiner algos:
anime (C&C)
cryptonight (tsiv)
dmd-gr (Bombadil)
fresh (djm34)
fugue256 (C&C)
groestl (C&C)
heavy (C&C-based off reorder's cgminer code)
jackpot (C&C)
mjollnir (C&C-based off reorder's cgminer code)
myr-gr (C&C)
nist5 (C&C)
quark (C&C)
qubit (djm34)
Whirlcoin (djm34)
x11 (C&C)
x13 (C&C)
x14 (djm34)
x15 (djm34)
x17 (djm34)

1 Bombadil
1 tsiv
6 djm34
11 C&C

djm34 is on a massive roll!

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 23, 2014, 08:12:18 AM

Quote from: djm34 on July 23, 2014, 06:32:15 AM

The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions from 32 to 64 bit are rather slow...
(won't happen this week)

But the reward could be significant.

From your previous comment:

Things which need improvement:
on 750ti: echo, groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%, 12.5%, 12.1%, 7% respectively); whirlpool only 6.9%

Keep up the good work.

Yes, but there is also a new algo coming too... ;D

djm34

legendary

Activity: 1400

Merit: 1050

x17 added to my github repository.
https://github.com/djm34/ccminer

windows binaries here: https://mega.co.nz/#!EEEElQ7Z!J77zXN1d6pTgHgGIhsJ1BzUkuE8IPyqS4_QyP7lm3Wk
(compiled with CUDA 6.5)

ccminer -a x17

donation: XjPqpkCPoYJJYdQRrVByU7ySpVyeqJmSGU

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: djm34 on July 23, 2014, 06:32:15 AM

The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions from 32 to 64 bit are rather slow...
(won't happen this week)

But the reward could be significant.

From your previous comment:

Things which need improvement:
on 750ti: echo, groestl, whirlpool, hamsi (13%, 12.1%, 10.4%, 9.9% respectively)
on 780ti: hamsi, groestl, echo, fugue (15.9%, 12.5%, 12.1%, 7% respectively); whirlpool only 6.9%

Keep up the good work.

cayars

full member

Activity: 168

Merit: 100

Yea, lately that is true.

I think djm34 has as many algos in ccminer as Christian does now, if not more.

Carlo

EDIT:
CCMiner algos:
anime (C&C)
cryptonight (tsiv)
dmd-gr (Bombadil)
fresh (djm34)
fugue256 (C&C)
groestl (C&C)
heavy (C&C-based off reorder's cgminer code)
jackpot (C&C)
mjollnir (C&C-based off reorder's cgminer code)
myr-gr (C&C)
nist5 (C&C)
quark (C&C)
qubit (djm34)
Whirlcoin (djm34)
x11 (C&C)
x13 (C&C)
x14 (djm34)
x15 (djm34)

1 Bombadil
1 tsiv
5 djm34
11 C&C

Soon:
boolberry - C&C???
ppl - djm34???

Bombadil

hero member

Activity: 644

Merit: 500

Quote from: cayars on July 23, 2014, 07:50:54 AM

Hey Christian,

You taking a siesta? ;D

Christian is our Satoshi Nakamoto, if you know what I mean :D

cayars

full member

Activity: 168

Merit: 100

Hey Christian,

You taking a siesta? ;D

cayars

full member

Activity: 168

Merit: 100

tsiv,

Wouldn't you want to have the block size be a multiple of 32? I.e. 32, 64, 96, 128

tsiv

full member

Activity: 137

Merit: 100

Quote from: DrAlco on July 23, 2014, 06:23:21 AM

Quote from: tsiv on July 23, 2014, 12:27:44 AM

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in them would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info. You'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2; 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Hashrate improved by about 70 H/s on a 780 Ti, up from 320 to about 390 (using 8x60). It also doesn't seem to hang and bring the system to its knees when using all GFX cards.

Seems to be in line with the ~18% improvement I saw when benchmarking only the AES part of the kernel. Have you tried other configs? 390 is still pretty low for a 780 Ti; I think people were getting the best results with 4x120 on the 780 Ti.

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 23, 2014, 12:43:47 AM

Quote from: djm34 on July 22, 2014, 06:03:32 PM

Quote from: sp_ on July 22, 2014, 05:34:00 PM

Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
   SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
   SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
   SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
   SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

------>

   sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.

But it does when you convert the data structure to 64-bit. Put hamsi_s00 in the upper 32 bits of a 64-bit register and hamsi_s01 in the lower 32 bits. Then you process twice the data with the same assembly instructions that you had previously (but in 64-bit).

// 64-bit operands take the "l" constraint in CUDA inline PTX; "r" is for 32-bit registers.
uint64_t t;
t = a;
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(c));
asm("xor.b64 %0,%0,%1;" : "+l"(a) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(a));
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(c));
b=d;
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(a));
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(a));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(t));
a=c;
c=b;
b=d;
asm("not.b64 %0,%1;" : "=l"(d) : "l"(t));....

In x13/cuda_x13_hamsi512.cu, the ROUND_BIG(rc, alpha) macro should be rewritten to operate on 64-bit integers.

The problem is that it would be necessary to convert the entire algo to 64-bit, as conversions from 32 to 64 bit are rather slow...
(won't happen this week)

DrAlco

newbie

Activity: 43

Merit: 0

Quote from: tsiv on July 23, 2014, 12:27:44 AM

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in them would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info. You'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2; 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Hashrate improved by about 70 H/s on a 780 Ti, up from 320 to about 390 (using 8x60). It also doesn't seem to hang and bring the system to its knees when using all GFX cards.

Bombadil

hero member

Activity: 644

Merit: 500

Quote from: tsiv on July 23, 2014, 12:27:44 AM

Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single-thread-per-hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations: 15 blocks of 32 threads works out equally well as the original 8x60 magic bullet for the 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in them would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info. You'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2; 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Wolf0 also started on modding your ccminer-mod :D

https://bitcointalksearch.org/topic/working-improved-cryptonight-cuda-miner-based-on-tsivs-work-701910

PVmining

sr. member

Activity: 330

Merit: 252

Quote from: tsiv on July 23, 2014, 12:27:44 AM

Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive.

Thanks for trying it!

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: djm34 on July 22, 2014, 06:03:32 PM

Quote from: sp_ on July 22, 2014, 05:34:00 PM

Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
   SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
   SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
   SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
   SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

------>

   sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
   sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
   sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
   sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.

But it does when you convert the data structure to 64-bit. Put hamsi_s00 in the upper 32 bits of a 64-bit register and hamsi_s01 in the lower 32 bits. Then you process twice the data with the same assembly instructions that you had previously (but in 64-bit).

// 64-bit operands take the "l" constraint in CUDA inline PTX; "r" is for 32-bit registers.
uint64_t t;
t = a;
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(c));
asm("xor.b64 %0,%0,%1;" : "+l"(a) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(c) : "l"(a));
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(c));
b=d;
asm( "or.b64 %0,%0,%1;" : "+l"(d) : "l"(t));
asm("xor.b64 %0,%0,%1;" : "+l"(d) : "l"(a));
asm("and.b64 %0,%0,%1;" : "+l"(a) : "l"(b));
asm("xor.b64 %0,%0,%1;" : "+l"(t) : "l"(a));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(d));
asm("xor.b64 %0,%0,%1;" : "+l"(b) : "l"(t));
a=c;
c=b;
b=d;
asm("not.b64 %0,%1;" : "=l"(d) : "l"(t));....

In x13/cuda_x13_hamsi512.cu, the ROUND_BIG(rc, alpha) macro should be rewritten to operate on 64-bit integers.
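The packing idea can be checked on the host with a rough sketch in plain C. This is an illustration only, not code from ccminer: DEFINE_SBOX, sbox32, sbox64, pack and packed_matches_scalar are hypothetical names, and the S-box body simply follows the and/xor/or/not order of the asm sequence above.

```c
#include <stdint.h>

/* Generates the S-box for word type T. The statement order follows the
 * and/xor/or/not sequence of the asm in the post. */
#define DEFINE_SBOX(name, T)                              \
    static void name(T *a, T *b, T *c, T *d)              \
    {                                                     \
        T t = *a;                                         \
        *a &= *c;  *a ^= *d;                              \
        *c ^= *b;  *c ^= *a;                              \
        *d |= t;   *d ^= *b;                              \
        t ^= *c;                                          \
        *b = *d;                                          \
        *d |= t;   *d ^= *a;                              \
        *a &= *b;                                         \
        t ^= *a;                                          \
        *b ^= *d;  *b ^= t;                               \
        *a = *c;   *c = *b;   *b = *d;                    \
        *d = (T)~t;                                       \
    }

DEFINE_SBOX(sbox32, uint32_t)
DEFINE_SBOX(sbox64, uint64_t)

/* Hypothetical helper: first word in the upper 32 bits, second in the lower. */
static uint64_t pack(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 when one 64-bit S-box on packed words gives the same result
 * as two separate 32-bit S-boxes on lane 0 and lane 1. */
static int packed_matches_scalar(uint32_t a0, uint32_t b0, uint32_t c0, uint32_t d0,
                                 uint32_t a1, uint32_t b1, uint32_t c1, uint32_t d1)
{
    uint64_t a = pack(a0, a1), b = pack(b0, b1);
    uint64_t c = pack(c0, c1), d = pack(d0, d1);

    sbox64(&a, &b, &c, &d);      /* one pass over both lanes */
    sbox32(&a0, &b0, &c0, &d0);  /* reference: lane 0 alone  */
    sbox32(&a1, &b1, &c1, &d1);  /* reference: lane 1 alone  */

    return a == pack(a0, a1) && b == pack(b0, b1) &&
           c == pack(c0, c1) && d == pack(d0, d1);
}
```

The two results agree because AND, OR, XOR and NOT act on each bit independently, so the two 32-bit halves never interact. Any step that rotates or shifts 32-bit words would not carry over this directly and would need separate handling.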

tsiv

full member

Activity: 137

Merit: 100

Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single-thread-per-hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations: 15 blocks of 32 threads works out equally well as the original 8x60 magic bullet for the 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking for something else to improve. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in them would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info. You'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2; 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip
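For anyone picking a launch config to try, the rule of thumb can be written down as a small host-side check in plain C. This is only a sketch: the function names are made up for illustration, and it assumes the NxM notation means threads per block x blocks, which is consistent with "15 blocks of 32 threads" matching 8x60 in total thread count.

```c
#include <stdbool.h>

/* Hypothetical helper: true for 1, 2, 4, 8, 16, ... */
static bool is_power_of_two(unsigned n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Total threads in flight for a threads-per-block x blocks launch. */
static unsigned total_threads(unsigned threads, unsigned blocks)
{
    return threads * blocks;
}

/* Rule of thumb from the post: block count a multiple of the SM count,
 * threads per block a power of two. sm_count is the SMX/SMM number the
 * miner prints at thread start-up. */
static bool matches_rule_of_thumb(unsigned threads, unsigned blocks,
                                  unsigned sm_count)
{
    return sm_count != 0 && blocks % sm_count == 0 && is_power_of_two(threads);
}
```

Assuming 5 SMs for the 750 Ti (a figure not stated in the post), both 8x60 and 32x15 pass the check, and both come to 480 threads in total.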

bitcoinvideos

sr. member

Activity: 251

Merit: 250

Just thought I'd say that DeepCoin is still hella mineable...very ninja type launch on Qubit algo...very under the radar

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on July 22, 2014, 05:34:00 PM

Replace SBOX with sbox_pipelined

In the code:

SBOX(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18); \
SBOX(hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
SBOX(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A); \
SBOX(hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
SBOX(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C); \
SBOX(hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
SBOX(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E); \
SBOX(hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

------>

sbox_pipelined(hamsi_s00, hamsi_s08, hamsi_s10, hamsi_s18,hamsi_s01, hamsi_s09, hamsi_s11, hamsi_s19); \
sbox_pipelined(hamsi_s02, hamsi_s0A, hamsi_s12, hamsi_s1A,hamsi_s03, hamsi_s0B, hamsi_s13, hamsi_s1B); \
sbox_pipelined(hamsi_s04, hamsi_s0C, hamsi_s14, hamsi_s1C,hamsi_s05, hamsi_s0D, hamsi_s15, hamsi_s1D); \
sbox_pipelined(hamsi_s06, hamsi_s0E, hamsi_s16, hamsi_s1E,hamsi_s07, hamsi_s0F, hamsi_s17, hamsi_s1F); \

ok I tried, but again it doesn't make a difference.
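The body of sbox_pipelined is not shown in the thread, so the following is only a guess at the general idea, written as host-side C: two independent S-boxes with their statements interleaved. The names sbox_single, sbox_pair and pair_matches_single are hypothetical, and the operation order is taken from the and/xor/or/not S-box sequence posted elsewhere in this thread.

```c
#include <stdint.h>

/* Reference S-box on four 32-bit words. */
static void sbox_single(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
    uint32_t t = *a;
    *a &= *c;  *a ^= *d;
    *c ^= *b;  *c ^= *a;
    *d |= t;   *d ^= *b;
    t ^= *c;
    *b = *d;
    *d |= t;   *d ^= *a;
    *a &= *b;
    t ^= *a;
    *b ^= *d;  *b ^= t;
    *a = *c;   *c = *b;   *b = *d;
    *d = ~t;
}

/* Hypothetical two-at-a-time variant (NOT the real sbox_pipelined): the
 * statements of two independent S-boxes alternate, so neither chain has
 * to wait on the other's result. */
static void sbox_pair(uint32_t *a0, uint32_t *b0, uint32_t *c0, uint32_t *d0,
                      uint32_t *a1, uint32_t *b1, uint32_t *c1, uint32_t *d1)
{
    uint32_t t0 = *a0;   uint32_t t1 = *a1;
    *a0 &= *c0;          *a1 &= *c1;
    *a0 ^= *d0;          *a1 ^= *d1;
    *c0 ^= *b0;          *c1 ^= *b1;
    *c0 ^= *a0;          *c1 ^= *a1;
    *d0 |= t0;           *d1 |= t1;
    *d0 ^= *b0;          *d1 ^= *b1;
    t0 ^= *c0;           t1 ^= *c1;
    *b0 = *d0;           *b1 = *d1;
    *d0 |= t0;           *d1 |= t1;
    *d0 ^= *a0;          *d1 ^= *a1;
    *a0 &= *b0;          *a1 &= *b1;
    t0 ^= *a0;           t1 ^= *a1;
    *b0 ^= *d0;          *b1 ^= *d1;
    *b0 ^= t0;           *b1 ^= t1;
    *a0 = *c0;           *a1 = *c1;
    *c0 = *b0;           *c1 = *b1;
    *b0 = *d0;           *b1 = *d1;
    *d0 = ~t0;           *d1 = ~t1;
}

/* Returns 1 when the interleaved pair matches two separate S-box calls. */
static int pair_matches_single(uint32_t a0, uint32_t b0, uint32_t c0, uint32_t d0,
                               uint32_t a1, uint32_t b1, uint32_t c1, uint32_t d1)
{
    uint32_t p[8] = { a0, b0, c0, d0, a1, b1, c1, d1 };

    sbox_pair(&p[0], &p[1], &p[2], &p[3], &p[4], &p[5], &p[6], &p[7]);
    sbox_single(&a0, &b0, &c0, &d0);
    sbox_single(&a1, &b1, &c1, &d1);

    return p[0] == a0 && p[1] == b0 && p[2] == c0 && p[3] == d0 &&
           p[4] == a1 && p[5] == b1 && p[6] == c1 && p[7] == d1;
}
```

One possible explanation for seeing no difference: the eight SBOX invocations are already independent and expand inline, so the compiler may be interleaving them on its own.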

djm34

legendary

Activity: 1400

Merit: 1050