Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 225. (Read 3426989 times)

full member
Activity: 252
Merit: 102
OPEN Platform - Powering Blockchain Acceptance
Linux users might be pleased to know that the profit switching capability of ccManager is coming along nicely, too. It uses TradeMyBit for now, and I've just coded a facility to stop mining on TMB altogether if the daily profit projection is poor. In this case it switches to an alternative pool of your choice (last resort pool), or it stops mining altogether and monitors TMB for a decent profit margin before starting again.

I should have the GitHub repo updated with something for you to play with sometime next week.
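For anyone wondering what that switching rule boils down to, here is a rough Python sketch of the decision as described in this post. It is not ccManager's code: `choose_action`, the `Action` names and the threshold parameter are all made up for illustration, and the profit projection is assumed to arrive as a plain BTC/day number.

```python
from enum import Enum


class Action(Enum):
    MINE_TMB = "mine on TradeMyBit"
    MINE_FALLBACK = "mine on the last-resort pool"
    IDLE = "stop mining and keep monitoring"


def choose_action(projected_btc_per_day, min_btc_per_day, fallback_pool=None):
    """Pick what to do until the next profit check (hypothetical helper)."""
    if projected_btc_per_day >= min_btc_per_day:
        return Action.MINE_TMB
    # Projection is poor: use the last-resort pool if one was configured,
    # otherwise stop and wait for a decent margin.
    if fallback_pool is not None:
        return Action.MINE_FALLBACK
    return Action.IDLE


print(choose_action(0.0020, 0.0030, fallback_pool="pool.example:3333").value)
```

The pool address in the usage line is a placeholder. A real manager would call something like this on a timer and only restart the miner when the returned action changes.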
hero member
Activity: 868
Merit: 1000
ouch i am getting left behind, my mining rig has been off over a week and this thread just looks like a developer chatroom  Cool it is great to see so many of you all working together, who's this Christian guy that releases stuff? I've never seen him here  Wink

He is Nvidia Satoshi

Retire behind the scene
sr. member
Activity: 350
Merit: 250
ouch i am getting left behind, my mining rig has been off over a week and this thread just looks like a developer chatroom  Cool it is great to see so many of you all working together, who's this Christian guy that releases stuff? I've never seen him here  Wink
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I haven't looked at TSIV's code. Isn't CryptoNight just a variation of x11 + scryptn? 20% gain is a good job. Now do another 20% Smiley

Anyway, I will start implementing some code soon. I will start with the 11 x'es. One by One.
sr. member
Activity: 330
Merit: 252
Just saw tsiv's parallelization of the second loop. Quite impressive.

...he's a cool guy.
Hey tsiv thanks a lot for your launch-config change for kopiemtu - that's really awesome!
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I'm pretty sure the output of the last algorithm is used as the input of the next one for X11, precisely so you can't do that.
Yes you can. If each thread is working on a different hash.
Oh, I get it. Clever.
You're not going to raise the hash to that of the slowest alg, though, because the GPU is partially occupied by the other hashes going on. However, I see no reason why that won't work.

Yes, the GPU is occupied, but on separate and non-overlapping memory blocks.

The slowest alg can be optimized...
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I'm pretty sure the output of the last algorithm is used as the input of the next one for X11, precisely so you can't do that.

Yes you can. If each thread is working on a different hash.

example
4 threads 4 hashes

HASH1: x1->x2->x3->
HASH2: x4->x5->x6->
HASH3: x7->x8->x9->
HASH4: x10->x11

Swap the 4 hashes

HASH4: x1->x2->x3->
HASH1: x4->x5->x6->
HASH2: x7->x8->x9->
HASH3: x10->x11

Swap the 4 hashes

HASH3: x1->x2->x3->
HASH4: x4->x5->x6->
HASH1: x7->x8->x9->
HASH2: x10->x11

Swap the 4 hashes

HASH2: x1->x2->x3->
HASH3: x4->x5->x6->
HASH4: x7->x8->x9->
HASH1: x10->x11

Complete
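The rotation above can be simulated in a few lines of Python, as a sketch only. Plain lists stand in for GPU threads, the split into four stage groups is copied from the example, and `run_pipeline` is a made-up name. It models the scheduling alone, not how the groups compete for the same cores and memory.

```python
STAGE_GROUPS = [
    ["x1", "x2", "x3"],
    ["x4", "x5", "x6"],
    ["x7", "x8", "x9"],
    ["x10", "x11"],
]


def run_pipeline(num_hashes):
    """Return (stages applied to each hash in order, number of rounds)."""
    history = {h: [] for h in range(num_hashes)}
    slots = [None] * len(STAGE_GROUPS)  # slots[i] = hash currently in group i
    next_hash = 0
    rounds = 0
    while True:
        # "Swap the hashes": everything moves one group on, the last one retires.
        slots = [None] + slots[:-1]
        if next_hash < num_hashes:
            slots[0] = next_hash
            next_hash += 1
        if all(h is None for h in slots):
            break
        # All occupied groups work side by side during this round.
        for group, h in zip(STAGE_GROUPS, slots):
            if h is not None:
                history[h].extend(group)
        rounds += 1
    return history, rounds


history, rounds = run_pipeline(4)
print(rounds, history[0])
```

Every hash still passes through x1 to x11 in order, so the chaining is respected. The example in the post shows the steady state, where all four groups are busy in every round; the sketch also includes the fill and drain rounds at the start and end.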

legendary
Activity: 3248
Merit: 1070
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single thread per hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations: 15 blocks of 32 threads works out just as well as the original 8x60 magic bullet for the 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking if there's something else to be improved. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info, you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip
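A quick way to get a list of configs worth trying under that rule of thumb, as a rough Python sketch. The helper names and the `max_multiple` cap are invented here, and the strings assume the threads-first ordering that "32x15" and "8x60" seem to imply. The 750 Ti has 5 SMMs; for other cards, read the count from the miner's start-up info.

```python
def candidate_launch_configs(sm_count, max_multiple=4,
                             thread_choices=(4, 8, 16, 32, 64)):
    """List (threads, blocks) pairs: blocks a multiple of the SM count,
    threads a power of two."""
    configs = []
    for multiple in range(1, max_multiple + 1):
        blocks = sm_count * multiple
        for threads in thread_choices:
            configs.append((threads, blocks))
    return configs


def format_config(threads, blocks):
    """Format a pair in the THREADSxBLOCKS shape assumed above."""
    return f"{threads}x{blocks}"


# 750 Ti: 5 SMMs, so block counts of 5, 10, 15 and 20 are generated.
for threads, blocks in candidate_launch_configs(5):
    print(format_config(threads, blocks))
```

This only generates candidates. Which one is actually fastest and stable still has to be benchmarked per card.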

Also, any chances for this code to get released already? Or are you competing against Wolf0 Cheesy
It works like a charm, 220H/s for GTX760, before it was 190. GTX750TIs seem unchanged.



I get 270H(peaks of 297H with -l 8x50)  with this release and a GTX 760 overclocked -->v0.15-rc1 ccminer-cryptonight_20140723

Thanks for that launch setting Cheesy 306H/s (MSI gaming, +180core, +500mem). Still have to test what's the most stable, but thanks for giving me a start Wink

Ooh damn, you've released that a looong time ago, tsiv. Should've noticed ^^"

EDIT: 320H/s with +222core, +666mem Tongue I'm waiting anxiously for a driver crash Wink

Fantastic Bombadill...im on +180 core +300 Memory
If you find any better launch configs please post it
I also asked in the other thread if there are binaries for wolf nvidia xmr miner

dunno but i can't oc my cards at least one of them keeps crashing if i do so, you changed the power limit in the bios?
hero member
Activity: 868
Merit: 1000

The target for the optimized GPU miner will be in the range 5-7.5 MHASH on the 750 Ti for x11 (darkcoin).


Wow. This is big if it's true. It will more than double the speed of pretty much every algo that ccminer is using. Very big improvement.

This is the kind of improvement that is groundbreaking!

cbuchner1 aka Nvidia Satoshi, any comment on sp_'s theory?
hero member
Activity: 672
Merit: 500
Banned: For Your Protection
So I take back that 64 bit builds are faster with x15. I just have so many different versions of ccminer at this point (~20 GB) that I ended up using a borked 32 bit version which used the GPU less and the CPU more than its 64 bit brother.

What we need is 32 bit addressing with 64 bit hashing. (Use 100% of the CUDA cores per cycle instead of 50%). This is not done by changing the build target to 64 bit. Each hash needs to be re-implemented in CUDA asm from scratch.
Compute 3.0 has a maximum of 63 32-bit registers per thread, compute 3.5 has 255 registers, etc. But there are no speedups when compiling ccminer for 5.0. This means that the generated code is suboptimal and needs to be fine-tuned (preferably in 100% CUDA asm).
Remove latency, remove registers, pipeline instructions, improve cache hits, etc.

Today each thread in ccminer computes 1 hash by doing a full run-through of all algorithms:

x1->x2->x3->x4->x5->x6->x7->x8->x9->x10->x11

This is suboptimal.

An FPGA implementation will run at the speed of the slowest x, thus eliminating the other x'es since they are done in parallel. We should do something similar.

The slowest algorithm for the 750 Ti is groestl. This algorithm runs at 7.5 MHASH on a single 750 Ti.

The target for the optimized GPU miner will be in the range 5-7.5 MHASH on the 750 Ti for x11 (darkcoin).


Now you're gettin' serious... and I like that!!  Grin
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
So I take back that 64 bit builds are faster with x15. I just have so many different versions of ccminer at this point (~20 GB) that I ended up using a borked 32 bit version which used the GPU less and the CPU more than its 64 bit brother.

What we need is 32 bit addressing with 64 bit hashing. (Use 100% of the CUDA cores per cycle instead of 50%). This is not done by changing the build target to 64 bit. Each hash needs to be re-implemented in CUDA asm from scratch.
Compute 3.0 has a maximum of 63 32-bit registers per thread, compute 3.5 has 255 registers, etc. But there are no speedups when compiling ccminer for 5.0. This means that the generated code is suboptimal and needs to be fine-tuned (preferably in 100% CUDA asm).
Remove latency, remove registers, pipeline instructions, improve cache hits, etc.

Today each thread in ccminer computes 1 hash by doing a full run-through of all algorithms:

x1->x2->x3->x4->x5->x6->x7->x8->x9->x10->x11

This is suboptimal.

An FPGA implementation will run at the speed of the slowest x, thus eliminating the other x'es since they are done in parallel. We should do something similar.

The slowest algorithm for the 750 Ti is groestl. This algorithm runs at 7.5 MHASH on a single 750 Ti.

The target for the optimized GPU miner will be in the range 5-7.5 MHASH on the 750 Ti for x11 (darkcoin).
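To put rough numbers on the "speed of the slowest x" argument, here is a small Python sketch of the two bounds. Both function names and the two example rates are hypothetical, apart from the 7.5 MHASH groestl figure quoted in this post. Each rate is taken to mean the throughput of one stage when it has the hardware to itself.

```python
def serial_rate(stage_rates):
    """Throughput when the same hardware runs every stage back to back."""
    return 1.0 / sum(1.0 / rate for rate in stage_rates)


def dedicated_pipeline_rate(stage_rates):
    """Throughput when every stage has its own hardware (FPGA-style)."""
    return min(stage_rates)


# Hypothetical two-stage chain, in MHASH: some faster stage plus groestl.
rates = [10.0, 7.5]
print("serial:   ", serial_rate(rates))
print("dedicated:", dedicated_pipeline_rate(rates))
```

The FPGA bound only applies when each stage has dedicated silicon. On a GPU the stages share the same cores, so rearranging the work by itself tends to land nearer the serial figure unless the individual stages also get faster, which is the objection raised elsewhere in the thread.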
newbie
Activity: 29
Merit: 0
I know this really doesn't have to do with ccminer discussions as much as nVidia mining/hashing in general, but I figured a good deal of people come to this thread to discuss the most profitable way to use nVidia power, so here goes: I've done a little trial run over the past day and a half of using Folding@Home to "mine" Curecoins, and I'm getting a pretty good payout (payout should be at about 30-35 or so Curecoins/day for my card's PPD; roughly 0.0035-0.0042 BTC/day at the current rate).
It certainly doesn't hurt that my GPU is folding proteins and helping researchers while using all that power, instead of doing random hashing, but even without those considerations, the profit margin speaks for itself (for reference, Bombadil's profit calculator shows my 3 most profitable mining options at: {TAG: VEIL | Name:Veilcoin | Algo: X13 | BTC/day: .00316889, TAG: PP9X11 | Name:Multipool X11 (PP) | Algo: X11 | BTC/day: .00302276, TAG: XMR | Name:Monero | Algo: CryptoNight | BTC/day: .00215359}).

I get about 250k PPD with my 780 Ti and i5-4670k, so folding might be more relevant to single-card/gaming rigs than to pure mining rigs, but looking into F@H couldn't hurt for other kinds of rigs (I'd be interested to see what a full 750 Ti rig with a mid-range processor could put out in terms of PPD)!
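For a quick sanity check of that comparison, here is a small Python sketch using only the figures quoted in this post. The variable names are made up, and the numbers are a snapshot of one card at one exchange rate, with no allowance for power cost or for the CPU's share of the PPD.

```python
# BTC/day range quoted for folding (30-35 Curecoins/day).
curecoin_btc_per_day = (0.0035, 0.0042)

# Top three options from the profit calculator, as quoted above.
mining_btc_per_day = {
    "VEIL (X13)": 0.00316889,
    "PP9X11 (X11)": 0.00302276,
    "XMR (CryptoNight)": 0.00215359,
}

best_name = max(mining_btc_per_day, key=mining_btc_per_day.get)
best = mining_btc_per_day[best_name]

# Relative advantage of folding over the best mining option, low and high end.
low_gain = curecoin_btc_per_day[0] / best - 1
high_gain = curecoin_btc_per_day[1] / best - 1

print(f"best mining option: {best_name} at {best:.8f} BTC/day")
print(f"folding advantage: {low_gain:.1%} to {high_gain:.1%}")
```

On these numbers folding comes out roughly 10% to 33% ahead of the best listed mining option.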
full member
Activity: 137
Merit: 100
Something I pretty much suspected but never bothered to check up on, run times for the various parts of the hash. Well, actually I did benchmark the core loops earlier and found the second one to be the biggest hog. Throw in the numbers for the prep and final phases and you get this:

Prepare: 0.001388 sec
Phase 1: 0.148383 sec
Phase 2: 1.414880 sec
Phase 3: 0.147834 sec
Final: 0.003590 sec

That's 32x15 hashes on a GTX 750 Ti. Can't tell how it works out on other cards since all I've got is a bunch of 750 Tis, but in this case optimizing the living fuck out of the prep and final parts all the way to instant completion with zero run time would bump up the total hashrate by 0.3%. Don't get me wrong, Wolf's doing nice work on unfucking stuff I pretty much just yanked out of cpuminer-multi and left as is. I just prefer to focus on shit that matters, again, no offense intended. Too bad I'm not even making a dent on that goddamn clusterfuck that is the second main loop Grin
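The 0.3% figure is easy to reproduce from the timings above. Here is a short Python check, assuming the five phases run one after another for a single batch of 32x15 = 480 hashes (the variable names are just for illustration).

```python
phase_seconds = {
    "Prepare": 0.001388,
    "Phase 1": 0.148383,
    "Phase 2": 1.414880,
    "Phase 3": 0.147834,
    "Final": 0.003590,
}
hashes = 32 * 15

total = sum(phase_seconds.values())
hashrate = hashes / total

# Pretend the prep and final phases took no time at all.
trimmed = total - phase_seconds["Prepare"] - phase_seconds["Final"]
gain = total / trimmed - 1

phase2_share = phase_seconds["Phase 2"] / total

print(f"total {total:.6f} s, about {hashrate:.1f} H/s")
print(f"gain from free prep/final: {gain:.2%}")
print(f"phase 2 share of run time: {phase2_share:.1%}")
```

Phase 2 accounts for a bit over 80% of the batch time, which is why it is the only part where optimization can move the hashrate much.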
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
So I take back that 64 bit builds are faster with x15. I just have so many different versions of ccminer at this point (~20 GB) that I ended up using a borked 32 bit version which used the GPU less and the CPU more than its 64 bit brother.

Anyway, here are the average hashrates of a 750 Ti and a 780 Ti rig per card, running a couple of minutes per algo with djm34's commit 58 compiled with CUDA 5.5:
Code:
           750 Ti          780 Ti
           x32     x64     x32     x64
x11        2.4     2.3     5.5     5.2
x13        2.0     1.8     4.0     3.8
x14        1.9     1.8     4.0     3.8
x15        1.7     1.6     3.7     3.5
x17        1.6     1.5     3.6     3.4
jackpot    5.0     5.1     11      11
qubit      4.0     3.7     8.9     8.1
nist5      7.7     7.7     16.3    16.4
fresh      3.1     2.8     7.2     6.2
groestl    7.3     7.3     14.5    14.7
Gigabyte cards, solo mining, very slight 60 MHz core overclock.
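A quick tally of the table, as a Python sketch with the numbers copied from above (`RESULTS` and `tally` are names made up for this example). It counts, per card, how many algos run faster, the same, or slower on the 64-bit build.

```python
# (x32, x64) pairs from the table: first the 750 Ti, then the 780 Ti.
RESULTS = {
    "x11":     ((2.4, 2.3), (5.5, 5.2)),
    "x13":     ((2.0, 1.8), (4.0, 3.8)),
    "x14":     ((1.9, 1.8), (4.0, 3.8)),
    "x15":     ((1.7, 1.6), (3.7, 3.5)),
    "x17":     ((1.6, 1.5), (3.6, 3.4)),
    "jackpot": ((5.0, 5.1), (11, 11)),
    "qubit":   ((4.0, 3.7), (8.9, 8.1)),
    "nist5":   ((7.7, 7.7), (16.3, 16.4)),
    "fresh":   ((3.1, 2.8), (7.2, 6.2)),
    "groestl": ((7.3, 7.3), (14.5, 14.7)),
}
CARDS = ("750 Ti", "780 Ti")


def tally(card_index):
    """Return (faster, equal, slower) counts for the 64-bit build."""
    faster = equal = slower = 0
    for per_card in RESULTS.values():
        x32, x64 = per_card[card_index]
        if x64 > x32:
            faster += 1
        elif x64 == x32:
            equal += 1
        else:
            slower += 1
    return faster, equal, slower


for index, card in enumerate(CARDS):
    print(card, tally(index))
```

With runs of only a couple of minutes per algo, differences of 0.1 are probably within noise, so treat the counts as a rough indication.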
full member
Activity: 137
Merit: 100
Note to self: __CUDA_ARCH__ is a fickle bitch.

I think I got the damn thing to use the new 4-way version of the phase 2 kernel for compute 3.0+ and the old one for 2.0. Since __CUDA_ARCH__ is apparently not defined when compiling the host code I didn't see much choice but to fire up the kernel with four threads per hash even if it's the single thread per hash compute 2.0 version. Dealt with it by making the single thread kernel do work only on the first of the four subthreads. Not very happy with it but it doesn't seem to matter that much performance-wise.

Bottom line: Fuck all difference on Maxwell, apparently some other compute 3.0+ cards like the new 4-way kernel and gain some performance, compute 2.0 should work like before.
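The workaround described above (always launch four threads per hash, and let the single-thread kernel do work only on the first subthread) can be pictured with a tiny Python simulation. This is not the CUDA code; `work_items` and the flat thread numbering are assumptions made for illustration.

```python
SUBTHREADS = 4  # the host always launches this many threads per hash


def work_items(num_hashes, four_way):
    """List the (hash_index, sub_index) pairs that actually do work."""
    active = []
    for tid in range(num_hashes * SUBTHREADS):
        hash_index, sub = divmod(tid, SUBTHREADS)
        # 4-way kernel: every subthread takes part.
        # Single-thread kernel: subthreads 1..3 return immediately.
        if four_way or sub == 0:
            active.append((hash_index, sub))
    return active


print(len(work_items(8, four_way=True)))   # all launched threads busy
print(work_items(8, four_way=False)[:3])   # only subthread 0 of each hash
```

The host side needs this kind of uniform launch because `__CUDA_ARCH__` is only defined while device code is being compiled, so host code cannot use it to pick a different thread count.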

I'll look into pulling some of Wolf's mods, also got some ideas for the phase 1&3 kernels but we'll see.

Win32 binary at https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15/ccminer-cryptonight_20140724.zip
hero member
Activity: 868
Merit: 1000
Hey guys,

Just wanted to give you an update.  I was compiling a new version of nvMiner with djm34's X17 compiled in.

He changed his github at about 3 hours into the compile. I killed the compile and I pulled down his latest and started compiling again.  Not sure what went wrong but my VS project file got messed up.  After 6 hours compiling with CUDA 5.5 I killed it.

I reverted back to my earlier version and I'm starting over to keep things clean on my side.  If this was an algo that was profitable I could have done a quick compile with just that algo to get it out the door for you guys to use, but since there really is no reason to mine this coin (maybe a clone in the future) I see no reason to do it "wrong".  So I'm going to take my time with it and get it right.

Right now PPL/X17 isn't worth mining so there is no reason to push a release other than to have a nvidia miner that supports one more algo. I'm not into "bragging rights on number of algos supported" if it doesn't make sense.

I'm also going to do some benchmarking of stuff sp_ has published in the last 5 to 10 pages.  Plus if TSIV doesn't release source code for his recent mods to CryptoNight, I'm going to also benchmark Wolf's changes one by one and include what I find to benefit us.

So long story short, I'm going to delay the next release until I'm ready and have done some testing.  The next nvMiner will have:
1) x17
2) any speed up proposed by sp_ or Wolf if they pan out for either Kepler or Maxwell.
3) If TSIV does a source release I'll include this also.
(should be 24 hours or less)

Also, since CUDA 6.5 is right around the corner, using 5.5 will basically mean being 2 versions behind.  There comes a point when it's not worth supporting older software and I think we are getting there.  The next nvMiner WILL SUPPORT 5.5 but I don't know about future releases.

CUDA 6.0 (3.0/3.5/5.0) compiles a lot faster than 5.5 does with both 3.0/3.5.  So in the future we will move to 6.0 for nvMiner when all algos have been tested on both Maxwell and Kepler and work ok.  Right now (or last test) I had a problem with FRESH on 6.0.

I delay release of all nvMiner releases until I test EVERY algo after each build.  Damn you djm34 because you are starting to make my testing time take longer. Smiley

So moral of this post is to start upgrading your Rigs to use the latest nvidia drivers.  For the last 6 months (at least) I've been running the latest beta drivers at all times with no problems at all on both Maxwell and Kepler GPUs. So I see no reason not to run the latest or beta releases (what I run).

I'll compile this next version as 5.5 and probably release the version after this as 6.0 first then 5.5 and the 3rd version from now might very well be 6.0 or greater only.

So I just wanted to give a heads-up on my plans to move to CUDA 6.0, which is a normal release and not beta.  If during testing I find this performs worse I'll let you guys know and will re-think this (we want highest hash rates of course).

So start thinking or doing upgrades to the latest nvidia driver releases.

Carlo


Thank you very much.

I think we should seriously think about Nvidia Miner Foundation and a foundation donation address.
full member
Activity: 168
Merit: 100
Hey guys,

Just wanted to give you an update.  I was compiling a new version of nvMiner with djm34's X17 compiled in.

He changed his github at about 3 hours into the compile. I killed the compile and I pulled down his latest and started compiling again.  Not sure what went wrong but my VS project file got messed up.  After 6 hours compiling with CUDA 5.5 I killed it.

I reverted back to my earlier version and I'm starting over to keep things clean on my side.  If this was an algo that was profitable I could have done a quick compile with just that algo to get it out the door for you guys to use, but since there really is no reason to mine this coin (maybe a clone in the future) I see no reason to do it "wrong".  So I'm going to take my time with it and get it right.

Right now PPL/X17 isn't worth mining so there is no reason to push a release other than to have a nvidia miner that supports one more algo. I'm not into "bragging rights on number of algos supported" if it doesn't make sense.

I'm also going to do some benchmarking of stuff sp_ has published in the last 5 to 10 pages.  Plus if TSIV doesn't release source code for his recent mods to CryptoNight, I'm going to also benchmark Wolf's changes one by one and include what I find to benefit us.

So long story short, I'm going to delay the next release until I'm ready and have done some testing.  The next nvMiner will have:
1) x17
2) any speed up proposed by sp_ or Wolf if they pan out for either Kepler or Maxwell.
3) If TSIV does a source release I'll include this also.
(should be 24 hours or less)

Also, since CUDA 6.5 is right around the corner, using 5.5 will basically mean being 2 versions behind.  There comes a point when it's not worth supporting older software and I think we are getting there.  The next nvMiner WILL SUPPORT 5.5 but I don't know about future releases.

CUDA 6.0 (3.0/3.5/5.0) compiles a lot faster than 5.5 does with both 3.0/3.5.  So in the future we will move to 6.0 for nvMiner when all algos have been tested on both Maxwell and Kepler and work ok.  Right now (or last test) I had a problem with FRESH on 6.0.

I delay release of all nvMiner releases until I test EVERY algo after each build.  Damn you djm34 because you are starting to make my testing time take longer. Smiley

So moral of this post is to start upgrading your Rigs to use the latest nvidia drivers.  For the last 6 months (at least) I've been running the latest beta drivers at all times with no problems at all on both Maxwell and Kepler GPUs. So I see no reason not to run the latest or beta releases (what I run).

I'll compile this next version as 5.5 and probably release the version after this as 6.0 first then 5.5 and the 3rd version from now might very well be 6.0 or greater only.

So I just wanted to give a heads-up on my plans to move to CUDA 6.0, which is a normal release and not beta.  If during testing I find this performs worse I'll let you guys know and will re-think this (we want highest hash rates of course).

So start thinking or doing upgrades to the latest nvidia driver releases.

Carlo
full member
Activity: 161
Merit: 100
Not sure if this is what you meant, djm, but you can add the files you want git to ignore to the .gitignore file in your repo (if you have one; it's pretty easy to set up) and you won't have to worry about that.
member
Activity: 112
Merit: 10
I'm still working on stabilizing the rest of my optimizations, so I'm not creating binaries yet.

Thanks a ton for your efforts

KEEP US UPDATED  Grin Grin

EDIT: Tsiv i sent you a tiny btc donation :PPP
member
Activity: 112
Merit: 10
Welp. Managed to split the most offensive part of the kernel into four parallel threads per hash, result is spectacularly unimpressive. The best I've come up with breaks even with the current single thread per hash implementation. Well, almost. It's actually a percent slower AND loses compute 2.0 compatibility due to using shuffle. On the other hand, it performs a lot more reasonably with various launch configurations: 15 blocks of 32 threads works out just as well as the original 8x60 magic bullet for the 750 Ti.

At this point I'm starting to think I'll just forget about that part and start looking if there's something else to be improved. I'm still curious as to how it runs on other hardware, so if a couple of gents on Win boxes with something other than a 750 Ti in would be willing to take it for a spin, I'd appreciate it. I've added the number of SMX/SMM/Whateverthingmabobs to the miner thread start-up info, you'll probably find your card performing best when the block count is a multiple of the SMX count and the number of threads is a power of 2. 4/8/16/32/64 are the best bets.

https://github.com/tsiv/ccminer-cryptonight/releases/download/v0.15-rc1/ccminer-cryptonight_20140723_exp.zip

Also, any chances for this code to get released already? Or are you competing against Wolf0 Cheesy
It works like a charm, 220H/s for GTX760, before it was 190. GTX750TIs seem unchanged.



I get 270H(peaks of 297H with -l 8x50)  with this release and a GTX 760 overclocked -->v0.15-rc1 ccminer-cryptonight_20140723

Thanks for that launch setting Cheesy 306H/s (MSI gaming, +180core, +500mem). Still have to test what's the most stable, but thanks for giving me a start Wink

Ooh damn, you've released that a looong time ago, tsiv. Should've noticed ^^"

EDIT: 320H/s with +222core, +666mem Tongue I'm waiting anxiously for a driver crash Wink

Fantastic Bombadill...im on +180 core +300 Memory
If you find any better launch configs please post it
I also asked in the other thread if there are binaries for wolf nvidia xmr miner