Topic: NVIDIA Kepler (K20) from 134MHash/s to 330MHash/s with CUDA - page 10. (Read 73331 times)

hero member
Activity: 507
Merit: 500
Finally CUDA is getting a bit more attention from developers.

I am currently doing similar optimization work for the scrypt hashing used in Litecoin. About doubled the performance I am getting from most of my cards compared to OpenCL miners. This still sucks big time when compared to ATI cards, but it sucks a bit less than before.

With the scrypt hashing it appears much more difficult to lower the kernel's register count, as the required Salsa20/8 rounds are fairly complex beasts, also the memory-hard part of the algorithm really bangs on the memory controller.

Watch out for potential Windows binary releases in the next days. I will post into the alt cryptocurrency forum.


for ltc or btc?
hero member
Activity: 756
Merit: 502
Finally CUDA is getting a bit more attention from developers.

I am currently doing similar optimization work for the scrypt hashing used in Litecoin. About doubled the performance I am getting from most of my cards compared to OpenCL miners. This still sucks big time when compared to ATI cards, but it sucks a bit less than before.

With the scrypt hashing it appears much more difficult to lower the kernel's register count, as the required Salsa20/8 rounds are fairly complex beasts, also the memory-hard part of the algorithm really bangs on the memory controller.
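To give an idea of why the register count is hard to push down, here is a rough sketch of the Salsa20/8 core in plain Python, following the definition used by scrypt (RFC 7914). This is not the CUDA kernel; the names `rotl32`, `quarter_round` and `salsa20_8` are made up for illustration. The point is that 16 state words plus the 16 input words for the final add all have to stay live, which is presumably a large part of the register pressure in a GPU kernel.

```python
MASK = 0xFFFFFFFF  # keep every word at 32 bits

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(x, a, b, c, d):
    """One Salsa20 quarter round on four words of the state, in place."""
    x[b] ^= rotl32((x[a] + x[d]) & MASK, 7)
    x[c] ^= rotl32((x[b] + x[a]) & MASK, 9)
    x[d] ^= rotl32((x[c] + x[b]) & MASK, 13)
    x[a] ^= rotl32((x[d] + x[c]) & MASK, 18)

# Word indices touched by the column round and by the row round
COLUMNS = ((0, 4, 8, 12), (5, 9, 13, 1), (10, 14, 2, 6), (15, 3, 7, 11))
ROWS = ((0, 1, 2, 3), (5, 6, 7, 4), (10, 11, 8, 9), (15, 12, 13, 14))

def salsa20_8(block):
    """Salsa20/8 core: 16 input words -> 16 output words."""
    x = list(block)
    for _ in range(4):  # 4 double rounds = 8 rounds
        for idx in COLUMNS + ROWS:
            quarter_round(x, *idx)
    # Feed-forward: add the original input back in, word by word
    return [(xi + bi) & MASK for xi, bi in zip(x, block)]
```

Each double round is 8 quarter rounds, so one call does 32 quarter rounds with 4 rotates each, and scrypt calls this core many times per hash.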

Watch out for potential Windows binary releases in the next days. I will post into the alt cryptocurrency forum.
full member
Activity: 167
Merit: 100
Damn exciting, I have a 460 that needs a workout!
newbie
Activity: 49
Merit: 0
Bad news!!!

With the K20 we can theoretically get only 372 MHash/s.

Per clock cycle and multiprocessor we can run 120 bit shifts and 160 xor, and, add, rotate operations. The asm code inside the loop has 3725 and, or, ... operations and 162 bit shifts. The K20 runs at 705 MHz and has 13 multiprocessors: 705*13/(3725/160+162/120)=372

With the GTX Titan we can (theoretically) get 475 MHash/s, because the clock rate is 837 MHz and it has 14 multiprocessors.

All these calculations leave out the calculation overhead before the loop, and I am not really sure whether 160 or 120 rotates can be calculated per clock. At the moment I only get a real 330 MHash/s :-(
Keep in mind that this optimization does not give a factor of two on old GPUs, only on sm_35 (the new Kepler GPUs).
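
For anyone who wants to plug in other cards, the estimate in this post can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function name `peak_mhash` and its parameters are made up for illustration, and the per-clock throughput figures (160 logic/add operations and 120 shifts per multiprocessor) are the assumptions stated in this post, not verified numbers.

```python
def peak_mhash(clock_mhz, multiprocessors,
               alu_ops=3725, shift_ops=162,
               alu_per_clock=160, shift_per_clock=120):
    """Upper bound in MHash/s, ignoring all work done before the loop."""
    # Clock cycles one multiprocessor needs for one pass through the loop
    cycles_per_hash = alu_ops / alu_per_clock + shift_ops / shift_per_clock
    return clock_mhz * multiprocessors / cycles_per_hash

print(int(peak_mhash(705, 13)))  # K20: prints 372
print(int(peak_mhash(837, 14)))  # GTX Titan: prints 475
```

The results are truncated to whole MHash/s to match the figures in this post.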
hero member
Activity: 507
Merit: 500
I will give  .25BTC for a Windows version to be created and maintained.... I have 37 nvidia cards that are BEGGING TO BE WORKED

I'll compile it for Windows, if it compiles for Windows.

EDIT: Ugh, cmake. I have no idea how to cross-compile this.

Same issue I had
member
Activity: 81
Merit: 1002
It was only the wind.
I will give  .25BTC for a Windows version to be created and maintained.... I have 37 nvidia cards that are BEGGING TO BE WORKED

I'll compile it for Windows, if it compiles for Windows.

EDIT: Ugh, cmake. I have no idea how to cross-compile this.
legendary
Activity: 1493
Merit: 1003
Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?

The shift functions are only available on Kepler based GPU's, -but- the other optimizations he has worked in there could give non-Kepler based cards the ~200Mhash performance.

I'd love to test this on my rig, but I don't know if that will ever be possible: it has an onboard GeForce 8200 that used to give ~20Mh/s, but I have never been able to use it or even see it since plugging in an ATI Radeon 5550.

Is there any way I could "see" this device listed in lspci and get it mining with this patch?
hero member
Activity: 507
Merit: 500
I will give  .25BTC for a Windows version to be created and maintained.... I have 37 nvidia cards that are BEGGING TO BE WORKED
sr. member
Activity: 415
Merit: 250
Money is the root of all evil.

I would appreciate it if you could compile the Windows binaries.
Most nvidia users are on Windows due to, well, DirectX.
hero member
Activity: 914
Merit: 500
Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?

The shift functions are only available on Kepler based GPU's, -but- the other optimizations he has worked in there could give non-Kepler based cards the ~200Mhash performance.
sr. member
Activity: 359
Merit: 250
Soo... still about the same as a $75 5830. GJ!
True, but this is pretty big for people who already own Nvidia cards and can now mine more efficiently with them.
legendary
Activity: 952
Merit: 1000
Soo... still about the same as a $75 5830. GJ!
member
Activity: 112
Merit: 10
Not all performance came from the shift function. Most performance came from reducing registers per thread.

Based on that, do you believe your changes should provide benefit to all nVidia hardware, not just Kepler-based boards?
newbie
Activity: 49
Merit: 0
Not all performance came from the shift function. Most performance came from reducing registers per thread.

Before I started, one thread needed 114 32-bit registers (134 MHash/s).
After changing the code to use the shift operation, it needed 95 32-bit registers (~200 MHash/s).
And after adding shared memory we only need 46 registers. That means we can run 5 blocks of 256 threads per streaming multiprocessor, and we get 330 MHash/s.
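
The block counts follow from the size of the register file. Below is a rough Python sketch, assuming 65536 32-bit registers per multiprocessor on sm_35 and assuming registers are the only limit (it ignores allocation granularity, shared memory, and the maximum thread count). The name `blocks_per_sm` is made up for illustration.

```python
REGISTERS_PER_SM = 65536  # assumed: 32-bit registers per multiprocessor on sm_35

def blocks_per_sm(regs_per_thread, threads_per_block=256,
                  registers_per_sm=REGISTERS_PER_SM):
    """Resident blocks per multiprocessor if registers are the only limit."""
    return registers_per_sm // (regs_per_thread * threads_per_block)

for regs in (114, 95, 46):
    print(regs, blocks_per_sm(regs))
# prints:
# 114 2
# 95 2
# 46 5
```

In this simple model the step from 114 to 95 registers does not add a resident block, so that gain would have to come from the shift operation itself, while the step to 46 registers is what allows 5 blocks.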

At the moment I am working on a version with over 400 MHash/s, but I have the problem that the mining pool does not count all my solutions.

If I have time, I will see whether I can create a Windows version.
 
legendary
Activity: 1344
Merit: 1004
Please compile and build for windows. No clue what to do with source. I can only click stuff.
newbie
Activity: 49
Merit: 0
@relm9: No, I have no Windows version, I only program on Linux. I came out of my bitcoin hibernation to work on the NVIDIA GPU bitcoin mining process. I have no Windows PC with a K20 or Titan, and therefore I can't test this with a Windows miner.

@philips: Thanks for the last link, I will look into whether I can get some more performance out of it.

hero member
Activity: 700
Merit: 500
Wow, can you imagine if this were distributed a year ago? 

Now I just have to figure out if I can run this inside 50miner, or if I have to use cgminer...

Maybe it's not too late for Nvidia cards though... there is also this guy:
https://bitcointalksearch.org/topic/doubling-litecoin-mining-efficiency-on-nvidia-160057
hero member
Activity: 840
Merit: 1000
I've got a GTX Titan I could test this on - though, is there a version of this that compiles easily on Windows? The one the OP linked is Linux only.
sr. member
Activity: 367
Merit: 250
Find me at Bitrated
Wow, can you imagine if this were distributed a year ago? 

Now I just have to figure out if I can run this inside 50miner, or if I have to use cgminer...