
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1229. (Read 2347641 times)

legendary
Activity: 2716
Merit: 1116
Watercooled 980 max hash X11:


legendary
Activity: 1400
Merit: 1050
I think djm34 has made a cuda implementation of neoscrypt.
no, I haven't tried yet
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
A tip for Windows users using Chrome:

Type chrome://flags/ and disable WebGL.

Chrome will be faster if you mine on the GPU (and the miner too) ;)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I think djm34 has made a cuda implementation of neoscrypt.
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
While almost everything is about the X-series nowadays, I personally don't see them being viable long term. I never liked them because of some interesting things going on in the background: the massive 3.5 GH/s "rig" rentals, the occasional new X11 coins being dumped way below the cost of GPU production, and the upcoming ASICs with cleverhash, however far off that might be.

Anyway, my point is that I'd also be more interested in neoscrypt, and I'm sure we could crowdfund it.
sr. member
Activity: 285
Merit: 250
Is adding neoscrypt to this possible? Jealous of those AMD guys mining away at FTC. Cudaminer can't do neoscrypt, and cgminer + NVIDIA is awful.
I guess a better question is: is anyone working on cudaminer, just like you gentlemen are taking ccminer to new heights?

Vertcoin is coming soon to Lyra2 too, and NVIDIA peeps wanna be ready!

"Vertans,

Firstly I’d like to introduce myself. Many of you will know me from Reddit as the developer of Greenpool. I have joined the development team recently to aid with the implementation of Lyra2RE and I plan to stay for the foreseeable future to see Vertcoin grow and ultimately succeed in this oversaturated crypto coin market.

Currently, development is well under way for the fork to Lyra2RE. The algorithms that Lyra2 will be chained with to form Lyra2RE have been decided and many essential pieces of software have been ported. A Python module for Lyra2RE has been created which means that Stratum-Mining and P2Pool are ready for the fork. Sgminer has also been modified to make it ready for the completion of a working OpenCL kernel. At this time, wallet integration needs to be completed as well as OpenCL and CUDA kernels created, the latter two of which we are looking for members of the community who are experienced with these systems to help us complete this task. Contact [email protected] if you have OpenCL or CUDA experience and would like to help out.

Regards,

Jamesl22"
hero member
Activity: 789
Merit: 501
Git clone from yesterday, on Linux Kopiemtu 2 (CUDA 6.5).
The previous version of ccminer-tpruvot ran well, but the one with sp_'s optimizations does not work on my rig :(
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
Seems to run fine here. Which OS / release?

Code:
[2014-11-07 07:39:08] qrk.suprnova.cc:6666 quark block 1428192
[2014-11-07 07:39:08] GPU #1: GeForce GTX 750 Ti, 5127 kH/s
[2014-11-07 07:39:12] GPU #0: GeForce GTX 750 Ti, 4478 kH/s
[2014-11-07 07:39:12] accepted: 558/560 (99.64%), 9604 khash/s yay!!!
[2014-11-07 07:39:15] GPU #0: GeForce GTX 750 Ti, 4363 kH/s
[2014-11-07 07:39:15] accepted: 559/561 (99.64%), 9489 khash/s yay!!!
[2014-11-07 07:39:19] qrk.suprnova.cc:6666 quark block 1428195
hero member
Activity: 789
Merit: 501
With ccminer-tpruvot I get this error at runtime: "double free or corruption (out): 0x00007fecf00096c0 ***"

It happens when running the Quark algo; benchmark mode runs fine.

Same with ccminer-sp.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
The SIMD change should give a couple of hundred kH/s on the 970. At least it did for me when I tested it. Strange that you don't see any difference.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
aes.cu also gives small improvements on echo (it was the remaining diff of 0.2ms), but not on shavite. Seems OK.
For groestl, I just didn't pick it (there is a real improvement too (0.5ms), but .... I don't understand it yet ;)


int andmask1 = ((threadIdx.x & 0x03) - 1) >> 16;

This creates a mask of either 0x0000ffff or 0.

The code should create an andmask of 0x0000ffff only when ((threadIdx.x & 0x03) == 0).

For the given inputs:

((3 & 0x03) - 1) >> 16 = 0
((2 & 0x03) - 1) >> 16 = 0
((1 & 0x03) - 1) >> 16 = 0
((0 & 0x03) - 1) >> 16 = 0x0000ffff

The purpose of the shift is to remove the low bits in the andmask when the number is positive, and to set the 16 low bits to 1 only when (threadIdx.x & 0x03) == 0.

You also need to know that:

number ^ 0 = number
number & 0 = 0
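
To check the mask values above without a GPU, here is a minimal host-side sketch in plain C. It assumes 32-bit unsigned arithmetic, as for CUDA's threadIdx.x. The names `tid`, `andmask_for_thread` and `xor_low16_on_lane0` are illustrative and not taken from the ccminer source, and the XOR select only shows how the two identities would combine with such a mask.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side emulation of:
 *     int andmask1 = ((threadIdx.x & 0x03) - 1) >> 16;
 * `tid` stands in for threadIdx.x (hypothetical name). The operand is
 * unsigned, so 0 - 1 wraps to 0xffffffff and the shift is logical. */
static uint32_t andmask_for_thread(uint32_t tid)
{
    return (uint32_t)((tid & 0x03u) - 1u) >> 16;
}

/* Branch-free select built on the identities number ^ 0 = number and
 * number & 0 = 0: the low 16 bits of `key` are XORed into `value` only
 * for threads where (tid & 0x03) == 0; other threads get `value` back. */
static uint32_t xor_low16_on_lane0(uint32_t value, uint32_t key, uint32_t tid)
{
    return value ^ (key & andmask_for_thread(tid));
}
```

With this, `assert(andmask_for_thread(0) == 0x0000ffff)` holds and the mask is 0 for lanes 1 to 3. With a signed operand the right shift would typically be arithmetic and lane 0 would get 0xffffffff instead, so the unsigned type matters here.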
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

On Windows it doesn't work well, but I prefer Linux for that ;) much faster to compile
On Windows you have Nsight, which also gives you the timing of the different algos (an average is missing, however)

Yes, I use it for another reason: there are additional graphs where you can see what is "limiting" performance (often sign conversions).
legendary
Activity: 1400
Merit: 1050
make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

On Windows it doesn't work well, but I prefer Linux for that ;) much faster to compile

aes.cu also gives small improvements on echo (it was the remaining diff of 0.2ms), but not on shavite. Seems OK.

For groestl, I just didn't pick it (there is a real improvement too (0.5ms), but .... I don't understand it yet ;)
On Windows you have Nsight, which also gives you the timing of the different algos (an average is missing, however)
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

On Windows it doesn't work well, but I prefer Linux for that ;) much faster to compile

aes.cu also gives small improvements on echo (it was the remaining diff of 0.2ms), but not on shavite. Seems OK.

For groestl, I just didn't pick it (there is a real improvement too (0.5ms), but .... I don't understand it yet ;)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
How do you profile? When I profile, I comment out all the other hash functions in the x11 source file, and the test nonce code.

In the groestl function, if you deliver the same nonce to every call, it will not improve because the branch prediction in the if statement will always hit. I think the same goes for SIMD. Nvidia just released a new version of the driver (.60). Did you upgrade?

Or it could be alignment issues: I may get improved speed in my tests because buffers get aligned on other boundaries when I remove hashing functions, so performance is lost or gained. If so, we have a lot of possible improvements to check.
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
I only own some 750 Ti ;) Yes, our work goes in the same direction. I rebuilt your sources to profile (make clean).

Code:
Time(%)      Time     Calls       Avg       Min       Max  Name
 20.67%  3.03880s        77  39.465ms  39.221ms  44.059ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.96%  2.78695s        78  35.730ms  35.599ms  39.836ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.89%  1.89555s        78  24.302ms  24.154ms  27.170ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.15%  1.63887s        78  21.011ms  20.944ms  23.439ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.18%  1.05509s        78  13.527ms  13.491ms  15.093ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.30%  779.82ms        78  9.9977ms  9.9371ms  11.193ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.00%  735.23ms        78  9.4260ms  9.2738ms  10.456ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.08%  452.12ms        78  5.7964ms  5.7532ms  5.8691ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  3.06%  449.25ms        78  5.7597ms  5.7311ms  6.3923ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.80%  412.31ms        78  5.2861ms  5.1883ms  5.3934ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.74%  402.88ms        78  5.1652ms  5.1411ms  5.7546ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.69%  394.82ms        78  5.0618ms  5.0164ms  5.6468ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.58%  379.45ms        78  4.8647ms  4.8361ms  5.4086ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.62%  237.61ms        78  3.0462ms  3.0174ms  3.4082ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28%  41.415ms        77  537.85us  533.88us  589.98us  cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

Here are my results with a part of your commit (pushed on github)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I think you need the AES change to reduce the number of register spills to memory in the ECHO hash. My binaries (based on 1.4.5) were compiled with 64 registers, not 80. This is why some of the optimizations don't work without changing registers and launch bounds. I will continue to verify and check in more improvements, but I need to test to make sure that they don't break the hash chain.

Which GPU model are you testing on?

Also note that groestl_functions_quad.cu, cuda_x11_aes.cu and simd_functions.cu are not compiled on their own (they are excluded from the build). When you change these files, you need to make a change in the files that include them in order to rebuild them.

Strange that you get no improvements in SIMD and groestl. Are you sure you did a clean and then a full build?

Looks like groestl_functions_quad.cu, cuda_x11_aes.cu and simd_functions.cu have not been rebuilt.
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
profile result of this commit :

Code:
Time(%)      Time     Calls       Avg       Min       Max  Name
 20.82%  2.96411s        74  40.056ms  39.851ms  44.742ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.80%  2.67690s        75  35.692ms  35.574ms  39.812ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.79%  1.82197s        75  24.293ms  24.166ms  27.024ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.06%  1.57506s        75  21.001ms  20.943ms  23.427ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.63%  1.08677s        75  14.490ms  14.377ms  16.185ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.26%  749.42ms        75  9.9923ms  9.9294ms  11.130ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  4.96%  706.48ms        75  9.4197ms  9.2415ms  10.595ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.05%  434.68ms        75  5.7958ms  5.7581ms  5.8440ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  3.01%  427.94ms        75  5.7058ms  5.6788ms  6.3479ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.78%  395.99ms        75  5.2799ms  5.1751ms  5.3694ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.72%  387.33ms        75  5.1644ms  5.1375ms  5.7555ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.67%  380.04ms        75  5.0671ms  5.0126ms  5.6239ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.56%  364.92ms        75  4.8655ms  4.8348ms  5.4331ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.60%  228.42ms        75  3.0456ms  3.0165ms  3.3860ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28%  39.825ms        74  538.17us  535.38us  591.64us  cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

Code:
Time(%)      Time     Calls       Avg       Min       Max  Name
 20.69%  2.94306s        75  39.241ms  39.084ms  43.784ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.78%  2.67148s        76  35.151ms  35.039ms  39.165ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.98%  1.84595s        76  24.289ms  24.188ms  27.135ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.07%  1.57478s        75  20.997ms  20.934ms  23.428ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.22%  1.02751s        76  13.520ms  13.483ms  15.065ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.34%  759.31ms        76  9.9909ms  9.9291ms  11.134ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.04%  716.24ms        76  9.4243ms  9.2858ms  10.407ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.08%  438.16ms        76  5.7653ms  5.7369ms  6.4045ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.06%  434.61ms        75  5.7948ms  5.7602ms  5.8804ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.78%  395.22ms        75  5.2696ms  5.1818ms  5.4436ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.76%  392.47ms        76  5.1640ms  5.1345ms  5.7447ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.70%  384.16ms        76  5.0548ms  5.0094ms  5.6282ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.60%  369.59ms        76  4.8630ms  4.8242ms  5.4206ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.61%  228.50ms        75  3.0466ms  3.0221ms  3.3762ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28%  40.323ms        75  537.64us  534.49us  589.79us  cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

This time indeed, i see real improvements.

For information, my builds were faster because I was not using multi-arch support for binaries (I guess the current arch check takes some time).

I will pick the echo and cube changes. I need to analyse the groestl one (looks weird), and the aes one could break compatibility with other archs and brings no real improvement.
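
For anyone who wants to put numbers on the difference between the two tables, here is a small sketch in plain C. It assumes the first table is the baseline and the second one is the build with the commit applied (the post does not say so explicitly). The Avg values are copied from the tables; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Average kernel times in milliseconds, copied from the two nvprof tables. */
struct kernel_avg {
    const char *name;
    double before_ms; /* first table (assumed baseline) */
    double after_ms;  /* second table (assumed to include the commit) */
};

static const struct kernel_avg kernels[] = {
    { "x11_echo512_gpu_hash_64",           40.056, 39.241 },
    { "quark_groestl512_gpu_hash_64_quad", 35.692, 35.151 },
    { "x11_shavite512_gpu_hash_64",        24.293, 24.289 },
    { "x11_cubehash512_gpu_hash_64",       14.490, 13.520 },
};

/* Milliseconds saved per call for one kernel. */
static double saved_ms(const struct kernel_avg *k)
{
    return k->before_ms - k->after_ms;
}

/* Saving relative to the baseline time, in percent. */
static double saved_percent(const struct kernel_avg *k)
{
    return 100.0 * saved_ms(k) / k->before_ms;
}

/* Sum of the per-call savings over the kernels listed above. */
static double total_saved_ms(void)
{
    double total = 0.0;
    for (size_t i = 0; i < sizeof kernels / sizeof kernels[0]; i++)
        total += saved_ms(&kernels[i]);
    return total;
}
```

Under that assumption echo and groestl each gain about 2% or less, shavite is unchanged, and cubehash gains the most in relative terms. The groestl saving of roughly 0.5ms per call matches the figure mentioned earlier in the thread.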
hero member
Activity: 789
Merit: 501
SP_, your work is really good, and based on tpruvot's awesome ccminer (I love the color in the terminal :P)
I can't wait to test that after work on my Linux rig. I'll send you feedback and some BTC.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
More commits are coming. The first 2 beta exes I sent out didn't contain any of these improvements.