CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1229.

jpouza

legendary

Activity: 2716

Merit: 1116

Watercooled 980 max hash X11:

[image]

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on November 07, 2014, 04:32:40 AM

I think djm34 has made a cuda implementation of neoscrypt.

no, I haven't tried yet

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

A tip for Windows users using Chrome:

type chrome://flags/ and disable WebGL

Chrome will be faster if you mine on the GPU (and the miner too) ;)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

I think djm34 has made a cuda implementation of neoscrypt.

bathrobehero

legendary

Activity: 2002

Merit: 1051

ICO? Not even once.

While almost everything is about the X-series nowadays, I personally don't see those algos being viable long term. I never liked them because of some interesting things going on in the background: the massive 3.5 GH/s "rig" rentals, the occasional new X11 coins being dumped way below the cost of GPU production, and the upcoming ASICs with cleverhash, however far off that might be.

Anyway, my point is that I'd also be more interested in neoscrypt, and I'm sure we could crowdfund it.

SS2006

sr. member

Activity: 285

Merit: 250

Is adding neoscrypt to this possible? Jealous of those AMD guys mining away at FTC. Cudaminer can't do neoscrypt, and cgminer + NVIDIA is awful.
I guess a better question is: is anyone working on cudaminer the way you gentlemen are taking ccminer to new heights?

Vertcoin is moving to Lyra2 soon too, and NVIDIA peeps wanna be ready!

"Vertans,

Firstly I’d like to introduce myself. Many of you will know me from Reddit as the developer of Greenpool. I have joined the development team recently to aid with the implementation of Lyra2RE and I plan to stay for the foreseeable future to see Vertcoin grow and ultimately succeed in this oversaturated crypto coin market.

Currently, development is well under way for the fork to Lyra2RE. The algorithms that Lyra2 will be chained with to form Lyra2RE have been decided and many essential pieces of software have been ported. A Python module for Lyra2RE has been created which means that Stratum-Mining and P2Pool are ready for the fork. Sgminer has also been modified to make it ready for the completion of a working OpenCL kernel. At this time, wallet integration needs to be completed as well as OpenCL and CUDA kernels created, the latter two of which we are looking for members of the community who are experienced with these systems to help us complete this task. Contact [email protected] if you have OpenCL or CUDA experience and would like to help out.

Regards,

Jamesl22"

th00ber

hero member

Activity: 789

Merit: 501

git clone from yesterday, on Linux Kopiemtu 2 (CUDA 6.5).
The previous version of ccminer-tpruvot ran well, but the one with sp_'s optimizations does not work on my rig :(

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

Seems to run fine here. Which OS / release?

Code:

[2014-11-07 07:39:08] qrk.suprnova.cc:6666 quark block 1428192
[2014-11-07 07:39:08] GPU #1: GeForce GTX 750 Ti, 5127 kH/s
[2014-11-07 07:39:12] GPU #0: GeForce GTX 750 Ti, 4478 kH/s
[2014-11-07 07:39:12] accepted: 558/560 (99.64%), 9604 khash/s yay!!!
[2014-11-07 07:39:15] GPU #0: GeForce GTX 750 Ti, 4363 kH/s
[2014-11-07 07:39:15] accepted: 559/561 (99.64%), 9489 khash/s yay!!!
[2014-11-07 07:39:19] qrk.suprnova.cc:6666 quark block 1428195

th00ber

hero member

Activity: 789

Merit: 501

With ccminer-tpruvot I get this error at runtime: "double free or corruption (out): 0x00007fecf00096c0 ***"

This happens when running the Quark algo; benchmark mode runs fine.

Same with ccminer-sp.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

The SIMD change should give a couple hundred kH/s on the 970. At least it did for me when I tested it. Strange that you don't see any difference.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: Epsylon3 on November 06, 2014, 09:06:30 AM

aes.cu also gives small improvements on echo (was the remaining diff of 0.2ms), but not on shavite. Seems ok
for groestl, i just didn't pick it (there is a real improvement too (0.5ms), but .... i don't understand it yet) ;)

int andmask1 = ((threadIdx.x & 0x03) - 1) >> 16;

This creates a mask of either 0x0000ffff or 0.

The code should create an andmask of 0x0000ffff only when (threadIdx.x & 0x03) == 0.

For the given inputs:

((3 & 0x03) - 1) >> 16 = 0
((2 & 0x03) - 1) >> 16 = 0
((1 & 0x03) - 1) >> 16 = 0
((0 & 0x03) - 1) >> 16 = 0x0000ffff

The purpose of the shift is to clear the low bits of the andmask when the number is non-negative, and to set the 16 low bits to 1 only when (threadIdx.x & 0x03) == 0.

You also need to know that:

number ^ 0 = number
number & 0 = 0
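
To make the mask arithmetic above concrete, here is a minimal host-side sketch in plain C. It assumes the lane index is an unsigned 32-bit value, as threadIdx.x is in CUDA. The function names (andmask_for_thread, keep_low16_on_lane0, keep_low16_on_lane0_branchy) are made up for illustration and are not from the ccminer sources. The unsigned type matters: 0 - 1 wraps to 0xffffffff and the logical shift leaves 0x0000ffff, whereas with a signed operand the shift of -1 would typically stay at -1.

```c
#include <stdint.h>

/* Host-side emulation of the device expression
 *     int andmask1 = ((threadIdx.x & 0x03) - 1) >> 16;
 * threadIdx.x is an unsigned int in CUDA, so `tid` is unsigned here too.
 * The name andmask_for_thread is hypothetical, not taken from ccminer. */
int32_t andmask_for_thread(uint32_t tid)
{
    /* Lane 0 of each quad: 0u - 1u wraps to 0xffffffff, >> 16 leaves 0x0000ffff.
     * Lanes 1..3: the value before the shift is 0, 1 or 2, and >> 16 clears it. */
    return (int32_t)(((tid & 0x03u) - 1u) >> 16);
}

/* Branch-free select: keeps the low 16 bits of `value` on lane 0 of a quad
 * and yields 0 on the other three lanes (value & 0 == 0). */
uint32_t keep_low16_on_lane0(uint32_t tid, uint32_t value)
{
    return value & (uint32_t)andmask_for_thread(tid);
}

/* Reference version with an explicit branch, for comparison only. */
uint32_t keep_low16_on_lane0_branchy(uint32_t tid, uint32_t value)
{
    if ((tid & 0x03u) == 0)
        return value & 0x0000ffffu;
    return 0;
}
```

Calling keep_low16_on_lane0(tid, v) for tid = 0..3 should match the branchy version on every lane, which is the point of the trick: the same result without a divergent if.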

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

Quote from: djm34 on November 06, 2014, 09:19:20 AM

Quote from: Epsylon3 on November 06, 2014, 09:06:30 AM

make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

on windows it doesn't work well, but i prefer linux to do that ;)

much faster to compile

on windows, you have nsight, which also gives you the timing of the different algos (an average is missing, however)

Yes, i use it for another reason: there are additional graphs where you can see what is "limiting" the perf (often sign conversions)

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: Epsylon3 on November 06, 2014, 09:06:30 AM

make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

on windows it doesn't work well, but i prefer linux to do that ;)

much faster to compile

aes.cu also gives small improvements on echo (was the remaining diff of 0.2ms), but not on shavite. Seems ok

for groestl, i just didn't pick it (there is a real improvement too (0.5ms), but .... i don't understand it yet) ;)

on windows, you have nsight, which also gives you the timing of the different algos (an average is missing, however)

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

make && nvprof ./ccminer -a x11 -o stratum+tcp://mine.xpool.ca:8888 -u XeVrkPrWB7pDbdFLfKhF1Z3xpqhsx6wkH3 -p password

on windows it doesn't work well, but i prefer linux to do that ;)

much faster to compile

aes.cu also gives small improvements on echo (was the remaining diff of 0.2ms), but not on shavite. Seems ok

for groestl, i just didn't pick it (there is a real improvement too (0.5ms), but .... i don't understand it yet) ;)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

How do you profile? When I profile, I comment out all the other hash functions in the x11 source file, as well as the test nonce code.

In the groestl function, if you deliver the same nonce to every call, it will not improve because the branch prediction in the if statement will always hit. I think the same goes for SIMD. Nvidia just released a new version of the driver (.60). Did you upgrade?

Or it could be alignment issues: I may get improved speed in my tests because buffers get aligned on other boundaries when I remove hashing functions, so performance is lost or gained. If so, we have a lot of possible improvements to check.

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

i only own some 750 Ti ;)

yes, our work goes in the same direction. i rebuilt your sources to profile (make clean)

Code:

Time(%) Time Calls Avg Min Max Name
20.67% 3.03880s 77 39.465ms 39.221ms 44.059ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.96% 2.78695s 78 35.730ms 35.599ms 39.836ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.89% 1.89555s 78 24.302ms 24.154ms 27.170ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.15% 1.63887s 78 21.011ms 20.944ms 23.439ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
7.18% 1.05509s 78 13.527ms 13.491ms 15.093ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.30% 779.82ms 78 9.9977ms 9.9371ms 11.193ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
5.00% 735.23ms 78 9.4260ms 9.2738ms 10.456ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
3.08% 452.12ms 78 5.7964ms 5.7532ms 5.8691ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
3.06% 449.25ms 78 5.7597ms 5.7311ms 6.3923ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.80% 412.31ms 78 5.2861ms 5.1883ms 5.3934ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
2.74% 402.88ms 78 5.1652ms 5.1411ms 5.7546ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
2.69% 394.82ms 78 5.0618ms 5.0164ms 5.6468ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
2.58% 379.45ms 78 4.8647ms 4.8361ms 5.4086ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
1.62% 237.61ms 78 3.0462ms 3.0174ms 3.4082ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
0.28% 41.415ms 77 537.85us 533.88us 589.98us cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

Here are my results with a part of your commit (pushed on github)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

I think you need the AES change to reduce the number of register spills in the ECHO hash. My binaries (based on 1.4.5) were compiled with 64 registers, not 80. This is why some of the optimizations don't work without changing registers and launch bounds. I will continue to verify and check in more improvements, but I need to test to make sure that they don't break the hash chain.

Which GPU model are you testing on?

Also note that groestl_functions_quad.cu, cuda_x11_aes.cu and simd_functions.cu are not compiled (they are excluded from the build). When you change these files, you need to make a change in the files that include them in order to rebuild them.

Strange that you get no improvements in SIMD and groestl. Are you sure you did a clean and then a full build?

Looks like groestl_functions_quad.cu, cuda_x11_aes.cu and simd_functions.cu have not been built.

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

profile result of this commit :

Code:

Time(%) Time Calls Avg Min Max Name
20.82% 2.96411s 74 40.056ms 39.851ms 44.742ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.80% 2.67690s 75 35.692ms 35.574ms 39.812ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.79% 1.82197s 75 24.293ms 24.166ms 27.024ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.06% 1.57506s 75 21.001ms 20.943ms 23.427ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.63% 1.08677s 75 14.490ms 14.377ms 16.185ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.26% 749.42ms 75 9.9923ms 9.9294ms 11.130ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  4.96% 706.48ms 75 9.4197ms 9.2415ms 10.595ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.05% 434.68ms 75 5.7958ms 5.7581ms 5.8440ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  3.01% 427.94ms 75 5.7058ms 5.6788ms 6.3479ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.78% 395.99ms 75 5.2799ms 5.1751ms 5.3694ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.72% 387.33ms 75 5.1644ms 5.1375ms 5.7555ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.67% 380.04ms 75 5.0671ms 5.0126ms 5.6239ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.56% 364.92ms 75 4.8655ms 4.8348ms 5.4331ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.60% 228.42ms 75 3.0456ms 3.0165ms 3.3860ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28% 39.825ms 74 538.17us 535.38us 591.64us cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

Code:

Time(%) Time Calls Avg Min Max Name
20.69% 2.94306s 75 39.241ms 39.084ms 43.784ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.78% 2.67148s 76 35.151ms 35.039ms 39.165ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.98% 1.84595s 76 24.289ms 24.188ms 27.135ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.07% 1.57478s 75 20.997ms 20.934ms 23.428ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.22% 1.02751s 76 13.520ms 13.483ms 15.065ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.34% 759.31ms 76 9.9909ms 9.9291ms 11.134ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.04% 716.24ms 76 9.4243ms 9.2858ms 10.407ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.08% 438.16ms 76 5.7653ms 5.7369ms 6.4045ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.06% 434.61ms 75 5.7948ms 5.7602ms 5.8804ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.78% 395.22ms 75 5.2696ms 5.1818ms 5.4436ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.76% 392.47ms 76 5.1640ms 5.1345ms 5.7447ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.70% 384.16ms 76 5.0548ms 5.0094ms 5.6282ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.60% 369.59ms 76 4.8630ms 4.8242ms 5.4206ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.61% 228.50ms 75 3.0466ms 3.0221ms 3.3762ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  0.28% 40.323ms 75 537.64us 534.49us 589.79us cuda_check_gpu_hash_64(int, unsigned int, unsigned int*, unsigned int*, unsigned int*)

This time, indeed, i see real improvements.

For information, my builds were faster because i was not using multi-arch support for binaries (i guess the current arch check takes some time)

i will pick the echo and cube changes. i need to analyse the groestl one (looks weird), and the aes one could break compatibility with other archs + no real improvement
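
For a rough sense of scale, the average kernel times in the two tables above can be compared directly. The sketch below is plain C and only does the percentage arithmetic. It assumes the first table is the "before" run and the second the "after" run (the post does not say which is which), and percent_saved, kernel_avg and print_savings are illustrative names, not part of ccminer or nvprof.

```c
#include <stdio.h>

/* Relative time saved, in percent, going from before_ms to after_ms. */
double percent_saved(double before_ms, double after_ms)
{
    return (before_ms - after_ms) / before_ms * 100.0;
}

/* One row per kernel: values copied from the "Avg" column of the two tables.
 * Which table is "before" and which is "after" is an assumption. */
struct kernel_avg {
    const char *name;
    double before_ms;
    double after_ms;
};

const struct kernel_avg kernel_avgs[] = {
    { "x11_echo512_gpu_hash_64",           40.056, 39.241 },
    { "quark_groestl512_gpu_hash_64_quad", 35.692, 35.151 },
    { "x11_cubehash512_gpu_hash_64",       14.490, 13.520 },
};

/* Print one line per kernel with the relative saving. */
void print_savings(void)
{
    size_t n = sizeof kernel_avgs / sizeof kernel_avgs[0];
    for (size_t i = 0; i < n; ++i) {
        printf("%-36s %7.3f ms -> %7.3f ms  (%.2f%% saved)\n",
               kernel_avgs[i].name,
               kernel_avgs[i].before_ms,
               kernel_avgs[i].after_ms,
               percent_saved(kernel_avgs[i].before_ms, kernel_avgs[i].after_ms));
    }
}
```

Under that assumption, cubehash shows the largest relative gain of the three, with echo and groestl in the low single digits.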

th00ber

hero member

Activity: 789

Merit: 501

SP_ your work is really good and based on tpruvot's awesome ccminer (I love the color in the terminal :P)
I can't wait to test it after work on my Linux rig. I'll send you feedback and some BTC.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

More commits are coming. The first 2 beta exes I sent out didn't contain any of these improvements.
