
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1225. (Read 2347588 times)

hero member
Activity: 672
Merit: 500
http://fuk.io - check it out!
this mod looks SICK!
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Some of the bugs have been removed in the tpruvot release that I forked. I will probably refork. My focus is on the kernels, and only 50% of the x11 kernels have been modded in the open source.

I just recompiled with yesterday's NVIDIA driver (344.75). There seems to be a hashrate increase on the 750ti of around 30KHASH.
full member
Activity: 266
Merit: 100
By the way, any intention of including and perhaps improving the m7 algo in the ccminer by SP_ releases?
full member
Activity: 266
Merit: 100
Probably a little more; and X15 can be improved a lot.
Bitslice it if you must. That will help you remove the memory issue.

Checked in a small boost by using the perm  instruction in whirlpool, but I think I have to rewrite the shared mem part to get 1/8th the memory reads.

Bitslice would mean NO memory reads.
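
To illustrate the "no memory reads" point, here is a minimal sketch of bitslicing in plain C. It is not code from ccminer: the function names are hypothetical, and the operation shown (multiply by x in GF(2^8) with the AES polynomial 0x11B) is chosen only because it is small. Each 32-bit word holds one bit position of 32 independent bytes, so the whole step is XORs and register moves with no table lookups.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference: multiply one byte by x in GF(2^8), AES polynomial 0x11B. */
static uint8_t xtime_scalar(uint8_t b)
{
    return (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1B : 0x00));
}

/* Bitsliced form: in[k] holds bit k of 32 independent bytes.
 * The shift becomes a renaming of planes; the conditional XOR of 0x1B
 * becomes an XOR of the old top plane into planes 0, 1, 3 and 4. */
static void xtime_bitsliced(const uint32_t in[8], uint32_t out[8])
{
    uint32_t carry = in[7];
    out[0] = carry;
    out[1] = in[0] ^ carry;
    out[2] = in[1];
    out[3] = in[2] ^ carry;
    out[4] = in[3] ^ carry;
    out[5] = in[4];
    out[6] = in[5];
    out[7] = in[6];
}

/* Transpose 32 bytes into 8 bit planes. */
static void pack_planes(const uint8_t bytes[32], uint32_t plane[8])
{
    for (int k = 0; k < 8; k++) {
        plane[k] = 0;
        for (int j = 0; j < 32; j++)
            plane[k] |= (uint32_t)((bytes[j] >> k) & 1u) << j;
    }
}

/* Transpose 8 bit planes back into 32 bytes. */
static void unpack_planes(const uint32_t plane[8], uint8_t bytes[32])
{
    for (int j = 0; j < 32; j++) {
        bytes[j] = 0;
        for (int k = 0; k < 8; k++)
            bytes[j] |= (uint8_t)(((plane[k] >> j) & 1u) << k);
    }
}

/* Checks all 256 byte values, 32 at a time, against the scalar version.
 * Returns the number of bytes checked. */
static int check_bitsliced_xtime(void)
{
    uint8_t in[32], out[32];
    uint32_t pin[8], pout[8];
    int checked = 0;
    for (int base = 0; base < 256; base += 32) {
        for (int j = 0; j < 32; j++)
            in[j] = (uint8_t)(base + j);
        pack_planes(in, pin);
        xtime_bitsliced(pin, pout);
        unpack_planes(pout, out);
        for (int j = 0; j < 32; j++) {
            assert(out[j] == xtime_scalar(in[j]));
            checked++;
        }
    }
    return checked;
}
```

Calling check_bitsliced_xtime() from a main() runs the comparison. A full S-box needs far more gates than this single step, so whether bitslicing wins on a given kernel still has to be measured.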

For 750ti:

With ccminer v7 from DJm34 I have an average of:
x11 = 2605 Khash

With ccminer by SP_ release 8 I have an average of:
X11 = 2800 Khash

That's a 195 Khash average difference. It won't make me any richer, but it is always welcome :D

ccminer by SP_ release 8 seems to have a few bugs though. When you press Ctrl+C, ccminer crashes most of the time. Also, when you are asked whether to terminate the batch job, answering Y or N yields the same result: ccminer is closed.
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Probably a little more; and X15 can be improved a lot.
Bitslice it if you must. That will help you remove the memory issue.

Checked in a small boost by using the perm  instruction in whirlpool, but I think I have to rewrite the shared mem part to get 1/8th the memory reads.
hero member
Activity: 789
Merit: 501
good job ! Smiley
legendary
Activity: 2716
Merit: 1116
Maximum on the 980, limited to 1.2500V by the voltage limits of the reference cards :(


legendary
Activity: 2716
Merit: 1116
Targeting 10MH/s X11, keep pushing  Cool

With SLI disabled things go higher; I will post a screenshot while trying to hit 10MH/s.


legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
Wow :) good game. I didn't check blake512 for the moment; I'm trying to fix the weird x13 behavior during benchmark.

but i was able to see the improvements on windows too with the previous commit

EDIT: + 10KH also with blake on the 750 ti

Code:

 20.56%  4.55172s        86  52.927ms  52.850ms  53.968ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.97%  4.19956s        87  48.271ms  48.164ms  53.870ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.70%  2.81199s        87  32.322ms  32.149ms  36.061ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.33%  2.50956s        87  28.846ms  28.786ms  30.704ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.30%  1.61676s        87  18.584ms  18.549ms  20.739ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.36%  1.18770s        87  13.652ms  13.590ms  15.225ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.08%  1.12451s        87  12.925ms  12.721ms  14.430ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.09%  685.04ms        86  7.9656ms  7.9084ms  8.0212ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.93%  648.66ms        87  7.4559ms  7.1070ms  8.3455ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.84%  629.44ms        87  7.2350ms  7.1123ms  7.3753ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.73%  604.12ms        87  6.9439ms  6.8900ms  7.7449ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.62%  579.68ms        87  *6.6630ms  6.6329ms  7.4384ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.57%  569.19ms        87  6.5424ms  6.5284ms  7.2974ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.62%  358.63ms        86  4.1702ms  4.1305ms  4.2341ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Faster blake (nist5, quark,x11 etc.).

http://www.filedropper.com/release8

Fixed a bug in release7 (invalid nonces). The bug was only present in the exe, because an old file was linked in instead of the latest.

source:

https://github.com/sp-hash/ccminer
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
I am only testing on Windows. There was a small bug in the exe file I sent out (exe 7). I have fixed it, and now I am preparing another check-in for later today. The next kernel to be checked in is blake.
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
Linux profile of your repo, indeed big difference :

Code:
sp - before echo (linux x64)
==11174== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 20.76%  2.87625s        53  54.269ms  54.098ms  55.278ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.83%  2.60877s        54  48.311ms  48.168ms  53.868ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 13.02%  1.80384s        54  33.404ms  32.752ms  37.241ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.04%  1.52931s        53  28.855ms  28.780ms  30.472ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.25%  1.00414s        54  18.595ms  18.548ms  20.737ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.32%  737.65ms        54  13.660ms  13.589ms  15.234ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.03%  697.42ms        54  12.915ms  12.778ms  14.462ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.05%  422.23ms        53  7.9665ms  7.8972ms  8.0252ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.90%  401.89ms        54  7.4425ms  6.9065ms  8.3138ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.90%  401.74ms        54  7.4396ms  7.4077ms  8.2859ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.77%  383.50ms        53  7.2358ms  7.1146ms  7.3789ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.69%  373.04ms        54  6.9082ms  6.8450ms  7.7322ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.55%  353.48ms        54  6.5459ms  6.5278ms  7.2944ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.60%  221.22ms        53  4.1741ms  4.1419ms  4.2535ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)

sp - 12d436ae1ecdc5e647a6a1576b98c4803510b13f
==25578== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 20.56%  6.72060s       127  *52.918ms  52.822ms  53.985ms  x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 18.89%  6.17511s       128  48.243ms  48.147ms  53.860ms  quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
 12.65%  4.13517s       128  *32.306ms  32.181ms  36.017ms  x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
 11.29%  3.69123s       128  28.838ms  28.787ms  30.680ms  x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.27%  2.37746s       128  18.574ms  18.547ms  20.732ms  x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.35%  1.74723s       128  13.650ms  13.589ms  15.257ms  quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.05%  1.65134s       128  12.901ms  12.699ms  14.372ms  x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.12%  1.01978s       128  7.9670ms  7.9183ms  8.0247ms  x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.91%  951.09ms       128  7.4304ms  7.4097ms  8.2771ms  quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.88%  941.80ms       128  7.3578ms  6.9981ms  8.3027ms  quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.83%  926.04ms       128  7.2347ms  7.0956ms  7.3374ms  x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.72%  887.67ms       128  6.9349ms  6.8876ms  7.7173ms  quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.56%  836.91ms       128  6.5384ms  6.5282ms  7.2936ms  quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.63%  533.83ms       128  4.1705ms  4.1391ms  4.2481ms  x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)

Are you testing on Linux too, or just on Windows?

Still trying to get the same gains on Windows... but that takes a lot of time.
legendary
Activity: 3164
Merit: 1003
On the 750ti 50kh more x11.  Smiley
legendary
Activity: 2716
Merit: 1116
I have checked in some more performance improvements. I moved the precalc table in echo from constmem to the instruction cache. Improved registers/launchbounds on shavite.

The 980 is now around 400KHASH faster than release 6 on stock clocks (x11).

Here is the link:

http://www.filedropper.com/release7

Nice, 9MH/s with 185+ on 980 GPUs.
10MH/s with extreme overclock GPU at 300+ and overvolted.

750Ti boost to 2.9MH/s with 135+ GPU 460+ MEM.

Cheers
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
I have checked in some more performance improvements. I moved the precalc table in echo from constmem to the instruction cache. Improved registers/launchbounds on shavite.

The 980 is now around 400KHASH faster than release 6 on stock clocks (x11).

Here is the link:

http://www.filedropper.com/release7

The sourcecode is available here:

https://github.com/sp-hash/ccminer
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Possible small optimization at the end of cuda_echo_round:
Code:
for (int i = 0; i < 15; i += 4)
{
	W[i] ^= W[32 + i] ^ 512;
	W[i + 1] ^= W[32 + i + 1];
	W[i + 2] ^= W[32 + i + 2];
	W[i + 3] ^= W[32 + i + 3];
}
W[15] ^= W[47] ^ 512;
(we don't need more than 16)

Thanks, it works.

   for (int i = 0; i<15; i += 4)
   {
      W[i] ^= W[32 + i] ^ 512;
      W[i + 1] ^= W[32 + i + 1];
      W[i + 2] ^= W[32 + i + 2];
      W[i + 3] ^= W[32 + i + 3];
   }

is enough.
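
A quick way to see what the four-iteration loop covers is to run it on the host. The sketch below is plain C, not the CUDA kernel: the array size, the test values and the function names are assumptions made for illustration. It checks that the loop writes exactly W[0] to W[15], that only every fourth element picks up the 512 constant, and that nothing from index 16 upward is modified.

```c
#include <assert.h>
#include <stdint.h>

enum { ECHO_WORDS = 48 }; /* hypothetical size: just enough for W[0..47] */

/* The loop from the thread, unchanged apart from layout. */
static void echo_tail(uint32_t W[ECHO_WORDS])
{
    for (int i = 0; i < 15; i += 4) {
        W[i]     ^= W[32 + i] ^ 512;
        W[i + 1] ^= W[32 + i + 1];
        W[i + 2] ^= W[32 + i + 2];
        W[i + 3] ^= W[32 + i + 3];
    }
}

/* Runs the loop on arbitrary data and returns a bit mask of the
 * elements among W[0..15] whose value changed. */
static unsigned echo_tail_written_mask(void)
{
    uint32_t before[ECHO_WORDS], W[ECHO_WORDS];
    unsigned mask = 0;

    /* Arbitrary distinct, nonzero test values (not real echo state). */
    for (int i = 0; i < ECHO_WORDS; i++)
        before[i] = W[i] = 0x9E3779B9u * (uint32_t)(i + 1);

    echo_tail(W);

    for (int i = 0; i < 16; i++) {
        uint32_t expect = before[i] ^ before[32 + i]
                        ^ ((i % 4 == 0) ? 512u : 0u);
        assert(W[i] == expect);
        if (W[i] != before[i])
            mask |= 1u << i;
    }
    /* Everything from W[16] upward must be untouched. */
    for (int i = 16; i < ECHO_WORDS; i++)
        assert(W[i] == before[i]);

    return mask;
}
```

The last iteration (i = 12) already updates W[15] from W[47]. Appending W[15] ^= W[47] ^ 512; after the loop would XOR W[47] in a second time and cancel the first, which is consistent with the reply that the loop alone is enough. The sketch does not check either form against a reference echo implementation.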
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Indeed, +15kH on the 750 Ti (2ms improvement on your repo; it's the biggest optimisation you have made on a single algo, it was 0.5 before. I will pick it for the 1.5.0)
on mine, I get +9 KH (2791 vs 2800KH in benchmark mode) but I didn't take the launch bounds change for the moment...
39.171ms before, 38.522ms after = 0.65ms on mine, enough for me (but not fully comparable)
EDIT: but on Windows :// it seems to be lowered, investigating...

In addition to the launch bounds change, did you remember to go from 256 to 320 threads when calling the kernel? The launch bound will force the compiler to use 64 registers. We get more spills to memory, but it seems to run faster.
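
As a rough worked example of how a launch bound turns into a register cap, the host-side C sketch below does the arithmetic. Only the 320 threads and the 64 registers come from the post above. The register file size (65536 per SM, the Maxwell figure), the allocation step of 8 registers per thread and the "3 blocks per SM" value are assumptions, and reg_cap is a hypothetical helper, not a CUDA API.

```c
#include <assert.h>

/* Assumed hardware figures for a Maxwell-class SM; not stated in the thread. */
enum {
    REGS_PER_SM        = 65536, /* 32-bit registers per SM */
    REG_GRANULARITY    = 8,     /* assumed per-thread allocation step */
    MAX_REGS_PER_THREAD = 255   /* architectural per-thread limit */
};

/* Largest per-thread register count that still lets min_blocks blocks
 * of threads_per_block threads be resident on one SM at the same time. */
static int reg_cap(int threads_per_block, int min_blocks)
{
    assert(threads_per_block > 0 && min_blocks > 0);
    int raw = REGS_PER_SM / (threads_per_block * min_blocks);
    int cap = raw - raw % REG_GRANULARITY; /* round down to the step */
    return cap > MAX_REGS_PER_THREAD ? MAX_REGS_PER_THREAD : cap;
}
```

Under these assumptions reg_cap(320, 3) gives 64, matching the figure in the post, while reg_cap(256, 3) would allow 80. In CUDA source the bound is written on the kernel as __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor), and the launch itself must not use more threads per block than the first argument.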
hero member
Activity: 675
Merit: 514
Possible small optimization at the end of cuda_echo_round:
Code:
for (int i = 0; i < 15; i += 4)
{
	W[i] ^= W[32 + i] ^ 512;
	W[i + 1] ^= W[32 + i + 1];
	W[i + 2] ^= W[32 + i + 2];
	W[i + 3] ^= W[32 + i + 3];
}
W[15] ^= W[47] ^ 512;
(we don't need more than 16)
legendary
Activity: 1484
Merit: 1082
ccminer/cpuminer developer
The problem is that the throughput is set to be fast on the 980 in the source version. Download the 1.4.9 source and replace the file:

cuda_x11_echo.cu from my fork

You should get a small boost in x11. On quark I have optimized bmw and blake a tiny bit.

1.4.9 with the intensity parameter is found here:

 https://github.com/tpruvot/ccminer

Indeed, +15kH on the 750 Ti (2ms improvement on your repo; it's the biggest optimisation you have made on a single algo, it was 0.5 before. I will pick it for the 1.5.0)

on mine, I get +9 KH (2791 vs 2800KH in benchmark mode) but I didn't take the launch bounds change for the moment...

39.171ms before, 38.522ms after = 0.65ms on mine, enough for me (but not fully comparable)

EDIT: but on Windows :// it seems to be lowered, investigating...
legendary
Activity: 3164
Merit: 1003


sp your older ccminer  and im really pushing it. looking forward to your new one.  Smiley
the fastest hashing is 1.4.6  