CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1225.

kingscrown

hero member

Activity: 672

Merit: 500

http://fuk.io - check it out!

this mod looks SICK!

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Some of the bugs have been fixed in the tpruvot release that I forked. I will probably refork. My focus is on the kernels, and only 50% of the x11 kernels have been modded in the open-source version.

I just recompiled with yesterday's NVIDIA driver (344.75). There seems to be a hashrate increase of around 30KHASH on the 750ti.

polanskiman

full member

Activity: 266

Merit: 100

By the way, any intention of including, and perhaps improving, the m7 algo in the ccminer by SP_ releases?

polanskiman

full member

Activity: 266

Merit: 100

Quote from: ?? on ??

Quote from: sp_ on November 18, 2014, 06:05:32 PM

Quote from: ?? on ??

Probably a little more; and X15 can be improved a lot.

Bitslice it if you must. That will help you remove the memory issue.

Checked in a small boost by using the perm instruction in whirlpool, but I think I have to rewrite the shared mem part to get 1/8th the memory reads.

Bitslice would mean NO memory reads.
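To make the bitslice point concrete, here is a minimal host-side C sketch. It is not Whirlpool: the 3-bit S-box below is a toy picked only to keep the example short, and all function names are hypothetical. The idea is that one bit of 32 independent S-box inputs is packed into each word, and the S-box is then evaluated with AND/XOR/NOT on whole words, so no table is read.

```c
#include <stdint.h>

/* Toy 3-bit S-box as a lookup table (assumption: illustration only,
   this is NOT the Whirlpool S-box). */
static const uint8_t SBOX3[8] = {0, 3, 6, 1, 5, 4, 2, 7};

/* Table version: one memory read per S-box evaluation. */
uint8_t sbox_lookup(uint8_t x) { return SBOX3[x & 7]; }

/* Bitsliced version: x[b] holds input bit b of 32 independent lanes
   (lane k lives in bit k). Only logic operations, no table reads. */
void sbox_bitsliced(const uint32_t x[3], uint32_t y[3])
{
    y[0] = x[0] ^ (~x[1] & x[2]);
    y[1] = x[1] ^ (~x[2] & x[0]);
    y[2] = x[2] ^ (~x[0] & x[1]);
}

/* Self-check: pack 32 inputs, run the bitsliced S-box once, unpack,
   and compare every lane against the table. Returns 1 on success. */
int bitslice_matches_table(void)
{
    uint32_t x[3] = {0, 0, 0}, y[3];
    uint8_t in[32];

    for (int k = 0; k < 32; k++) {
        in[k] = (uint8_t)((k * 5 + 3) & 7);
        for (int b = 0; b < 3; b++)
            x[b] |= (uint32_t)((in[k] >> b) & 1) << k;
    }
    sbox_bitsliced(x, y);

    for (int k = 0; k < 32; k++) {
        uint8_t out = 0;
        for (int b = 0; b < 3; b++)
            out |= (uint8_t)(((y[b] >> k) & 1) << b);
        if (out != sbox_lookup(in[k]))
            return 0;
    }
    return 1;
}
```

The trade-off is that a real 8-bit S-box needs many more logic operations per evaluation, plus the packing and unpacking, so whether it beats a shared-memory table would have to be measured.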

For 750ti:

With ccminer v7 from DJm34 I have an average of:
x11 = 2605 Khash

With ccminer by SP_ release 8 I have an average of:
X11 = 2800 Khash

That's a 195 Khash average difference. It won't make me any richer, but it is always welcome.

ccminer by SP_ release 8 seems to have a few bugs, though. When you press Ctrl+C, ccminer crashes most of the time. Also, when you are asked whether to terminate the batch job, answering Y or N yields the same result: ccminer closes.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: ?? on ??

Probably a little more; and X15 can be improved a lot.

Bitslice it if you must. That will help you remove the memory issue.

Checked in a small boost by using the perm instruction in whirlpool, but I think I have to rewrite the shared mem part to get 1/8th the memory reads.
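For readers who have not used the perm instruction: the post does not say how it is used in whirlpool, so the following is only a host-side C model of the default mode of CUDA's `__byte_perm(x, y, s)`, with one well-known use (a rotate by 8 bits done as a single byte permute). The function names are hypothetical.

```c
#include <stdint.h>

/* Model of the default mode of __byte_perm(x, y, s): result byte n is
   picked from the 8-byte pool {x = bytes 0-3, y = bytes 4-7} using
   bits [2:0] of selector nibble n. */
uint32_t byte_perm_model(uint32_t x, uint32_t y, uint32_t s)
{
    uint64_t pool = ((uint64_t)y << 32) | x;
    uint32_t r = 0;

    for (int n = 0; n < 4; n++) {
        uint32_t sel = (s >> (4 * n)) & 7u;
        uint32_t byte = (uint32_t)(pool >> (8 * sel)) & 0xFFu;
        r |= byte << (8 * n);
    }
    return r;
}

/* Reference rotate-left with shifts; n must be in 1..31. */
uint32_t rotl32_shift(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Rotate-left by 8 as one byte permute: selector 0x2103 places
   source bytes 3, 0, 1, 2 into result bytes 0, 1, 2, 3. */
uint32_t rotl32_by8_perm(uint32_t x)
{
    return byte_perm_model(x, x, 0x2103);
}
```

For example, `rotl32_by8_perm(0x11223344)` gives `0x22334411`, the same as `rotl32_shift(0x11223344, 8)`. On the GPU the permute is one instruction instead of two shifts and an OR, and the same selector trick can pull single bytes out of a word for table indexing.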

th00ber

hero member

Activity: 789

Merit: 501

good job !

jpouza

legendary

Activity: 2716

Merit: 1116

Maximum on the 980, limited by 1.2500 V, the voltage limit of the reference cards.


jpouza

legendary

Activity: 2716

Merit: 1116

Targeting 10MH/s X11, keep pushing.

With SLI disabled things go higher; I will post a screenshot when trying to hit 10MH/s.


Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

wow

good game, didn't check blake512 for the moment, trying to fix the weird x13 behavior during benchmark

but i was able to see the improvements on windows too with the previous commit

EDIT: + 10KH also with blake on the 750 ti

Code:

20.56% 4.55172s 86 52.927ms 52.850ms 53.968ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.97% 4.19956s 87 48.271ms 48.164ms 53.870ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.70% 2.81199s 87 32.322ms 32.149ms 36.061ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.33% 2.50956s 87 28.846ms 28.786ms 30.704ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.30% 1.61676s 87 18.584ms 18.549ms 20.739ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.36% 1.18770s 87 13.652ms 13.590ms 15.225ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.08% 1.12451s 87 12.925ms 12.721ms 14.430ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.09% 685.04ms 86 7.9656ms 7.9084ms 8.0212ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.93% 648.66ms 87 7.4559ms 7.1070ms 8.3455ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.84% 629.44ms 87 7.2350ms 7.1123ms 7.3753ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.73% 604.12ms 87 6.9439ms 6.8900ms 7.7449ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.62% 579.68ms 87 *6.6630ms 6.6329ms 7.4384ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.57% 569.19ms 87 6.5424ms 6.5284ms 7.2974ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.62% 358.63ms 86 4.1702ms 4.1305ms 4.2341ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Faster blake (nist5, quark, x11, etc.).

http://www.filedropper.com/release8

Fixed a bug in release7 (invalid nonces). The bug was only present in the exe, because an old file was linked in instead of the latest.

source:

https://github.com/sp-hash/ccminer

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

I am only testing on Windows. There was a small bug in the exe file I sent out (exe 7). I have fixed it, and now I am preparing another check-in for later today. The next kernel to be checked in is blake.

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

Linux profile of your repo; indeed a big difference:

Code:

sp - before echo (linux x64)
==11174== Profiling result:
Time(%) Time Calls Avg Min Max Name
20.76% 2.87625s 53 54.269ms 54.098ms 55.278ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.83% 2.60877s 54 48.311ms 48.168ms 53.868ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
13.02% 1.80384s 54 33.404ms 32.752ms 37.241ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.04% 1.52931s 53 28.855ms 28.780ms 30.472ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.25% 1.00414s 54 18.595ms 18.548ms 20.737ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.32% 737.65ms 54 13.660ms 13.589ms 15.234ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.03% 697.42ms 54 12.915ms 12.778ms 14.462ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.05% 422.23ms 53 7.9665ms 7.8972ms 8.0252ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.90% 401.89ms 54 7.4425ms 6.9065ms 8.3138ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.90% 401.74ms 54 7.4396ms 7.4077ms 8.2859ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.77% 383.50ms 53 7.2358ms 7.1146ms 7.3789ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.69% 373.04ms 54 6.9082ms 6.8450ms 7.7322ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.55% 353.48ms 54 6.5459ms 6.5278ms 7.2944ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.60% 221.22ms 53 4.1741ms 4.1419ms 4.2535ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)

sp - 12d436ae1ecdc5e647a6a1576b98c4803510b13f
==25578== Profiling result:
Time(%) Time Calls Avg Min Max Name
20.56% 6.72060s 127 *52.918ms 52.822ms 53.985ms x11_echo512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
18.89% 6.17511s 128 48.243ms 48.147ms 53.860ms quark_groestl512_gpu_hash_64_quad(int, unsigned int, unsigned int*, unsigned int*)
12.65% 4.13517s 128 *32.306ms 32.181ms 36.017ms x11_shavite512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
11.29% 3.69123s 128 28.838ms 28.787ms 30.680ms x11_simd512_gpu_expand_64(int, unsigned int, unsigned long*, unsigned int*, uint4*)
  7.27% 2.37746s 128 18.574ms 18.547ms 20.732ms x11_cubehash512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.35% 1.74723s 128 13.650ms 13.589ms 15.257ms quark_jh512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  5.05% 1.65134s 128 12.901ms 12.699ms 14.372ms x11_luffa512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  3.12% 1.01978s 128 7.9670ms 7.9183ms 8.0247ms x11_simd512_gpu_compress2_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.91% 951.09ms 128 7.4304ms 7.4097ms 8.2771ms quark_blake512_gpu_hash_80(int, unsigned int, void*)
  2.88% 941.80ms 128 7.3578ms 6.9981ms 8.3027ms quark_bmw512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.83% 926.04ms 128 7.2347ms 7.0956ms 7.3374ms x11_simd512_gpu_compress1_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)
  2.72% 887.67ms 128 6.9349ms 6.8876ms 7.7173ms quark_skein512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  2.56% 836.91ms 128 6.5384ms 6.5282ms 7.2936ms quark_keccak512_gpu_hash_64(int, unsigned int, unsigned long*, unsigned int*)
  1.63% 533.83ms 128 4.1705ms 4.1391ms 4.2481ms x11_simd512_gpu_final_64(int, unsigned int, unsigned long*, unsigned int*, uint4*, int*)

Are you testing on Linux too, or just on Windows?

Still trying to get the same gains on Windows... but that takes a lot of time
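The per-kernel gain between the two Linux profiles above can be worked out from the Avg column. A small C sketch of that arithmetic follows; the three rows are copied from the profiles, while the struct and function names are made up for the example.

```c
#include <stddef.h>

/* Average time per call in ms, taken from the Avg column of the
   "before echo" profile and of the 12d436ae... profile. */
struct kernel_avg {
    const char *name;
    double before_ms;
    double after_ms;
};

const struct kernel_avg KERNELS[] = {
    {"x11_echo512_gpu_hash_64",    54.269, 52.918},
    {"x11_shavite512_gpu_hash_64", 33.404, 32.306},
    {"quark_blake512_gpu_hash_80", 7.4396, 7.4304},
};

/* Milliseconds saved per call for one kernel. */
double saved_ms(const struct kernel_avg *k)
{
    return k->before_ms - k->after_ms;
}

/* Saving relative to the "before" time, in percent. */
double saved_percent(const struct kernel_avg *k)
{
    return 100.0 * saved_ms(k) / k->before_ms;
}

/* Sum of the per-call savings over the listed kernels. */
double total_saved_ms(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < sizeof KERNELS / sizeof KERNELS[0]; i++)
        sum += saved_ms(&KERNELS[i]);
    return sum;
}
```

With these numbers echo saves about 1.35 ms per call and shavite about 1.10 ms, while blake is essentially unchanged between these two commits.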

tbearhere

legendary

Activity: 3164

Merit: 1003

On the 750ti 50kh more x11.

jpouza

legendary

Activity: 2716

Merit: 1116

Quote from: sp_ on November 16, 2014, 06:47:07 AM

I have checked in some more performance improvements. I moved the precalc table in echo from constmem to the instruction cache. Improved registers/launchbounds on shavite.

The 980 is now around 400KHASH faster than release 6 on stock clocks (x11).

Here is the link:

http://www.filedropper.com/release7

Nice, 9MH/s with 185+ on 980 GPUs.
10MH/s with extreme overclock GPU at 300+ and overvolted.

750Ti boost to 2.9MH/s with 135+ GPU 460+ MEM.

Cheers

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

I have checked in some more performance improvements. I moved the precalc table in echo from constmem to the instruction cache. Improved registers/launchbounds on shavite.

The 980 is now around 400KHASH faster than release 6 on stock clocks (x11).

Here is the link:

http://www.filedropper.com/release7

The source code is available here:

https://github.com/sp-hash/ccminer

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: Schleicher on November 16, 2014, 01:36:00 AM

Possible small optimization at the end of cuda_echo_round:

Code:

	for (int i = 0; i<15; i += 4)
	{
		W[i] ^= W[32 + i] ^ 512;
		W[i + 1] ^= W[32 + i + 1];
		W[i + 2] ^= W[32 + i + 2];
		W[i + 3] ^= W[32 + i + 3];
	}
	W[15] ^= W[47] ^ 512;

(we don't need more than 16)

Thanks, it works.

   for (int i = 0; i<15; i += 4)
   {
      W[i] ^= W[32 + i] ^ 512;
      W[i + 1] ^= W[32 + i + 1];
      W[i + 2] ^= W[32 + i + 2];
      W[i + 3] ^= W[32 + i + 3];
   }

is enough.
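As a quick host-side check of what that loop touches, here is a C sketch. The element type `uint32_t` and the 48-word array size are assumptions for the test harness only, not taken from the kernel source, and the function names are hypothetical.

```c
#include <stdint.h>

/* The loop from the post: i takes the values 0, 4, 8, 12, so each of
   W[0..15] is XORed once with W[32..47], and every 4th word also gets
   the constant 512. W must hold at least 48 words. */
void echo_final_fold(uint32_t *W)
{
    for (int i = 0; i < 15; i += 4) {
        W[i]     ^= W[32 + i] ^ 512;
        W[i + 1] ^= W[32 + i + 1];
        W[i + 2] ^= W[32 + i + 2];
        W[i + 3] ^= W[32 + i + 3];
    }
}

/* Self-check: W[0..15] change as expected and W[16..47] stay untouched.
   Returns 1 on success. */
int fold_touches_only_first_16(void)
{
    uint32_t W[48], orig[48];

    for (int k = 0; k < 48; k++)
        W[k] = orig[k] = (uint32_t)k * 0x01010101u + 7u;

    echo_final_fold(W);

    for (int k = 0; k < 16; k++) {
        uint32_t expect = orig[k] ^ orig[32 + k];
        if (k % 4 == 0)
            expect ^= 512;
        if (W[k] != expect)
            return 0;
    }
    for (int k = 16; k < 48; k++)
        if (W[k] != orig[k])
            return 0;
    return 1;
}
```

The last iteration (i = 12) already updates W[15] through the `W[i + 3]` line.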

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: Epsylon3 on November 15, 2014, 08:41:34 PM

Indeed, +15kH on the 750 Ti (2ms improvement on your repo, it's the biggest optimisation you have made on a single algo, was 0.5 before, i will pick it for the 1.5.0)
on mine, i get +9 KH (2791 vs 2800KH in benchmark mode) but i didn't take the launch bounds change for the moment...
39.171ms before, 38.522ms after = 0.65ms on mine, enough for me (but not fully comparable)
EDIT: but on windows :// seems to be lowered, investigating...

In addition to the launch bounds change, did you remember to go from 256 to 320 threads when calling the kernel? The launch bounds will force the compiler to use 64 registers. We get more spills to memory, but it seems to run faster.
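A rough way to see where a 64-register cap can come from is sketched below in C. Everything except the 320 threads and the 64 registers is an assumption: 65536 registers per multiprocessor, an allocation granularity of 8 registers per thread, and a hypothetical minimum of 3 resident blocks (the post does not give the second launch bounds argument). The real compiler heuristics may differ.

```c
/* Approximate per-thread register budget implied by
   __launch_bounds__(threads_per_block, min_blocks):
   the register file is split over all threads that must be resident,
   then rounded down to the allocation granularity. */
int register_cap(int regs_per_sm, int threads_per_block,
                 int min_blocks, int granularity)
{
    int resident_threads = threads_per_block * min_blocks;
    int cap = regs_per_sm / resident_threads;
    return cap - (cap % granularity);
}
```

Under those assumptions `register_cap(65536, 320, 3, 8)` gives 64, while the same call with 256 threads per block gives 80, which would fit with the compiler spilling more once the block size goes up to 320.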

Schleicher

hero member

Activity: 675

Merit: 514

Possible small optimization at the end of cuda_echo_round:

Code:

	for (int i = 0; i<15; i += 4)
	{
		W[i] ^= W[32 + i] ^ 512;
		W[i + 1] ^= W[32 + i + 1];
		W[i + 2] ^= W[32 + i + 2];
		W[i + 3] ^= W[32 + i + 3];
	}
	W[15] ^= W[47] ^ 512;

(we don't need more than 16)

Epsylon3

legendary

Activity: 1484

Merit: 1082

ccminer/cpuminer developer

Quote from: sp_ on November 15, 2014, 12:57:29 PM

The problem is that the throughput is set to be fast on the 980 in the source version. Download the 1.4.9 source and replace the file

cuda_x11_echo.cu with the one from my fork.

You should get a small boost in x11. On quark I have optimized bmw and blake a tiny bit.

1.4.9 with the intensity parameter is found here:

https://github.com/tpruvot/ccminer

Indeed, +15kH on the 750 Ti (2ms improvement on your repo, it's the biggest optimisation you have made on a single algo, was 0.5 before, i will pick it for the 1.5.0)

on mine, i get +9 KH (2791 vs 2800KH in benchmark mode) but i didn't take the launch bounds change for the moment...

39.171ms before, 38.522ms after = 0.65ms on mine, enough for me (but not fully comparable)

EDIT: but on windows :// seems to be lowered, investigating...

tbearhere

legendary

Activity: 3164

Merit: 1003

Quote from: tbearhere on October 27, 2014, 04:11:34 PM

sp, I'm using your older ccminer and I'm really pushing it. Looking forward to your new one.

the fastest hashing is 1.4.6
