CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1160.

scryptr

legendary

Activity: 1797

Merit: 1028

QUBIT --

I've been mining qubit on Yaamp with almost no accepts. There are no errors and maybe one or two accepts, but mostly a long chain of block change messages, one after another. My 960 is getting about 7880 kH/s, but does no productive work. I am using version 39 on Windows 7.

sp-ccminer v39:

[image posted by scryptr]

tpruvot ccminer v1.5.3:

[image posted by scryptr]

--scryptr

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: Namsbreh on February 18, 2015, 05:39:55 AM

How is the next version of the spreadminer coming along?

I'm rewriting the sha part. Perhaps I'll be done by the weekend.

Namsbreh

sr. member

Activity: 318

Merit: 250

How is the next version of the spreadminer coming along?

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Didn't work.

The strange thing is that this doesn't work either:

__device__ __forceinline__
uint32_t SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t *tmp;
   tmp=x;
   return tmp[low];
}
...
tmp=SPRead(sharedmemory,x0 &0xff);

but this works:

tmp= sharedmemory[x0 &0xff];

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: sp_ on February 17, 2015, 04:33:38 PM

Any coders who can help me?

I am trying to reduce the number of instructions per shared mem access from 4 to 2 in cuda_x11_aes.cu

I am trying to force the pointer to be correct with this code:

__device__ __forceinline__
uint32_t SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t tmp,res;
   asm("bfi.b32 %0, %1, %2, 10 , 8;" : "=r"(tmp) : "r"(low), "r"(x));
   asm("ld.shared.u32 %0,[%1];" : "=r"(res) : "r"(tmp));
   return res;
}

tmp in the second instruction should be a pointer

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Any coders who can help me?

I am trying to reduce the number of instructions per shared mem access from 4 to 2 in cuda_x11_aes.cu

I am trying to force the pointer to be correct with this code:

__device__ __forceinline__
uint32_t SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t tmp,res;
   asm("bfi.b32 %0, %1, %2, 10 , 8;" : "=r"(tmp) : "r"(low), "r"(x));
   asm("ld.shared.u32 %0,[%1];" : "=r"(res) : "r"(tmp));
   return res;
}

The PTX looks good:

cvta.shared.u32    %r393, _Z23x11_echo512_gpu_hash_64jjPy$__cuda_local_var_204566_57_non_const_sharedMemory;
// inline asm
bfi.b32 %r232, %r18, %r393, 10 , 8;
// inline asm
// inline asm
ld.shared.u32 %r235,[%r232];

But the program crashes with an illegal memory access. The shared memory pointer is aligned to a 1024-byte boundary.

Without the pointer hack I get code like this:

   bfe.u32 %r243, %r4, %r440, 8;
   // inline asm
   shl.b32    %r468, %r243, 2;
   add.s32    %r469, %r468, %r451;
   ld.shared.u32    %r470, [%r469+1024];

4 instructions.
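
For anyone trying to reproduce this, here is a host-side sketch in plain C of what `bfi.b32` does with the operands in the asm above. It is only a model of the documented PTX semantics (insert the low `len` bits of `a` into `b`, starting at bit `pos`). The helper names `bfi32`, `addr_as_posted` and `addr_word_offset` and the example addresses are made up for illustration. With a start bit of 10 the index lands above the 1024-byte table instead of in the word offset, which may be one reason for the illegal access. It does not rule out other causes, such as feeding the generic address produced by `cvta.shared` to `ld.shared`.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of PTX "bfi.b32 d, a, b, pos, len":
   copy the low `len` bits of `a` into `b`, starting at bit `pos`.
   Hypothetical helper, only valid for pos < 32, len < 32, pos + len <= 32. */
static uint32_t bfi32(uint32_t a, uint32_t b, uint32_t pos, uint32_t len)
{
    assert(pos < 32 && len < 32 && pos + len <= 32);
    uint32_t mask = ((1u << len) - 1u) << pos;
    return (b & ~mask) | ((a << pos) & mask);
}

/* Address as the posted asm computes it: 8-bit field inserted at bit 10. */
static uint32_t addr_as_posted(uint32_t base, uint32_t index)
{
    return bfi32(index, base, 10, 8);
}

/* Address with the field at bit 2, i.e. a byte offset of index * 4
   into a table of 32-bit words. */
static uint32_t addr_word_offset(uint32_t base, uint32_t index)
{
    return bfi32(index, base, 2, 8);
}
```

With an assumed base of 0x0400 and index 0x12, `addr_word_offset` gives 0x0448 (base + 4 * index), while `addr_as_posted` gives 0x4800: the index overwrites bits 10 to 17 of the base itself.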

rednoW

legendary

Activity: 1510

Merit: 1003

Quote from: sp_ on February 17, 2015, 04:33:41 AM

cuda_x11_aes.cu is excluded from the project file, so if you change it, it will not be rebuilt unless you save echo or shavite or do a full build. To measure, you can use Fresh, because it has fewer chained hashing algos.

I checked a VS build log after the rollback. cuda_x11_aes.cu is #included in 2 other .cu files, and VS rebuilt those. So I think I did it right.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: scotte on February 16, 2015, 07:36:12 AM

qubit has slowed from 3609 kH/s on the previous version to 3571 kH/s on the current one - GTX 850M Linux, using "git pull" to update, build.sh has -O3 passed to configure

The fresh algo uses 2 rounds of shavite, 2 of simd and 1 echo.

shavite512
simd512
shavite512
simd512
echo512

Shavite and echo are affected by my AES change, which is faster on the compute 5.2 cards. I am rewriting the AES now, so the next commit will hopefully increase the performance of fresh.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Another problem is 64-bit vs 32-bit, and Windows vs Linux.

I optimize for Windows and 32-bit. On Linux, builds are normally 64-bit. (Some optimizations are faster when building 32-bit.)

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: rednoW on February 17, 2015, 03:16:44 AM

Quote from: sp_ on February 16, 2015, 06:04:01 PM

The last change in AES is slower on the 750 Ti. I am working to improve it.

I've rolled back cuda_x11_aes.cu and seen no difference. Also, the commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.

x14 has 14 chained algos. If I optimize shabal by 2%, the increase in the total hashrate is very small, something like 0.005%. This is because groestl, echo, simd, etc. are much slower and take most of the time.
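
To put rough numbers on that, here is a small C sketch of Amdahl's law. The time share of each algorithm in the chain is not given in the thread, so the fractions in the note below are made-up inputs, and `chain_speedup` is a hypothetical helper.

```c
#include <assert.h>

/* Overall speedup of a hash chain when one stage that takes `fraction`
   of the total time (0..1) becomes `stage_speedup` times faster. */
static double chain_speedup(double fraction, double stage_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && stage_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup);
}
```

For example, if shabal took 1% of the chain time (an assumed figure), a 2% faster shabal gives `chain_speedup(0.01, 1.02)`, about 1.0002, or roughly 0.02% overall. The same 2% on a stage that takes 30% of the time would give about 0.6%.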

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Quote from: rednoW on February 17, 2015, 03:16:44 AM

Quote from: sp_ on February 16, 2015, 06:04:01 PM

The last change in AES is slower on the 750 Ti. I am working to improve it.

I've rolled back cuda_x11_aes.cu and seen no difference. Also, the commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.

cuda_x11_aes.cu is excluded from the project file, so if you change it, it will not be rebuilt unless you save echo or shavite or do a full build. To measure, you can use Fresh, because it has fewer chained hashing algos.
The differences are small, but the PTX code (assembly code) shows fewer instructions with my changes. Also, different cards have different timings.
A small speedup of 1kHASH will not be noticeable, but 100 small speedups will.

rednoW

legendary

Activity: 1510

Merit: 1003

Quote from: sp_ on February 16, 2015, 06:04:01 PM

The last change in AES is slower on the 750 Ti. I am working to improve it.

I've rolled back cuda_x11_aes.cu and seen no difference. Also, the commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

The last change in AES is slower on the 750 Ti. I am working to improve it.

RtRaway

member

Activity: 90

Merit: 10

Can you tell me a good video card for $370?
Tq

tbearhere

legendary

Activity: 3164

Merit: 1003

#39 gives a lower hashrate on most algos, so I'm still using the best one, #33. Keccak did improve, though: it was 162,000 kH/s and is now 170,000 kH/s with #39, on a 750 Ti.

antantti

legendary

Activity: 1176

Merit: 1015

Quote from: djm34 on February 16, 2015, 10:18:50 AM

Those algos don't use much VRAM, so it shouldn't be a problem (actually, there aren't many algos that use more than 2 GB).

Sorry sp_ for hijacking this thread, but I need to ask you and djm34: which algos benefit from more than 2 GB of VRAM? And why do some algos get a boost from a memory overclock while most don't? Usually it is all about core clock; you can heavily downclock the memory to save power and there is no difference in mining performance.

And if that 2 GB doesn't matter, what is holding back the GTX 960 vs the 750 Ti? On djm's neoscrypt miner the 960 does +100% against the 750 Ti; on other algos the difference is mostly somewhere in the +60-90% range.

Sorry guys, a computer enthusiast without coding skills just wants to learn...

djm34

legendary

Activity: 1400

Merit: 1050

Quote from: bronan on February 16, 2015, 09:44:41 AM

Quote from: hefty on February 15, 2015, 03:41:52 PM

New build with more hashing power in most algos.

from release 38 we have

-faster keccak
-faster fugue
-faster hamsi
-faster aes
-faster echo
-some bugfixes

1.5.39(sp-MOD) is available here: (15-feb-2015)

https://github.com/sp-hash/ccminer/releases/tag/1.5.39

The sourcecode is available here:

https://github.com/sp-hash/ccminer

Did you check how much memory is needed for this? Could it be the 970 memory limit at 3.5 GB?

The 980 has 4 GB. Besides that, I think there is no difference between the 970 and the 980, except that the 980 has more cores.

Still waiting for my watercooled 980 to be delivered.

Those algos don't use much VRAM, so it shouldn't be a problem (actually, there aren't many algos that use more than 2 GB).

bronan

hero member

Activity: 774

Merit: 500

Lazy Lurker Reads Alot

Quote from: hefty on February 15, 2015, 03:41:52 PM

Quote from: sp_ on February 15, 2015, 06:13:48 AM

New build with more hashing power in most algos.

from release 38 we have

-faster keccak
-faster fugue
-faster hamsi
-faster aes
-faster echo
-some bugfixes

1.5.39(sp-MOD) is available here: (15-feb-2015)

https://github.com/sp-hash/ccminer/releases/tag/1.5.39

The sourcecode is available here:

https://github.com/sp-hash/ccminer

Did you check how much memory is needed for this? Could it be the 970 memory limit at 3.5 GB?

The 980 has 4 GB. Besides that, I think there is no difference between the 970 and the 980, except that the 980 has more cores.

Still waiting for my watercooled 980 to be delivered.

scotte

newbie

Activity: 14

Merit: 0

qubit has slowed from 3609 kH/s on the previous version to 3571 kH/s on the current one - GTX 850M Linux, using "git pull" to update, build.sh has -O3 passed to configure

sp_

legendary

Activity: 2954

Merit: 1087

Team Black developer

Tonight I will try to use the BFI instruction to modify pointers. I will align shared memory to 1024-byte boundaries and insert the 8-bit offset into bits 2-9 of the pointer directly.

Today a byteperm+shift+add is needed to calculate the shared memory address
(echo/shavite).
Hopefully AES will be faster.

3 instructions down to 1 instruction.

In echo alone there are 5120 random shared memory accesses. If this works, 1 round of echo (hash algo #11 in x11) will use around 10240 fewer instructions to do the same work. IMHO a massive improvement.

In my bitslice groestl improvement I removed around 1000 instructions and the x11 hash got 50-100KHASH faster on the 750 Ti.
With 10,000 instructions removed, perhaps 500KHASH faster is possible.

But I assume 32-bit pointers, so it might not work on 64-bit builds. And it might not work in CUDA code alone; perhaps I need to rewrite the AESround function in asm.
Also, BFI might be slower than other instructions, so there would be fewer but more expensive instructions.

Anyway, I will try it out.
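
The idea above can be checked on the host before touching the kernel. This is a rough C sketch, assuming a table of 256 `uint32_t` entries whose base address is a multiple of 1024. `addr_shift_add`, `addr_bit_insert` and `instructions_saved` are hypothetical names, and the addresses are plain integers, not real shared memory pointers. It only shows that the two address calculations agree under that alignment assumption. Whether the compiler ends up emitting a single BFI for it is a separate question.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_BYTES 1024u  /* 256 entries * 4 bytes */

/* Current scheme: extract the byte, scale it to a byte offset, add the base. */
static uint32_t addr_shift_add(uint32_t base, uint32_t x, uint32_t byte_no)
{
    uint32_t index = (x >> (8u * byte_no)) & 0xFFu;
    return base + (index << 2);
}

/* Proposed scheme: write the 8-bit index into bits 2..9 of an aligned base. */
static uint32_t addr_bit_insert(uint32_t base, uint32_t x, uint32_t byte_no)
{
    assert(base % TABLE_BYTES == 0);   /* the low 10 bits must be free */
    uint32_t index = (x >> (8u * byte_no)) & 0xFFu;
    return (base & ~0x3FCu) | (index << 2);
}

/* Instructions saved if every access drops from `before` to `after`. */
static uint32_t instructions_saved(uint32_t accesses, uint32_t before, uint32_t after)
{
    return accesses * (before - after);
}
```

With the figures from the post, `instructions_saved(5120, 3, 1)` gives the 10240 mentioned above. If the base is not 1024-byte aligned, the bit insert and the add no longer agree, which is why the alignment is part of the plan.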

Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1160. (Read 2347677 times)