
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1160. (Read 2347677 times)

legendary
Activity: 1797
Merit: 1028
QUBIT --

I've been mining qubit on Yaamp with almost no accepts. There are no errors and maybe one or two accepts, but a long chain of block change messages, one after another. My 960 is getting about 7880 kH/s, but doing no productive work. I am using version 39 on Windows 7.

sp-ccminer v39:


scryptr image


tpruvot ccminer v1.5.3:


scryptr image

--scryptr
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
How is the next version of the spreadminer coming along?

I'm rewriting the SHA part. Perhaps I'll be done by the weekend.
sr. member
Activity: 318
Merit: 250
How is the next version of the spreadminer coming along?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Didn't work.

The strange thing is that this doesn't work either:

__device__ __forceinline__
uint32_t  SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t *tmp;
   tmp=x;     
   return tmp[low];
}
...
tmp=SPRead(sharedmemory,x0 &0xff);


but this works:

tmp= sharedmemory[x0 &0xff];

legendary
Activity: 1400
Merit: 1050
Any coders who can help me?

I am trying to reduce the number of instructions from 4 to 2 per shared memory access in cuda_x11_aes.cu.

I am trying to force the pointer to be correct with this code:

__device__ __forceinline__
uint32_t  SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t tmp,res;
   asm("bfi.b32 %0, %1, %2, 10 , 8;" : "=r"(tmp) : "r"(low), "r"(x));
   asm("ld.shared.u32  %0,[%1];" : "=r"(res) : "r"(tmp));
   return res;
}
tmp in the second instruction should be a pointer
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Any coders who can help me?

I am trying to reduce the number of instructions from 4 to 2 per shared memory access in cuda_x11_aes.cu.

I am trying to force the pointer to be correct with this code:

__device__ __forceinline__
uint32_t  SPRead(uint32_t *const __restrict__ x, uint32_t low)
{
   uint32_t tmp,res;
   asm("bfi.b32 %0, %1, %2, 10 , 8;" : "=r"(tmp) : "r"(low), "r"(x));
   asm("ld.shared.u32  %0,[%1];" : "=r"(res) : "r"(tmp));
   return res;
}

The PTX looks good:

cvta.shared.u32    %r393, _Z23x11_echo512_gpu_hash_64jjPy$__cuda_local_var_204566_57_non_const_sharedMemory;
// inline asm
bfi.b32 %r232, %r18, %r393, 10 , 8;
// inline asm
// inline asm
ld.shared.u32  %r235,[%r232];


But the program crashes with an illegal memory access. The shared memory pointer is aligned to a 1024-byte boundary.


Without the pointer hack I get code like this:

   bfe.u32 %r243, %r4, %r440, 8;
   // inline asm
   shl.b32    %r468, %r243, 2;
   add.s32    %r469, %r468, %r451;
   ld.shared.u32    %r470, [%r469+1024];

4 instructions.
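
A note on the `bfi` operands, offered as a hedged reading rather than a confirmed diagnosis: as I understand the PTX ISA, `bfi.b32 d, a, b, pos, len` inserts the low `len` bits of `a` into `b` starting at bit `pos`. If that is right, `10, 8` writes the index into address bits 10-17, while a `uint32_t` table index needs bits 2-9, i.e. `2, 8`. The host-side C++ model below compares both operand choices with the address from the bfe/shl/add sequence. All function names are made up for illustration, and the base address used in the checks is a hypothetical 1024-byte-aligned value.

```cpp
#include <cstdint>
using std::uint32_t;

// Mask covering `len` bits starting at bit `pos` (assumes 0 < len < 32, pos + len <= 32).
constexpr uint32_t field_mask(uint32_t pos, uint32_t len) {
    return ((1u << len) - 1u) << pos;
}

// Host-side model of PTX bfi.b32 (my reading of the ISA, not vendor code):
// write the low `len` bits of `a` into `b`, starting at bit `pos`.
constexpr uint32_t bfi_b32(uint32_t a, uint32_t b, uint32_t pos, uint32_t len) {
    return (b & ~field_mask(pos, len)) | ((a << pos) & field_mask(pos, len));
}

// Address produced by the bfe/shl/add sequence: base + 4 * (index & 0xff).
constexpr uint32_t baseline_address(uint32_t base, uint32_t index) {
    return base + ((index & 0xffu) << 2);
}

// True when bfi with the given start bit reproduces the baseline address
// for every one of the 256 possible table indices.
inline bool bfi_matches_baseline(uint32_t base, uint32_t pos) {
    for (uint32_t index = 0; index < 256; ++index) {
        if (bfi_b32(index, base, pos, 8) != baseline_address(base, index)) {
            return false;
        }
    }
    return true;
}
```

With a base of `0x400`, `bfi_matches_baseline(0x400, 2)` holds and `bfi_matches_baseline(0x400, 10)` does not. A second thing that may be worth checking: the generated PTX uses `cvta.shared`, which as far as I know produces a generic address, whereas `ld.shared` expects an address inside the shared window, so the crash might have more than one cause.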
legendary
Activity: 1510
Merit: 1003
cuda_x11_aes.cu is excluded from the project file, so if you change it, it will not be rebuilt unless you save echo or shavite or do a full build. To measure, you can use Fresh, because it has fewer chained hashing algos.

I checked a VS build log after the rollback. cuda_x11_aes.cu was #included in 2 other .cu files that were rebuilt by VS, so I think I did it right.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
qubit has slowed from 3609 kH/s on the previous version to 3571 kH/s on the current one - GTX 850M Linux, using "git pull" to update, build.sh has -O3 passed to configure

The fresh algo uses 2 rounds of shavite, 2 of simd and 1 echo.

shavite512
simd512
shavite512
simd512
echo512

Shavite and echo are affected by my AES change, which is faster on the compute 5.2 cards. I am rewriting the AES now, so the next commit will hopefully increase the performance of fresh.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Another problem is 64-bit vs 32-bit, and Windows vs Linux.

I optimize for Windows and 32-bit. On Linux, builds are normally 64-bit. (Some optimizations are faster when building 32-bit.)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
The last change in AES is slower on the 750ti. I am working to improve it.
I've rolled back in cuda_x11_aes.cu and seen no difference. Also commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.

x14 has 14 chained algos. If I optimize shabal by 2%, the increase in the total hashrate is very small, something like 0.005% faster. This is because groestl, echo, simd etc. are much slower and take most of the time.
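
The effect described here is Amdahl's law applied to a hash chain. A minimal sketch, assuming the only input we know is the 2% shabal gain from the post: the share of total x14 time spent in shabal is not given in the thread, so any fraction passed in is a hypothetical value, and the function names are made up for illustration.

```cpp
// Overall speedup of the whole chain when one stage that takes `fraction`
// of the total runtime becomes `stage_speedup` times faster (Amdahl's law).
constexpr double overall_speedup(double fraction, double stage_speedup) {
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup);
}

// The same result expressed as a percentage gain in total hashrate.
constexpr double overall_gain_percent(double fraction, double stage_speedup) {
    return (overall_speedup(fraction, stage_speedup) - 1.0) * 100.0;
}
```

For example, `overall_gain_percent(0.01, 1.02)` (shabal assumed to take 1% of the time) comes out just under 0.02%. Under this model, a total gain of about 0.005% would correspond to shabal taking roughly 0.25% of the runtime; that share is an inference from the formula, not a measurement.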
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
The last change in AES is slower on the 750ti. I am working to improve it.
I've rolled back in cuda_x11_aes.cu and seen no difference. Also commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.

cuda_x11_aes.cu is excluded from the project file, so if you change it, it will not be rebuilt unless you save echo or shavite or do a full build. To measure, you can use Fresh, because it has fewer chained hashing algos.
The differences are small, but the PTX code (assembly code) shows fewer instructions with my changes. Also, different cards have different timings.
A small speedup of 1 KHASH will not be noticeable, but 100 small speedups will.
legendary
Activity: 1510
Merit: 1003
The last change in AES is slower on the 750ti. I am working to improve it.
I've rolled back in cuda_x11_aes.cu and seen no difference. Also commit "Faster shabal" https://github.com/sp-hash/ccminer/commit/c7eef5275ab77f02d3d86601092774fae8a29cd7 doesn't change anything in rates on my setup.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
The last change in AES is slower on the 750ti. I am working to improve it.
member
Activity: 90
Merit: 10
Can you tell me a good video card for $370?
Tq
legendary
Activity: 3164
Merit: 1003
#39 gives less hashrate on most algos, so I'm still using the best one, #33. But keccak went from 162,000 kH/s to 170,000 kH/s with #39 on a 750ti.
legendary
Activity: 1176
Merit: 1015
Those algos don't use much VRAM, so it shouldn't be a problem (actually there aren't many algos that use more than 2 GB).

Sorry sp_ for hijacking this thread, but I need to ask you and djm34: which algos benefit from more than 2 GB of VRAM? And also, why do some algos get a boost from memory OC while most don't? Usually it is all about core clock; you can heavily downclock memory to save power and there is no difference in mining performance.

And if that 2 GB doesn't matter, what is holding back the GTX 960 vs the 750ti? On djm's neoscrypt miner the 960 does +100% against the 750ti; on other algos the difference is mostly somewhere in the +60-90% area.

Sorry guys, a computer enthusiast without coding skills just wants to learn...
legendary
Activity: 1400
Merit: 1050

New build with more hashing power in most algos.

from release 38 we have

-faster keccak
-faster fugue
-faster hamsi
-faster aes
-faster echo
-some bugfixes

1.5.39(sp-MOD) is available here: (15-feb-2015)

https://github.com/sp-hash/ccminer/releases/tag/1.5.39

The sourcecode is available here:

https://github.com/sp-hash/ccminer


Did you check how much memory is needed for this? Could it be the 970 memory limit at 3.5 GB?

The 980 has 4 GB; besides that, I think there is no difference between the 970 and the 980 except that the 980 has more cores.

still waiting for my watercooled 980 to be delivered
Those algos don't use much VRAM, so it shouldn't be a problem (actually there aren't many algos that use more than 2 GB).
hero member
Activity: 774
Merit: 500
Lazy Lurker Reads Alot
New build with more hashing power in most algos.

from release 38 we have

-faster keccak
-faster fugue
-faster hamsi
-faster aes
-faster echo
-some bugfixes

1.5.39(sp-MOD) is available here: (15-feb-2015)

https://github.com/sp-hash/ccminer/releases/tag/1.5.39

The sourcecode is available here:

https://github.com/sp-hash/ccminer


Did you check how much memory is needed for this? Could it be the 970 memory limit at 3.5 GB?

The 980 has 4 GB; besides that, I think there is no difference between the 970 and the 980 except that the 980 has more cores.

still waiting for my watercooled 980 to be delivered
newbie
Activity: 14
Merit: 0
qubit has slowed from 3609 kH/s on the previous version to 3571 kH/s on the current one - GTX 850M Linux, using "git pull" to update, build.sh has -O3 passed to configure
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Tonight I will try to use the BFI instruction to modify pointers. I will align sharedmem to 1024-byte boundaries and insert the 8-bit offset into bits 2-9 of the pointer directly.

Today a byteperm+shift+add is needed to calculate the sharedmem address (echo/shavite).
Hopefully AES will be faster.

3 instructions down to 1 instruction.

In echo alone there are 5120 random sharedmem accesses. If this works, 1 round of Echo (hash algo #11 in x11) will use around 10240 fewer instructions to do the same work. IMHO a massive improvement.

In my bitslice groestl improvement I removed around 1000 instructions and the x11 hash got 50-100 KHASH faster on the 750ti.
With 10,000 instructions removed, perhaps 500 KHASH faster is possible.

But I assume 32-bit pointers, so it might not work on 64-bit builds. And it might not work in CUDA code alone. Perhaps I need to rewrite the AESround function in asm.
And the BFI might be slower than other instructions, so fewer instructions but more expensive ones.

Anyway, I will try it out.
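
Why the 1024-byte alignment matters can be checked on the host. A minimal sketch, assuming a table of 256 `uint32_t` entries (1024 bytes) and 32-bit addresses as in the post; the function names are hypothetical and the code only models the address arithmetic, not the actual kernel.

```cpp
#include <cstdint>
using std::uint32_t;

// Conventional address: turn the index into a byte offset and add it to the base.
constexpr uint32_t address_by_add(uint32_t base, uint32_t index) {
    return base + ((index & 0xffu) << 2);
}

// Pointer-insert address: overwrite bits 2..9 of the base with the 8-bit index.
constexpr uint32_t address_by_insert(uint32_t base, uint32_t index) {
    return (base & ~0x3fcu) | ((index & 0xffu) << 2);
}

// The insert trick is only valid if both forms agree for all 256 indices,
// which is the case when bits 2..9 of the base are zero (guaranteed by
// 1024-byte alignment).
inline bool insert_is_safe(uint32_t base) {
    for (uint32_t index = 0; index < 256; ++index) {
        if (address_by_add(base, index) != address_by_insert(base, index)) {
            return false;
        }
    }
    return true;
}

// Instructions saved per round, given the number of table accesses and the
// instruction count per address calculation before and after the change.
constexpr uint32_t instructions_saved(uint32_t accesses, uint32_t before, uint32_t after) {
    return accesses * (before - after);
}
```

With these definitions, `insert_is_safe(0x400)` is true while `insert_is_safe(0x404)` is false, and `instructions_saved(5120, 3, 1)` gives the 10240 figure quoted above. Whether fewer instructions translate into a higher hashrate still depends on the cost of BFI on the target card.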