Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 1158. (Read 2347599 times)

sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
With uint2 it uses more registers and spills to memory, so I increased the launch bound to 128 registers.
The code size is also bigger.
On the 780 Ti you should probably not unroll all the loops.

There are more speedups to come. Still some easy pickings.
legendary
Activity: 1400
Merit: 1050
I tried the rotation some time ago but wasn't convinced; however, I don't think I have tried it with uint2 since then (I hate working on Whirlpool... it takes forever to compile).

I get +20 MH/s on whirlpoolx on GTX 980
      +10 MH/s on 750 Ti
but   -30 MH/s on 780 Ti
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
1.5.40(sp-MOD) is available here: (27-feb-2015)

https://github.com/sp-hash/ccminer/releases/tag/1.5.40

The source code is available here:

https://github.com/sp-hash/ccminer

Differences from release 39

whirlcoin +12%

Faster hashing in

Whirlpool (x15, x17)
fugue (x13, x14, x15, x17)
shavite (x11, x13, x14, x15, x17) (tiny speedup)
shabal (x11, x13, x14, x15, x17) (tiny speedup)



sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Yeah, a few rotations and you can get the size down, still ouch.

The hashing function can probably be improved more, but 12% is OK for now.
I have also submitted a speedup in fugue (x13): precalculated part of the hash and removed instructions.
Building release 40 now.
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Sounds like you're still using tables...

Yes, but the table is 1/8 the size.
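
One likely reading of the 1/8 figure: in the reference Whirlpool code the eight lookup tables are byte rotations of one another, so a single table plus a rotate can stand in for all eight. The sketch below only illustrates that idea and is not sp_'s kernel. It is written in Python for readability, the function names are made up for this example, and the 64-bit value is held as a (lo, hi) pair of 32-bit halves, the way a CUDA uint2 would hold it.

```python
MASK32 = 0xFFFFFFFF

def ror64_uint2(lo, hi, n):
    """Rotate the 64-bit value hi:lo right by n bits using only 32-bit halves."""
    n %= 64
    if n >= 32:
        # Rotating by 32 simply swaps the two halves.
        lo, hi = hi, lo
        n -= 32
    if n == 0:
        return lo, hi
    new_lo = ((lo >> n) | (hi << (32 - n))) & MASK32
    new_hi = ((hi >> n) | (lo << (32 - n))) & MASK32
    return new_lo, new_hi

def rotated_table_entry(c0_entry, k):
    """Derive the entry of table k (0..7) from the matching table-0 entry."""
    lo, hi = c0_entry & MASK32, c0_entry >> 32
    lo, hi = ror64_uint2(lo, hi, 8 * k)
    return (hi << 32) | lo

# First entry of table 0 in the reference Whirlpool implementation.
C0_0 = 0x18186018C07830D8
for k in range(8):
    print(k, hex(rotated_table_entry(C0_0, k)))
```

Rotations by a multiple of 8 bits are the cheap case on a GPU, since they map to byte permutes rather than shift pairs; that part is an assumption about how a kernel would use this, not something shown here.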
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
I have rewritten the Whirlpool hash. It is 12% faster when mining whirlcoin (750 Ti).
x15 is +20 kH/s (750 Ti).

Will clean up a bit and submit to GitHub.
legendary
Activity: 3164
Merit: 1003
Thanks
That's better, but I'm getting a lot of boos with shares above target. Funny that we have to use -f 256 for that.
Edit: With diff4 and -f 256 the hashrate went from 50% to 75%. So I have to adjust diff4 to 2 or 8 etc. to see what happens when I get a chance.
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
Theblocksfactory is weird. When their vardiff starts climbing it throws rejects, so it goes back and repeats. Anyway, you can use a fixed minimum vardiff, and for a 6-card 750 Ti rig a value of 4 (.workername_diff4) seems to work fine with -f 256 on release 39.
legendary
Activity: 3164
Merit: 1003
-f 0.5 divides it in half, so the total = 1/4 hashrate. I did try on another pool and it's fine, but this AMD pool is s***. On Theblocksfactory I tried 2 but it goes over on shares. So I come to the conclusion that it is the Theblocksfactory pool.
I'm on http://digihash.co, very good, no problems. :)
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
I haven't done CUDA in quite a while, but here's a tip about AMD - using fucktons of LDS is bad for you. It reduces the waves in flight - more waves in flight usually mean more performance, up to a point.

The Maxwell can issue 2 instructions per clock cycle, but only one per cycle when the instruction uses shared/constant memory. Normal superscalar design. That's why I normally move constants into the instruction cache. I just need to make sure that the code size fits the cache.
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
If a pool is showing half the hashrate, chances are you're doing twice the expected work, so doubling your difficulty divide factor (--diff or -f) is probably what's missing. The default is 1, so you should try 2. Conversely, if it only accepts half the shares, then you're sending smaller chunks of work than the pool expects, in which case halving the diff helps (-f 0.5). If there are still rejected shares, try lowering the value to something like -f 0.0078125 or -f 0.00390625 to offset the default 128/256 multipliers while checking the pool's reported hashrate.
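
The two odd-looking values are just 1/128 and 1/256. A small sketch of the arithmetic, in Python: the helper names are hypothetical, and the second function assumes the pool's reported hashrate scales linearly with the factor, which is how the advice above reads but is not something the thread confirms.

```python
def offsetting_factor(multiplier):
    """-f value that cancels a built-in difficulty multiplier (e.g. 128 or 256)."""
    return 1.0 / multiplier

def suggested_factor(current_factor, reported_hashrate, actual_hashrate):
    """Scale the current -f by actual/reported (assumes a linear relationship)."""
    return current_factor * actual_hashrate / reported_hashrate

for multiplier in (128, 256):
    print(multiplier, offsetting_factor(multiplier))

# Pool shows half of what the miner reports, current factor is the default 1.
print(suggested_factor(1.0, reported_hashrate=50.0, actual_hashrate=100.0))
```

If a pool behaves the way Theblocksfactory is described in this thread, the linear assumption may not hold, so treat the result as a starting point and watch the reported hashrate.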
legendary
Activity: 3164
Merit: 1003
Need help on DGB coin, Qubit algo. Theblocksfactory, I think, needs a setting that I don't understand and neither does he; all other pools on this algorithm work fine.
On the Qubit algo, before #33 we needed -f 236 and now we don't. On this pool, for the past 6 months I have never gotten ccminer to work properly. With the older versions I needed to restart the program every 60 seconds to get the pool to show my true hashrate. With #39, no -f 236 is needed and it works fine, except that it only accepts exactly 1/2 my hashrate. I think it's a setting the pool owner needs to make. Again, I tried this on other pools and it works fine. Any thoughts on this, please? PS: The other pools have so little hashrate they only hit a block once in a while.
Thx
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
The CUDA and OpenCL code for Whirlpool consists of lookups into huge tables - which sucks for the GPU;

The lookup is done in shared memory and takes 1 cycle, but the internal RISC CPU needs 4 instructions to do the lookup (byteperm/add/shift/move).
With the BFINS instruction and aligned memory buffers this can be reduced to 2 instructions, although I failed to implement it in my first attempt (AES).
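
To make the instruction-count argument concrete, here is a rough Python model of the two ways to form a table address. It is a sketch of the general technique, not the SASS the compiler emits: the helper names are invented, the table is assumed to hold 256 entries of 8 bytes (2048 bytes), and "BFINS" is read as a bit-field insert (PTX calls it bfi). The insert only works if the table base is aligned to the table size, which is presumably why aligned buffers are mentioned.

```python
def extract_byte(word, k):
    """Byte k (0 = least significant) of a 64-bit word."""
    return (word >> (8 * k)) & 0xFF

def address_add(base, index, entry_size=8):
    """Conventional path: scale the index, then add it to the base."""
    return base + index * entry_size

def bfi(src, dst, pos, width):
    """Bit-field insert: copy the low `width` bits of src into dst at bit `pos`."""
    mask = ((1 << width) - 1) << pos
    return (dst & ~mask) | ((src << pos) & mask)

def address_bfi(base, index):
    """Same address with one insert; valid only for a 2048-byte aligned base."""
    if base % 2048 != 0:
        raise ValueError("table base must be aligned to the table size")
    # 8-byte entries put the index in bits 3..10 of the address.
    return bfi(index, base, 3, 8)

base = 0x4000                      # hypothetical aligned table address
state_word = 0x18186018C07830D8    # example 64-bit state word
index = extract_byte(state_word, 0)
print(hex(address_add(base, index)), hex(address_bfi(base, index)))
```

With an unaligned base the insert would overwrite address bits that belong to the base, so the add path is the only correct one in that case.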
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
would it be worth it though sp? ...
#crysx

Yes, because the same algo is used in x15 (the last of the hashing functions), so rewriting it will improve the x15 speed a lot.
legendary
Activity: 2912
Merit: 1091
--- ChainWorks Industries ---
would it be worth it though sp? ...

#crysx
sp_
legendary
Activity: 2926
Merit: 1087
Team Black developer
Looks like the CUDA implementation of Whirlpool can use 8 times fewer memory accesses with a small rewrite.

If Whirlpoolx is just Whirlpool with an extra XOR pass, I think a lot of work is needed to get close to the Wolf0 speed.

The 750 Ti only does 4.4 MH/s on Whirlpool. (This overview is a bit old; the latest miner is faster.)

legendary
Activity: 2912
Merit: 1091
--- ChainWorks Industries ---
Quote
I can't believe it stayed CPU-only for 2 months...
*lol* ...but too late... AMD is already there.

edit: and Whirlpool doesn't seem that slow on AMD - 50 MH/s per R9 270

100MH/s+ on 270X.

wolf - we can never know whether you are quoting YOUR miner / optimizations - or the standard that is available to the public ...

so this figure you have quoted mate - yours or public? ...

#crysx

Mine - he pointed out it wasn't slow on AMD; he's right.

damn ...

and how to get hold of your one? with the appropriate settings? ...

Wink

#crysx

You know the answer to that. But, anyway, I'm working on something more epic.

The CUDA and OpenCL code for Whirlpool consists of lookups into huge tables - which sucks for the GPU; that's CPU code. Even with my current code, I've noticed beyond a certain point, it doesn't matter how high I clock, because it's stalling on memory accesses. Those tables have so got to go away.

I have gotten the reference implementation down in C - surprisingly hard, seeing as it appears there's no code anywhere for it. This consists of mostly the block cipher W that was created with Whirlpool, which is based on AES - and I know AES backwards and forwards. Small issue - it's got a 2048 byte table for the multiplication, then a 256 byte Sbox.

I took the 2048 byte table used for the multiplies and reduced it to one 8-byte table by doing them manually - then I got rid of that by inlining them as constants. The S-box I split into its parts - three S-boxes containing 16 entries of 4 bits each, and bitsliced them. Does valid hashes so far, but I have a bit further to go before it's really GPU-ready.
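
For readers who want to see how the big tables relate to the small pieces described above, here is a minimal Python sketch built from the published Whirlpool specification, not from Wolf0's code. It uses plain lookups for the three 4-bit boxes (no bitslicing), and the function names are made up for this example. The three mini-boxes are E, its inverse, and R; the eight multipliers are the first row of the diffusion matrix; the field polynomial is 0x11D.

```python
# 4-bit mini-boxes from the Whirlpool specification.
E = [0x1, 0xB, 0x9, 0xC, 0xD, 0x6, 0xF, 0x3, 0xE, 0x8, 0x7, 0x4, 0xA, 0x2, 0x5, 0x0]
R = [0x7, 0xC, 0xB, 0xD, 0xE, 0x4, 0x9, 0xF, 0x6, 0x3, 0x8, 0xA, 0x2, 0x5, 0x1, 0x0]
E_INV = [E.index(v) for v in range(16)]

def sbox(x):
    """8-bit S-box assembled from the three 4-bit boxes."""
    hi = E[x >> 4]
    lo = E_INV[x & 0xF]
    r = R[hi ^ lo]
    return (E[hi ^ r] << 4) | E_INV[lo ^ r]

def xtime(a):
    """Multiply by 2 in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1."""
    a <<= 1
    return a ^ 0x11D if a & 0x100 else a

def gf_mul(a, c):
    """Multiply a by the small constant c in GF(2^8) using shift-and-xor."""
    result = 0
    while c:
        if c & 1:
            result ^= a
        a = xtime(a)
        c >>= 1
    return result

# First row of the diffusion matrix: the "8-byte table" of multipliers.
ROW = (1, 1, 4, 1, 8, 5, 2, 9)

def big_table_entry(x):
    """Rebuild one 64-bit entry of the 2048-byte table for input byte x."""
    s = sbox(x)
    entry = 0
    for c in ROW:
        entry = (entry << 8) | gf_mul(s, c)
    return entry

print(hex(sbox(0)), hex(big_table_entry(0)))
```

Since every multiplier is 9 or less, each product needs at most three doublings and a couple of XORs, which is what makes inlining them as constants practical.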

wow - so you have been a VERY busy lil wolfie then ... damn ...

so when will your expected final implementation come? ...

btw - pm for the 'you know the answer to that' situation with your idea of how that can be done ...

just trying to make the farm work THAT MUCH better - and that requires optimizations ... sooo - pm me please with what needs to be done on my end to get it organized ...

btw - the completion of the exchange from amd to nvidia is almost complete with the farm - so i can still run / test the optimizations with the gigabyte 280x oc cards left ( 16 of them currently ) ... once those are gone - the farm will be nothing but gigabyte 750ti oc lp cards ...

hence the reason for my interest in what / when / where / how / and how much ... Wink

#crysx
legendary
Activity: 2912
Merit: 1091
--- ChainWorks Industries ---
damn ...

and how to get hold of your one? with the appropriate settings? ...

Wink

#crysx
legendary
Activity: 2912
Merit: 1091
--- ChainWorks Industries ---
KlausT has added support in his fork:

https://github.com/KlausT/ccminer


There's no pluck there. DJM34 made a fork which runs at around 2.3kh/s per 750 Ti while sgminer does ~3.7kh/s.

kool ...

will get it all sorted tomorrow ...

i think the last time i tried to compile i was getting errors - and djm34 pointed out that i needed to use the latest cuda 6.5 ...

if thats the case for the sgminer / ccminer compiles - i will have a bit of work to do to build another linux machine thats more up to date than the fedora 19 x64 that i have ...

:|

#crysx
you just need to update cuda...

tanx mate ...

the cuda repo doesnt allow that update or upgrade ...

doing it manually means a crapload of work on our part for the farm - as the 'standardization' of the farm is incomplete ...

different motherboards - cpus and the like ...

im looking at an easier way of upgrading the whole farm to the latest cuda without a 'one by one' approach ...

ill build a fedora 20 x64 test machine - which will allow all this to happen ( and also finally allow testing of your neoscrypt miner ) which will make it easier to roll out when the hardware changes happen too ...

tanx again ...

#crysx
legendary
Activity: 2912
Merit: 1091
--- ChainWorks Industries ---
wolf - we can never know whether you are quoting YOUR miner / optimizations - or the standard that is available to the public ...

so this figure you have quoted mate - yours or public? ...

#crysx