
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 778. (Read 2347664 times)

member
Activity: 98
Merit: 10
Used allanmac's code (https://gist.github.com/allanmac/f91b67c112bcba98649d) from devtalk.nvidia.com to test TLB thrashing and compiled it with nvcc -m32. It didn't alleviate the TLB thrashing issue; 970 still dives like a stone past 2GB. Not sure if this is representative of memory bandwidth in Ethash though, just sharing to see if anyone can tweak and improve the TLB situation.

Perhaps compiling the ether miner for 32-bit will help? Cached pointer sizes will go from 64-bit to 32-bit (and double the TLB limit?). You would need to remove the CPU verification code because it uses 64-bit libraries, I think.

Thought of that but it's going to be troublesome. You only have a 4GB address space, with Windows already sucking up ~half. Then you have to load the 1.3 GB DAG from disk, and allocate 1.3GB of GPU RAM (which, AFAIK sits in the same space, although it isn't pinned to host). This doesn't fit. So then you would have to read the DAG from disk in small chunks and copy it over to GPU RAM. And when that's all done, you will have to pass on all solutions to a special light version of ethminer that does light verification, as it can't load a DAG into RAM for the same reasons. Or you simply don't verify and risk some Boo's.
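
The address-space arithmetic behind that objection can be sketched roughly as follows. The figures are the approximate ones from the post (4 GB total, about half taken by Windows, a 1.3 GB DAG in host RAM plus a 1.3 GB GPU allocation assumed to live in the same space). The helper name `fits_in_32bit_process` and the megabyte constants are made up for illustration, not taken from ethminer.

```c
#include <stdint.h>

/* All sizes in megabytes; rough figures taken from the post above. */
#define ADDRESS_SPACE_MB   4096u  /* 32-bit process: 4 GB in total */
#define RESERVED_BY_OS_MB  2048u  /* assumption: Windows keeps about half */

/* Hypothetical helper: returns 1 if the host-side DAG buffer and the GPU
   allocation (assumed to occupy the same address space) both fit in what
   is left, 0 otherwise. */
int fits_in_32bit_process(uint32_t dag_host_mb, uint32_t dag_gpu_mb)
{
    uint32_t usable = ADDRESS_SPACE_MB - RESERVED_BY_OS_MB;
    return (uint64_t)dag_host_mb + dag_gpu_mb <= usable;
}
```

With roughly 1331 MB for each copy the check fails, which is why the DAG would have to be streamed from disk in small chunks; a small staging buffer (say 64 MB) next to the GPU allocation passes this simplified check.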

Then, when that's all done, you're not even sure if it fixes the problem. You could try getting a 32-bit version of dagSimCL to work.

I believe Epsylon3/tpruvot has been trying to get a 32-bit version of ethminer to work a while back. Can't find the source anymore.
legendary
Activity: 1470
Merit: 1114
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
The cpu verification is only done when the gpu finds a solution.
I know why changing the verification code made things faster. I was CPU mining 8
threads at the time so it was slowing down the CPU.

But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.

Tried that in cpuminer, didn't help. I only managed to get another 1% out of c11, not sure why, expected more.
Will take another look.

No other algos benefit from the fast ctx reinit but you should try it in ccminer, the GPU kernel, that is.
legendary
Activity: 1470
Merit: 1114
You are an assembly language guy, do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep is the memory buffer, does it delay writes to prioritize
reads, how big is a cache line, etc. I know none of this stuff, maybe you do and could use it to speed
up the hot spots.

This is something the compiler is very good at. The CUDA core is a 3 + operation RISC processor with up to 256 registers.
It is built for the compiler.

Sometimes you need to move code around, manually unroll some loops etc. Verify the result by disassembling. (this is what DJM34 is calling random stuff)

But don't let the code size grow too big, the instruction cache is small.
...

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.

I was talking more about performing loads as soon as possible to give time for mem to respond before
you need the data. It also fills the cache line for subsequent loads. If CUDA supports read priority you
can even issue a store before a load and the load will have priority. You just have to watch for register
conflicts.
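
A minimal sketch of the "load early" idea, in plain C rather than a real kernel: the next element is fetched before the arithmetic on the current one, so the memory request is in flight while the work is done. This is illustrative only. The function name is hypothetical, and an optimizing compiler (gcc, or nvcc for GPU code) is free to reorder these statements on its own, so whether it helps has to be checked in the disassembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum of squares with the next load issued before
   the current value is consumed. */
uint64_t sum_squares_early_load(const uint32_t *data, size_t n)
{
    if (n == 0)
        return 0;

    uint64_t acc = 0;
    uint32_t cur = data[0];              /* first load, before the loop */

    for (size_t i = 1; i < n; i++) {
        uint32_t next = data[i];         /* issue the next load first */
        acc += (uint64_t)cur * cur;      /* work on data already loaded */
        cur = next;
    }
    acc += (uint64_t)cur * cur;          /* last element */
    return acc;
}
```

The extra live variable (`next`) is the register cost mentioned above: every load hoisted this way holds a register for longer.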

There is also issuing different types of instructions on the same clock to improve superscalar
operation.

These kinds of things are hard for a normal compiler to do because it is specific to each processor,
but if anyone can do it it's CUDA because they have one HW architecture, one run time system and
one compiler.

And another thing, you trolled me first. :)
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.
The cpu verification is only done when the gpu finds a solution.
I know why changing the verification code made things faster. I was CPU mining 8
threads at the time so it was slowing down the CPU.

But in ccminer you can just remove the verification. It's there so that you can check if you break the hash when you change something.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
You are an assembly language guy, do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep is the memory buffer, does it delay writes to prioritize
reads, how big is a cache line, etc. I know none of this stuff, maybe you do and could use it to speed
up the hot spots.

This is something the compiler is very good at. The CUDA core is a 3 + operation RISC processor with up to 256 registers.
It is built for the compiler.

Sometimes you need to move code around, manually unroll some loops etc. Verify the result by disassembling. (this is what DJM34 is calling random stuff)

But don't let the code size grow too big, the instruction cache is small.
...

While you were trolling my thread I added another 0.4% in the decred algo.
I will try to do 5% and include it in my donation miner.
legendary
Activity: 1470
Merit: 1114
You are an assembly language guy, do you reorder instructions to maximize instruction throughput?
It requires detailed knowledge of the processor such as how many instructions can be fetched per clock,
how many can be executed per clock, how deep is the memory buffer, does it delay writes to prioritize
reads, how big is a cache line, etc. I know none of this stuff, maybe you do and could use it to speed
up the hot spots.
legendary
Activity: 1470
Merit: 1114
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.

I know why changing the verification code made things faster. I was CPU mining 8
threads at the time so it was slowing down the CPU.
legendary
Activity: 1470
Merit: 1114
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.

I may not have realized I was looking at verification code at the time but I know what it is.
Maybe my changes can be applied to the GPU code and you'll get your 30%
legendary
Activity: 1470
Merit: 1114
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.

My changes have nothing to do with avoiding branches but avoiding work.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
Whatever it is it's faster.

The cpu verification is only done when the gpu finds a solution.
legendary
Activity: 1470
Merit: 1114
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.

Whatever it is it's faster.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
this is in the cpu verification code. The gpu code is different. There we have precalc tables of the states to avoid conditional branches.
legendary
Activity: 1470
Merit: 1114
when is the last time you delivered 30% in less than an hour?

Today. your quark kernel.

Since skein is much faster than groestl we only do skein and throw away 50% of the hashes.

    if (hash[0] & 0x8)
    {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512 (&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512 (&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }


There was an optimization made in cpuminer where, if it was determined that a second
round of groestl was necessary, the existing hashes would be thrown away on the belief
it would take longer to complete the second groestl than to start over. It didn't work.

However, I might try ccminer's logic. cpuminer uses a state machine as
the engine. ccminer just uses a simple if.

I'm also going to look at other contexts. Selectively reinitializing necessary fields may be
quicker than the current implementation of copying a saved initialized context.
Both are quicker than what ccminer does.
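
The two reinit strategies mentioned here can be sketched on a toy context. The struct layout, the placeholder IV and every `toy_*` name below are assumptions made up for illustration; the real sph_* contexts differ per algorithm. The point is only that the selective version touches fewer bytes, as long as the skipped fields (here the input buffer) are always written before they are read.

```c
#include <stdint.h>
#include <string.h>

/* Toy hash context; the layout is an assumption, not a real sph_* struct. */
typedef struct {
    uint8_t  buf[64];   /* partial input block, overwritten before it is read */
    size_t   ptr;       /* number of bytes currently in buf */
    uint64_t h[8];      /* chaining state */
    uint64_t bcount;    /* blocks processed so far */
} toy_ctx;

static const uint64_t TOY_IV[8] = {1, 2, 3, 4, 5, 6, 7, 8}; /* placeholder IV */

/* Full init, done once up front. */
void toy_init(toy_ctx *c)
{
    memset(c, 0, sizeof *c);
    memcpy(c->h, TOY_IV, sizeof c->h);
}

/* Approach 1: copy a saved, already initialized context (whole struct). */
void toy_reinit_copy(toy_ctx *c, const toy_ctx *saved)
{
    memcpy(c, saved, sizeof *c);
}

/* Approach 2: reset only the fields the next hash depends on. */
void toy_reinit_selective(toy_ctx *c)
{
    memcpy(c->h, TOY_IV, sizeof c->h);
    c->ptr = 0;
    c->bcount = 0;
    /* buf is deliberately left alone: it is refilled before use. */
}

/* Returns 1 if both approaches leave the same live state behind. */
int toy_reinit_matches(void)
{
    toy_ctx saved, a, b;
    toy_init(&saved);

    memset(&a, 0xAA, sizeof a);     /* simulate a dirty, used context */
    memset(&b, 0xAA, sizeof b);
    toy_reinit_copy(&a, &saved);
    toy_reinit_selective(&b);

    return a.ptr == b.ptr && a.bcount == b.bcount &&
           memcmp(a.h, b.h, sizeof a.h) == 0;
}
```

In this toy layout the copy moves the whole struct while the selective reset writes only the chaining state and two counters. Whether that difference is measurable depends on the actual context size of each algorithm.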
legendary
Activity: 1470
Merit: 1114
when is the last time you delivered 30% in less than an hour?

Today. your quark kernel.

That doesn't make sense.

Anyway here it is. This one's on me.

https://drive.google.com/file/d/0B0lVSGQYLJIZbllYWENUV0l0VnM/view?usp=sharing
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
when is the last time you delivered 30% in less than an hour?

Today. your quark kernel.

Since skein is much faster than groestl we only do skein and throw away 50% of the hashes.

    if (hash[0] & 0x8)
    {
        sph_groestl512_init(&ctx_groestl);
        sph_groestl512 (&ctx_groestl, (const void*) hash, 64);
        sph_groestl512_close(&ctx_groestl, (void*) hash);
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512 (&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }

    if (hash[0] & 0x8)
    {
        /* don't hash the rest */
    }
    else
    {
        sph_skein512_init(&ctx_skein);
        sph_skein512 (&ctx_skein, (const void*) hash, 64);
        sph_skein512_close(&ctx_skein, (void*) hash);
    }
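
A self-contained sketch of that filter, without the sph_* library: candidates whose branch bit selects groestl are dropped and only the skein path is counted. The helper names and the use of a 32-bit first word are assumptions for illustration; the real kernel tests `hash[0]` of the intermediate quark hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch test from the quark code above: bit 3 of the first hash word. */
static int wants_groestl(uint32_t first_word)
{
    return (first_word & 0x8) != 0;
}

/* Hypothetical helper: counts the candidates that survive the filter,
   i.e. those that would take the cheaper skein branch. */
size_t count_kept(const uint32_t *first_words, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (wants_groestl(first_words[i]))
            continue;       /* don't hash the rest, discard this nonce */
        kept++;             /* the skein branch would run here */
    }
    return kept;
}
```

If the branch bit is uniformly distributed, about half of the candidates are dropped, matching the "throw away 50% of the hashes" remark.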

legendary
Activity: 1470
Merit: 1114
I just improved quark on r74. Do you want it?
1.5.74-jdd+
What did you do?
Exactly what I told you. You can have it if you open some of your private stash.
Release 78 is faster.
Want me to do 78 too?

Yes I want you to do +30%

when is the last time you delivered 30% in less than an hour?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I just improved quark on r74. Do you want it?
1.5.74-jdd+
What did you do?
Exactly what I told you. You can have it if you open some of your private stash.
Release 78 is faster.
Want me to do 78 too?

Yes I want you to do +30%
legendary
Activity: 1470
Merit: 1114
I just improved quark on r74. Do you want it?
1.5.74-jdd+
What did you do?
Exactly what I told you. You can have it if you open some of your private stash.

Release 78 is faster.

Want me to do 78 too?
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I just improved quark on r74. Do you want it?
1.5.74-jdd+
What did you do?
Exactly what I told you. You can have it if you open some of your private stash.

Release 78 is faster.
legendary
Activity: 1470
Merit: 1114
I just improved quark on r74. Do you want it?
1.5.74-jdd+

What did you do?

Exactly what I told you. You can have it if you open some of your private stash.