Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 973. (Read 3426989 times)

hero member
Activity: 756
Merit: 502
cbuchner1, did you note my earlier post about autotune problems and K kernel performance regression?

okay, I have just replaced the ailing PSU in my main development PC, which allows me to put more stress on the GPUs again without it turning off unexpectedly.

So that regression really is bad: 254 kHash/s down to 204 kHash/s with the same kernel launch parameters between 2013-12-18 and current github.
That's a 20% drop in performance. I might play around a bit to see what I can find.

I did not find the same problem with the T kernel, even though it underwent very similar changes!

EDIT1: the majority of the discrepancy stems from my redefinition of what "warp" means in Dave's Kepler kernel (to be more in line with the CUDA definition of a warp). Hence the equivalent launch config for the current github release has to use four times the number of blocks to be comparable, so I have to go from -l K7x32 to -l K28x32. Then I end up with a drop from 254 kHash/s to only 220 kHash/s. Still bad, but not quite that much.
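
The arithmetic behind these figures can be checked with a short sketch. It only uses the numbers quoted in this post; the helper name percent_drop is made up for illustration and is not part of cudaMiner.

```python
def percent_drop(before, after):
    """Relative drop from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# Hash rates quoted in the post, in kHash/s.
old_rate = 254.0        # 2013-12-18 build, -l K7x32
naive_rate = 204.0      # current github, same -l string
adjusted_rate = 220.0   # current github, -l K28x32

# Under the redefined "warp", the comparable config needs 4x the blocks.
old_blocks = 7
equivalent_blocks = old_blocks * 4

print(equivalent_blocks)                                  # 28
print(round(percent_drop(old_rate, naive_rate), 1))       # 19.7
print(round(percent_drop(old_rate, adjusted_rate), 1))    # 13.4
```

So the like-for-like regression is closer to 13% than to 20%.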

EDIT2: I find my "simplifications" in read_keys_direct and write_keys_direct to be the culprit. It turns out they have a huge performance impact, despite requiring far fewer instructions.

Christian

sr. member
Activity: 350
Merit: 250
Repeating what someone else said lmao. Sarcasm not needed
full member
Activity: 182
Merit: 100
I believe it's to do with the memory bus speed limiting the amount of memory it can use
You have any idea how logical that sounds?
/sarcasm off
Probably has to do with the whole 32/64 bit thing; running an x64 build doesn't necessarily fix that either.
sr. member
Activity: 350
Merit: 250
I believe it's to do with the memory bus speed limiting the amount of memory it can use
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
What's the reason behind failing to allocate more than 3GB of VRAM on Titans?
It seems that wherever you look (games, applications, whatever), it always has problems on that front.
sr. member
Activity: 350
Merit: 250
Even getting it to use more memory should give me an increase. I say should
hero member
Activity: 756
Merit: 502
No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store it should jump up a lot more

make that a "might" jump up a lot more.

I've had my fair share of optimization failures...
sr. member
Activity: 350
Merit: 250
No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store it should jump up a lot more
13G
newbie
Activity: 17
Merit: 0
Results from my latest build: here is the launch config I stuck with, and the results for -L2 to -L6

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4, I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system usage back, so I'm wondering how much it is actually using.

Memory-wise it is using 2412MiB / 3071MiB.
Not sure about actual GPU usage. I wonder if I could get more memory usage than I am getting now; that may get me more out of it.


Great improvement! Thank you!
GTX TITAN with "-a scrypt-jane -d 0 -i 0 -H 2 -C 0 -m 0 -b 32768 -L 5 -l T69x4 -s 120" now gets 4.7 khash/s!
full member
Activity: 125
Merit: 100
But somehow I do not see the advantage of this. Typically launch configurations are chosen so that a single stream already fully loads the GPU's multiprocessors.

If this were true, my stream test would have given me a lower hash rate due to overhead, but it doubled. Just because MSI Afterburner or another program is reporting a 90-something % GPU usage doesn't mean streams won't help. When I ran the two kernels I ran all A's first, synced, then the B's. So as long as the batch size for the NFactor is small enough to spawn off enough kernels, they'll run concurrently. Again, another optimization that won't work so well for lower NFactors. But I was actually seeing 99-100% GPU usage with the doubled hash rate. I had the same thought as you before I discovered streams, but the NVIDIA Visual Profiler and the sample code convinced me otherwise. I had another raster compression routine that gave me a 40% increase when streamed, at a point where I thought it was already maxed out.

I'm just talking this out here. I know that in the current state the results won't be valid, due to the kernels tripping all over each other's memory space. I just wanted to see the potential.

I'm going to take your suggestion, since you know the code, and see if I can understand it well enough to break it up into 4 regions.  So far, I can tell this will be a steep learning curve.

Thanks

Joe
sr. member
Activity: 350
Merit: 250
Results from my latest build: here is the launch config I stuck with, and the results for -L2 to -L6

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4, I have noticed that it has now settled down at 4.96 - 5.05 after being on for 4 hours, and I still have a lot of system usage back, so I'm wondering how much it is actually using.

Memory-wise it is using 2412MiB / 3071MiB.
Not sure about actual GPU usage. I wonder if I could get more memory usage than I am getting now; that may get me more out of it.
newbie
Activity: 12
Merit: 0
Where can I find the 2014-01-17 version?
hero member
Activity: 756
Merit: 502
Thanks. Does the -b parameter have any influence when running with -i 0?
In any case, it didn't change the speed visibly.

-b still has an influence, as in reducing overhead for CUDA kernel calls. Bigger chunks of data to work with means less overhead.
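
A rough way to picture this, as a sketch only: assume a fixed amount of work is cut into chunks of at most the -b size, one kernel call per chunk, and each call carries a fixed cost. The function name and the workload of 32768 units are hypothetical; they are not taken from cudaMiner.

```python
import math

def kernel_launches(work_units, batch_size):
    """Number of kernel calls needed if each call handles at most batch_size units."""
    return math.ceil(work_units / batch_size)

# Hypothetical workload of 32768 units, split with three different -b values.
work_units = 32768
for batch_size in (1024, 8192, 32768):
    print(batch_size, kernel_launches(work_units, batch_size))
```

With these assumed numbers the call count goes from 32 to 4 to 1, so the fixed per-call cost is paid less often as -b grows. Whether that shows up as visible speed depends on how large the per-call cost is relative to the work itself.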
member
Activity: 106
Merit: 10
Thanks. Does the -b parameter have any influence when running with -i 0?
In any case, it didn't change the speed visibly.
hero member
Activity: 756
Merit: 502
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:

I wish I knew what is causing these... try passing a -b 8192 for a bit more speed.
member
Activity: 106
Merit: 10
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:
Code:
[2014-01-29 15:11:26] GPU #1: GeForce GT 640 result does not validate on CPU (i=23, s=0)!
[2014-01-29 15:11:27] GPU #1: GeForce GT 640, 1.45 khash/s
[2014-01-29 15:11:27] accepted: 51/51 (100.00%), 1.45 khash/s (yay!!!)

Otherwise it's running smoothly on Windows. The build was made today.
Start Parameters:
cudaminer.exe -a scrypt-jane -i 0 -l K27x2 -o http://yac.coinmine.pl:8882 -O ...:... -H 2 -d 1
hero member
Activity: 756
Merit: 502

To be more specific, I was using 4 streams on nv_scrypt_core_kernelA and nv_scrypt_core_kernelB inside the NVKernel::run_kernel.  So those are the kernels I was referring to.  Too bad the code in these kernels looks like witchcraft to me at the moment. LOL

What you have to know is that the kernels named "A" write to the scratchpad (yes, the ENTIRE scratchpad) and the kernels labeled "B" read from random positions in the scratchpad. So there is an A->B dependency: A has to complete before B can run.

If you still want to try running multiple streams, divide the hashing (nonce) range into 4 equally sized regions

then you can run

A -> B region 1
A -> B region 2
A -> B region 3
A -> B region 4

all simultaneously on 4 streams, as their scratchpad areas do not overlap. But somehow I do not see the advantage of this. Typically launch configurations are chosen so that a single stream already fully loads the GPU's multiprocessors.
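
The region split can be sketched on the host side as plain bookkeeping. This is an illustration, not cudaMiner code: the function names, the batch of 4096 nonces and the 128 KiB of scratchpad per hash (the figure for scrypt with N=1024, r=1) are all assumptions chosen for the example.

```python
def split_regions(first_nonce, count, num_streams=4):
    """Split [first_nonce, first_nonce + count) into equal contiguous regions.

    Returns a list of (stream, region_first_nonce, region_size) tuples.
    """
    if count % num_streams != 0:
        raise ValueError("count must be divisible by num_streams")
    size = count // num_streams
    return [(s, first_nonce + s * size, size) for s in range(num_streams)]

def scratchpad_slices(regions, bytes_per_hash):
    """Byte range [start, end) of the scratchpad that each region would own."""
    slices = []
    offset = 0
    for stream, _, size in regions:
        end = offset + size * bytes_per_hash
        slices.append((stream, offset, end))
        offset = end
    return slices

def overlaps(slices):
    """True if any two byte ranges intersect."""
    ordered = sorted(slices, key=lambda s: s[1])
    return any(a[2] > b[1] for a, b in zip(ordered, ordered[1:]))

# Assumed example: 4096 nonces per batch, 128 KiB of scratchpad per hash.
regions = split_regions(first_nonce=0, count=4096)
slices = scratchpad_slices(regions, bytes_per_hash=128 * 1024)
for stream, start, end in slices:
    print(stream, start, end)
print(overlaps(slices))
```

Each stream would then run its own A -> B pair against its own slice only, so the A->B ordering is needed within a stream but not between streams.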


I currently use two streams for different hashing (nonce) ranges, but a fully concurrent execution would only be allowed if you allocated several scratchpads (one per stream). Considering that the video card memory is a scarce resource this is probably not the best idea. Especially with scrypt-jane coins this is a problem.

Christian
full member
Activity: 182
Merit: 100
So my 780 getting over 5 isn't too bad then

but 6 or 7 would be nicer.

I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by the loop trip count (some have to run for 0 loops, others a couple more up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources.

This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen.
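
Where the "nearly factor 2" comes from can be seen in a toy model. This is a simplified simulation, not the kernel code: it assumes each thread's trip count is uniformly distributed between 0 and lookup_gap - 1, that a warp of 32 threads stays busy until its slowest thread is done, and it ignores the sorting and shuffling overhead entirely. The values lookup_gap = 8 and 4096 threads are made up for the example.

```python
import random

WARP_SIZE = 32

def warp_slots(trip_counts, warp_size=WARP_SIZE):
    """Thread-iterations occupied when every warp runs until its slowest
    thread has finished (a simplified lockstep model)."""
    total = 0
    for i in range(0, len(trip_counts), warp_size):
        warp = trip_counts[i:i + warp_size]
        total += max(warp) * len(warp)
    return total

rng = random.Random(0)   # fixed seed so the run is repeatable
lookup_gap = 8           # hypothetical value
threads = 4096           # hypothetical batch size
trips = [rng.randrange(lookup_gap) for _ in range(threads)]

unsorted_cost = warp_slots(trips)          # threads in arbitrary order
sorted_cost = warp_slots(sorted(trips))    # threads ordered by trip count
useful_work = sum(trips)                   # iterations that are really needed

print(unsorted_cost, sorted_cost, useful_work)
print(unsorted_cost / sorted_cost)
```

Unsorted, almost every warp contains at least one thread with the maximum trip count, so all warps pay for the worst case. Sorted, most warps contain threads with the same trip count and pay close to the average, which is about half the maximum.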

I will save that optimization for February (it would delay this release...)

Christian
I shall happily beta test this when it is out in February :D
hero member
Activity: 756
Merit: 502
I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.

CUDA 5.5 is installed? VS 2012 should be able to upgrade the solution file automatically. But then you will have to make sure that all the other dependencies are available (OpenSSL, pthreads, libcurl....)
full member
Activity: 125
Merit: 100
I've been experimenting with streams on the Y kernel. So far I've tested this on YAC and got 5.3 khash/s on my 660 Ti. Too bad it doesn't validate on the CPU though. The kernel must not be concurrency-safe =).

Yes, right. There is one scratchpad but two streams. The scrypt_core kernels have to be serialized, or they would destroy each other's scratchpad. This is why I am using CUDA events.

Some overlap of memcpy and kernels would be desired (not happening now due to issue order of commands), and possibly the SHA256/Keccak kernels of one stream could be executed concurrently with the scrypt_core kernels of the other stream. This is also not happening now because my CUDA events currently also serialize these (need to change when events are generated and synchronized upon).

I intend to get rid of memcpy altogether by checking hashes on the GPU instead, so the memcpy/kernel overlap issue is moot.

Christian


To be more specific, I was using 4 streams on nv_scrypt_core_kernelA and nv_scrypt_core_kernelB inside the NVKernel::run_kernel.  So those are the kernels I was referring to.  Too bad the code in these kernels looks like witchcraft to me at the moment. LOL