
Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 973. (Read 3426921 times)

hero member
Activity: 756
Merit: 502
cbuchner1, did you note my earlier post about autotune problems and K kernel performance regression?

okay, I have just replaced the ailing PSU in my main development PC, which allows me to put more stress on the GPUs again without it turning off unexpectedly.

So that regression really is bad: 254 kHash/s down to 204 kHash/s with the same kernel launch parameters between 2013-12-18 and the current github.
That's a 20% drop in performance. I might play around a bit to see what I can find.

I did not find the same problem with the T kernel, even though it underwent very similar changes!

EDIT1: the majority of the discrepancy stems from my redefinition of what "warp" means in Dave's Kepler kernel (to bring it in line with the CUDA definition of a warp). Hence the equivalent launch config for the current github release has to use four times the number of blocks to be comparable, so I have to go from -l K7x32 to -l K28x32. Then I end up with a drop from 254 kHash/s to only 220 kHash/s. Still bad, but not quite that much.

EDIT2: I find my "simplifications" in read_keys_direct and write_keys_direct to be the culprit. Turns out they have a huge performance impact, despite requiring far fewer instructions.

Christian

sr. member
Activity: 350
Merit: 250
Repeating what someone else said lmao. Sarcasm not needed
full member
Activity: 182
Merit: 100
I believe it's to do with the memory bus speed limiting the amount of memory it can use
You have any idea how logical that sounds?
/sarcasm off
Probably has to do with the whole 32/64-bit thing; running an x64 build doesn't necessarily fix that either.
sr. member
Activity: 350
Merit: 250
I believe it's to do with the memory bus speed limiting the amount of memory it can use
legendary
Activity: 2002
Merit: 1051
ICO? Not even once.
What's the reason behind failing to allocate more than 3 GB of VRAM on Titans?
It seems that wherever you look (games, applications, whatever), there are always problems on that front.
sr. member
Activity: 350
Merit: 250
Even getting it to use more memory should give me an increase. I say "should".
hero member
Activity: 756
Merit: 502
No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store, it should jump up a lot more.

make that a "might" jump up a lot more.

I've had my fair share of optimization failures...
sr. member
Activity: 350
Merit: 250
No problem.
I am hoping to find a way to allocate some more memory and GPU power. And with the improvement Christian has in store, it should jump up a lot more.
13G
newbie
Activity: 17
Merit: 0
Results from my latest build. Here is the launch config I stuck with, and the results for -L2 to -L6:

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4 I have noticed that it has now settled down at 4.96-5.05 after being on for 4 hours, and I still have a lot of system headroom left, so I'm wondering how much it is actually using.

Memory-wise it is using 2412 MiB / 3071 MiB.
Not sure on actual GPU usage; I wonder if I could get more memory usage than I do now, which may get me more out of it.


Great improvement! Thank you!
GTX TITAN with "-a scrypt-jane -d 0 -i 0 -H 2 -C 0 -m 0 -b 32768 -L 5 -l T69x4 -s 120" now does 4.7 kHash/s!
full member
Activity: 125
Merit: 100
But somehow I do not see the advantage of this: typically the launch configuration is chosen such that a single stream already fully loads the GPU's multiprocessors.

If this were true, my stream test would have given me a lower hash rate due to overhead, but it doubled instead. Just because MSI Afterburner or another program reports 90-something % GPU usage doesn't mean streams won't help. When I ran the two kernels, I ran all the A's first, synced, then the B's. So as long as the batch size for the N-factor is small enough to spawn enough kernels, they'll run concurrently. Again, this is another optimization that won't work so well for lower N-factors. But I was actually seeing 99-100% GPU usage along with the doubled hash rate. I had the same thought as you before I discovered streams, but the NVIDIA Visual Profiler and the sample code convinced me otherwise. Another raster compression routine of mine gave a 40% increase when streamed, when I thought it was already maxed out.

I'm just talking this out here. I know that in the current state the results won't be valid, due to the kernels tripping all over each other's memory space. I just wanted to see the potential.

I'm going to take your suggestion, since you know the code, and see if I can understand it well enough to break it up into 4 regions.  So far, I can tell this will be a steep learning curve.

Thanks

Joe
sr. member
Activity: 350
Merit: 250
Results from my latest build. Here is the launch config I stuck with, and the results for -L2 to -L6:

./cudaminer -a scrypt-jane -H 0 -i 0 -d 0 -l T138x2 -o http://127.0.0.1:3339 -u user -p pass -D -L4

-L2 T68x2 - 4.30-4.46
-L3 T68x3 - 4.61-4.87
-L4 T138x2 - 4.89-5.29 - avg. 4.98
-L5 T69x4 - 4.90-5.3 - avg. 5.05
-L6 T108x4 - 4.64-5.2 - avg. 4.65

Running with -L4 I have noticed that it has now settled down at 4.96-5.05 after being on for 4 hours, and I still have a lot of system headroom left, so I'm wondering how much it is actually using.

Memory-wise it is using 2412 MiB / 3071 MiB.
Not sure on actual GPU usage; I wonder if I could get more memory usage than I do now, which may get me more out of it.
newbie
Activity: 12
Merit: 0
Where can I find the 2014-01-17 version?
hero member
Activity: 756
Merit: 502
Thanks, has the -b parameter an influence when running it -i 0?
In any case, it didn't change the speed visibly.

-b still has an influence, as in reducing the overhead of CUDA kernel calls. Bigger chunks of data to work with mean less overhead.
member
Activity: 106
Merit: 10
Thanks, has the -b parameter an influence when running it -i 0?
In any case, it didn't change the speed visibly.
hero member
Activity: 756
Merit: 502
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:

I wish I knew what is causing these... try passing -b 8192 for a bit more speed.
member
Activity: 106
Merit: 10
Getting a stable 1.35 - 1.45 kh/s on my GT-640 with default clocks and mining YaCoin.
There are a few "does not validate on CPU" results though, and they also don't seem to count:
Code:
[2014-01-29 15:11:26] GPU #1: GeForce GT 640 result does not validate on CPU (i=23, s=0)!
[2014-01-29 15:11:27] GPU #1: GeForce GT 640, 1.45 khash/s
[2014-01-29 15:11:27] accepted: 51/51 (100.00%), 1.45 khash/s (yay!!!)

Otherwise it's running smoothly on Windows. The build is from today.
Start Parameters:
cudaminer.exe -a scrypt-jane -i 0 -l K27x2 -o http://yac.coinmine.pl:8882 -O ...:... -H 2 -d 1
hero member
Activity: 756
Merit: 502

To be more specific, I was using 4 streams on nv_scrypt_core_kernelA and nv_scrypt_core_kernelB inside NVKernel::run_kernel, so those are the kernels I was referring to. Too bad the code in these kernels looks like witchcraft to me at the moment. LOL

What you have to know is that the kernels named "A" write to the scratchpad (yes, the ENTIRE scratchpad) and the kernels labeled "B" read from random positions in it. So there is an A->B dependency: first A has to complete before B can run.

If you still want to try running multiple streams, divide the hashing (nonce) range into 4 equally sized regions.

then you can run

A -> B region 1
A -> B region 2
A -> B region 3
A -> B region 4

all simultaneously on 4 streams, as their scratchpad areas do not overlap. But somehow I do not see the advantage of this: typically the launch configuration is chosen such that a single stream already fully loads the GPU's multiprocessors.


I currently use two streams for different hashing (nonce) ranges, but fully concurrent execution would only be possible if you allocated several scratchpads (one per stream). Considering that video card memory is a scarce resource, this is probably not the best idea; especially with scrypt-jane coins this is a problem.

Christian
full member
Activity: 182
Merit: 100
So my 780 getting over 5 isnt too bad then

but 6 or 7 would be nicer.

I have one optimization in mind that swaps the state of threads within the lookup_gap loop. The intention is to order threads by their loop trip count (some have to run for 0 loops, others a couple more, up to the specified lookup_gap). By ordering them, some of the warps will terminate much earlier and not consume any computational resources.

This would (in theory) reduce the workload by nearly a factor of 2, but it introduces some overhead for sorting the threads and for shuffling the state around. Whether a net speed gain remains is yet to be seen.

I will save that optimization for February (it would delay this release...)

Christian
I shall happily beta test this when it is out in February  Grin
hero member
Activity: 756
Merit: 502
I have Visual Studio 2012, but I can't load the solution file. So I probably have to wait a few more days.

CUDA 5.5 is installed? VS 2012 should be able to upgrade the solution file automatically. But then you will have to make sure that all the other dependencies are available (OpenSSL, pthreads, libcurl...).
full member
Activity: 125
Merit: 100
I've been experimenting with streams on the Y kernel. So far I've tested this on YAC and got 5.3 khash/s on my 660 Ti. Too bad it doesn't validate on the CPU though. The kernel must not be concurrency-safe, =).

Yes, right: there is one scratchpad but two streams. The scrypt_core kernels have to be serialized, or they would destroy each other's scratchpad. This is why I am using CUDA events.

Some overlap of memcpy and kernels would be desirable (it is not happening now due to the issue order of commands), and possibly the SHA256/Keccak kernels of one stream could be executed concurrently with the scrypt_core kernels of the other stream. This is also not happening now, because my CUDA events currently serialize these as well (I need to change when events are recorded and synchronized upon).

I intend to get rid of the memcpy altogether by checking hashes on the GPU instead, so the memcpy/kernel overlap issue is moot.

Christian


To be more specific, I was using 4 streams on nv_scrypt_core_kernelA and nv_scrypt_core_kernelB inside NVKernel::run_kernel, so those are the kernels I was referring to. Too bad the code in these kernels looks like witchcraft to me at the moment. LOL