[ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 1119.

termhn

full member

Activity: 126

Merit: 100

Check the end of the first post here for a brief explanation of -expiry:
https://bitcointalk.org/index.php?topic=178286.0;topicseen

cbuchner1

hero member

Activity: 756

Merit: 502

I have released an April-17th version. Source code on github will be updated tomorrow.

It reduces the CPU usage greatly when run in interactive mode on Windows (in fact I think Windows is not measuring it correctly now, because it seems to hover around 0-1% for me). But it might be hashing a bit slower than before when running interactively. ATTENTION: I just remembered that I default to interactive mode now when not otherwise specified. This is done for all CUDA devices that have the watchdog timer active (i.e. they are driving a display and no registry fix was put in place to disable the watchdog for that card).

The texture cache feature now works but is detrimental to performance on all but fast Kepler cards, where it makes things neither better nor worse. Note that the -C option now takes "1" or "2" as input, corresponding to 1D or 2D texture layouts. Consider this still experimental.

On Titan I have split the CUDA kernel into two parts labelled "A" and "B", as was already done for the non-Titan kernels. The "B" part should now automatically read through the 48 KB/SMX texture caches by use of const __restrict__ pointers (this nice feature is Titan and Tesla K20 specific). No need to use the -C option there. Let me know if the Titan still works, and if it's running any slower or faster.

CTRL-C works better than before. Aborts autotuning, and hitting it a second time will ALWAYS abort the tool.

Bye for now! I still take donations! ;)

I will be experimenting with LOOKUP_GAP next, like it is known from OpenCL miners.

Christian

FalconFour

full member

Activity: 176

Merit: 100

Quote from: cbuchner1 on April 17, 2013, 10:01:49 AM

Quote from: MiWBitCoin on April 17, 2013, 09:42:46 AM

Lets play spot the difference:
CUDAminer 2013-04-13: [2013-04-18 00:09:23] accepted: 284/287 (98.95%), 162.91 khash/s (yay!!!)

yay, and it's good hash, too. *puff*, *puff*, *smoke*

No camping, muffukka! Pass that shit :3 hehe

Quote from: SubNoize on April 17, 2013, 06:01:11 AM

Quote from: FalconFour on April 17, 2013, 04:06:04 AM

Quote from: cbuchner1 on April 17, 2013, 02:42:12 AM

Beats my GTX 260 - maybe I should have turned off Aero too ;)

Aero is generally a good thing that actually improves system performance - and I generally recommend everyone leave it on/turn it back on if disabled (even for Bitcoin mining)... but here, the way the program works probably nets a performance gain by cutting any excess GPU activity. Weird, but it could be alpha blues ;)

Can you expand on that a little, please? I would have assumed that by disabling it you're freeing up more GPU power for other tasks, e.g. mining?

Aero doesn't really use much GPU power at all. Because the new display architecture of WinNT 6.x (Vista/7/8) is built around Aero at the core, non-Aero (with "desktop composition" disabled) actually runs in an emulation mode. Lots of time is spent slowly re-drawing bitmaps on the screen, from what I've seen. Windows XP used a hardware-accelerated windows manager (GDI) that communicated window draw/move/animation commands to the video card. With the new architecture, it's built around being accelerated by the GPU. So when you disable Aero, Windows now emulates everything - taking up *more* CPU and GPU time than if you had Aero enabled so the GPU could idly manage the native mode of Windows' desktop window manager.

That's why if a computer can't handle Aero, I put XP on it (for a refurbished system, that is). If it CAN handle Aero, I put 7 on it and make sure Aero works. If I ever see a computer come in the shop with Aero disabled (classic theme, etc.), I enable Aero and it significantly improves the responsiveness of the system. As for Bitcoin mining, I've tried with and without Aero - I get a lower hashrate without Aero. That's all the proof I needed. :P

termhn

full member

Activity: 126

Merit: 100

Quote from: cbuchner1 on April 17, 2013, 10:00:56 AM

Quote from: termhn on April 17, 2013, 09:53:12 AM

The thing on my wishlist is an option like cgminer's -expiry 1 or -E 1 for coins that are getting new blocks reallllly fast.

Can you explain how that option works, or point me to a README with an explanation? What other coins would you be mining with the scrypt algorithm?

Sure I can in a few minutes. I am using it with feathercoin.

cbuchner1

hero member

Activity: 756

Merit: 502

Quote from: mcarturr on April 17, 2013, 12:08:52 PM

I got 150 khash/s with a GTX 670.

I've quickly tested the 1D and 2D texture cache with a GTX 660Ti. The achieved values during hashing remain in the range of 153-155 kHash. In other words: almost identical to operation without the cache, which is 154 kHash. All generated shares are valid.

So while the cache feature is not immediately useful, it may become useful as soon as we start shrinking the scrypt scratchpad.

Here is what a LOOKUP_GAP implementation does: it only saves every N'th value out of 1024 writes to the scratchpad, thereby reducing the bandwidth needed for writes by a factor of N. One "value" here is a vector of 32 uints (128 bytes in total). 1024 * 128 bytes is your typical 128 kbytes scrypt scratchpad. Here's some actual C code for the programmers among you.

Code:

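#include <stdint.h>
#include <string.h>

/* Salsa20/8 mixing step on 16-word blocks; defined elsewhere. */
void xor_salsa8(uint32_t B[16], const uint32_t Bx[16]);

/* V is the scratchpad: 1024 entries of 32 uint32_t each (128 kbytes). */
void scrypt_core(const uint32_t input[32], uint32_t output[32], uint32_t *V)
{
    uint32_t X[32];
    int i, j, k;

    for (k = 0; k < 32; k++) X[k] = input[k];

    // write phase to scratchpad: store the state, then advance it
    for (i = 0; i < 1024; i++) {
        memcpy(&V[i * 32], X, 128);
        xor_salsa8(&X[0], &X[16]);
        xor_salsa8(&X[16], &X[0]);
    }

    // read phase from scratchpad: the state itself picks the entry to read
    for (i = 0; i < 1024; i++) {
        j = 32 * (X[16] & 1023);
        for (k = 0; k < 32; k++) X[k] ^= V[j + k];
        xor_salsa8(&X[0], &X[16]);
        xor_salsa8(&X[16], &X[0]);
    }

    for (k = 0; k < 32; k++) output[k] = X[k];
}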

Unfortunately we still have to do 1024 reads from the scratchpad and perform an increased amount of computation to re-synthesize the values that we omitted during writing. For a LOOKUP_GAP of 4 we would (on average) have to repeat the xor_salsa8 pair 1.5 extra times per lookup (between 0 and 3 times, depending on where the index falls). But here's where the cache may come in handy. Say you have reduced the total number of scratchpad values from 1024 to 256 using a LOOKUP_GAP of 4. That may increase your cache hit ratio because you are now going to read each element 4 times on average, therefore boosting your effective memory bandwidth for the reads. Unfortunately the lookups are done in a completely random order, which causes a lot of cache lines to get replaced quickly.
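
For the programmers again: here is a minimal sketch of how a LOOKUP_GAP variant of the write and read phases could look, next to the plain version for comparison. It is an illustration under assumptions, not cudaMiner's actual kernel. The names scrypt_core_full, scrypt_core_gap and mix are made up for this example, and xor_salsa8 is written out as a plain Salsa20/8 step only so that the sketch compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROTL(a, b) (((a) << (b)) | ((a) >> (32 - (b))))
#define QR(a, b, c, d) \
    (b ^= ROTL(a + d, 7), c ^= ROTL(b + a, 9), \
     d ^= ROTL(c + b, 13), a ^= ROTL(d + c, 18))

/* B = Salsa20/8(B ^ Bx) + (B ^ Bx), on 16-word blocks. */
static void xor_salsa8(uint32_t B[16], const uint32_t Bx[16])
{
    uint32_t x[16];
    int i;

    for (i = 0; i < 16; i++) x[i] = (B[i] ^= Bx[i]);
    for (i = 0; i < 8; i += 2) {
        /* column round */
        QR(x[0], x[4], x[8], x[12]);  QR(x[5], x[9], x[13], x[1]);
        QR(x[10], x[14], x[2], x[6]); QR(x[15], x[3], x[7], x[11]);
        /* row round */
        QR(x[0], x[1], x[2], x[3]);   QR(x[5], x[6], x[7], x[4]);
        QR(x[10], x[11], x[8], x[9]); QR(x[15], x[12], x[13], x[14]);
    }
    for (i = 0; i < 16; i++) B[i] += x[i];
}

/* One mixing step: the xor_salsa8 pair used in both phases. */
static void mix(uint32_t X[32])
{
    xor_salsa8(&X[0], &X[16]);
    xor_salsa8(&X[16], &X[0]);
}

/* Baseline: V must hold 1024 * 32 words. */
void scrypt_core_full(const uint32_t in[32], uint32_t out[32], uint32_t *V)
{
    uint32_t X[32];
    int i, k;

    memcpy(X, in, sizeof X);
    for (i = 0; i < 1024; i++) { memcpy(&V[i * 32], X, 128); mix(X); }
    for (i = 0; i < 1024; i++) {
        uint32_t j = X[16] & 1023;
        for (k = 0; k < 32; k++) X[k] ^= V[j * 32 + k];
        mix(X);
    }
    memcpy(out, X, sizeof X);
}

/* LOOKUP_GAP variant: V only needs (1024 / gap) * 32 words.
 * Returns the number of extra mix() calls spent on recomputation. */
long scrypt_core_gap(const uint32_t in[32], uint32_t out[32],
                     uint32_t *V, int gap)
{
    uint32_t X[32], T[32];
    long extra = 0;
    int i, k;

    assert(gap > 0 && 1024 % gap == 0);
    memcpy(X, in, sizeof X);
    for (i = 0; i < 1024; i++) {
        /* keep only every gap'th state */
        if (i % gap == 0) memcpy(&V[(i / gap) * 32], X, 128);
        mix(X);
    }
    for (i = 0; i < 1024; i++) {
        uint32_t j = X[16] & 1023;
        uint32_t r;

        /* fetch the nearest stored state at or below j ... */
        memcpy(T, &V[(j / gap) * 32], 128);
        /* ... and replay the skipped mixing steps to rebuild entry j */
        for (r = j % gap; r > 0; r--) { mix(T); extra++; }
        for (k = 0; k < 32; k++) X[k] ^= T[k];
        mix(X);
    }
    memcpy(out, X, sizeof X);
    return extra;
}
```

Given the same input, both functions should produce the same output. The gap variant trades a scratchpad of 1024/gap entries for extra computation: with evenly distributed indices it replays roughly (gap - 1) / 2 mixing steps per lookup, each of which is two xor_salsa8 calls.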

The problem is that the numbers quoted above are per hash, and your typical GPU does a few dozen to a few hundred hashes in parallel on its various multiprocessors. Pre-Kepler, each multiprocessor only has 6-8 KB of texture cache. On Kepler, one SMX has 48 KB of texture cache. How are the numbers going to work out? I don't really know yet. The cache seems awfully small for what it needs to cover.
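
To put rough numbers on that, here is a small back-of-the-envelope sketch. The helper names scratchpad_bytes and cache_coverage are invented for this example, and the number of hashes running concurrently on one multiprocessor is an input you have to guess at; nothing here is measured.

```c
#include <assert.h>
#include <stddef.h>

/* Scratchpad bytes per hash: 1024 / gap entries of 128 bytes each. */
size_t scratchpad_bytes(unsigned gap)
{
    assert(gap > 0 && 1024 % gap == 0);
    return (size_t)(1024 / gap) * 128;
}

/* Ratio of cache size to the combined scratchpads of `hashes` hashes
 * sharing that cache. 1.0 or more means everything would fit. */
double cache_coverage(size_t cache_bytes, unsigned hashes, unsigned gap)
{
    assert(hashes > 0);
    return (double)cache_bytes
         / ((double)hashes * (double)scratchpad_bytes(gap));
}
```

For example, cache_coverage(48 * 1024, 1, 4) gives 1.5, so a single hash with a LOOKUP_GAP of 4 would fit into a Kepler SMX's texture cache. With an assumed 32 hashes sharing the same SMX, cache_coverage(48 * 1024, 32, 4) drops to under 5%, which is why the cache looks so small.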

Christian

mcarturr

newbie

Activity: 38

Merit: 0

I got 150 khash/s with a GTX 670.

cbuchner1

hero member

Activity: 756

Merit: 502

Quote from: MiWBitCoin on April 17, 2013, 09:42:46 AM

Lets play spot the difference:
CUDAminer 2013-04-13: [2013-04-18 00:09:23] accepted: 284/287 (98.95%), 162.91 khash/s (yay!!!)

yay, and it's good hash, too. *puff*, *puff*, *smoke*

cbuchner1

hero member

Activity: 756

Merit: 502

Quote from: termhn on April 17, 2013, 09:53:12 AM

The thing on my wishlist is an option like cgminer's -expiry 1 or -E 1 for coins that are getting new blocks reallllly fast.

Can you explain how that option works, or point me to a README with an explanation? What other coins would you be mining with the scrypt algorithm?

termhn

full member

Activity: 126

Merit: 100

The thing on my wishlist is an option like cgminer's -expiry 1 or -E 1 for coins that are getting new blocks reallllly fast.

MiWBitCoin

newbie

Activity: 12

Merit: 0

Lets play spot the difference:

cgminer 2.11.3
GPU 0: | 88.38K/88.28Kh/s | A:9 R:0 HW:0 U:16.03/m I:12

CUDAminer 2013-04-13
[2013-04-18 00:09:04] accepted: 277/280 (98.93%), 164.48 khash/s (yay!!!)
[2013-04-18 00:09:12] GPU #0: GeForce GTX 680, 1284224 hashes, 167.86 khash/s
[2013-04-18 00:09:23] accepted: 284/287 (98.95%), 162.91 khash/s (yay!!!)

That is what we call a significant improvement in hashing power.
Thank you very much Christian, this is very impressive.

termhn

full member

Activity: 126

Merit: 100

Quote from: SubNoize on April 17, 2013, 06:01:11 AM

Quote from: FalconFour on April 17, 2013, 04:06:04 AM

Quote from: cbuchner1 on April 17, 2013, 02:42:12 AM

Beats my GTX 260 - maybe I should have turned off Aero too ;)

Aero is generally a good thing that actually improves system performance - and I generally recommend everyone leave it on/turn it back on if disabled (even for Bitcoin mining)... but here, the way the program works probably nets a performance gain by cutting any excess GPU activity. Weird, but it could be alpha blues ;)

Can you expand on that a little, please? I would have assumed that by disabling it you're freeing up more GPU power for other tasks, e.g. mining?

When it's off that work gets put back on the CPU.

theowalpott

member

Activity: 80

Merit: 10

Getting about 48 khash/s out of a GeForce GT 240, although CPU usage seems pretty high: ~300% (4 CPUs total).

Thanks for your efforts - shall keep an eye on it :D

SubNoize

newbie

Activity: 47

Merit: 0

Quote from: FalconFour on April 17, 2013, 04:06:04 AM

Quote from: cbuchner1 on April 17, 2013, 02:42:12 AM

Beats my GTX 260 - maybe I should have turned off Aero too ;)

Aero is generally a good thing that actually improves system performance - and I generally recommend everyone leave it on/turn it back on if disabled (even for Bitcoin mining)... but here, the way the program works probably nets a performance gain by cutting any excess GPU activity. Weird, but it could be alpha blues ;)

Can you expand on that a little, please? I would have assumed that by disabling it you're freeing up more GPU power for other tasks, e.g. mining?

FalconFour

full member

Activity: 176

Merit: 100

Quote from: cbuchner1 on April 17, 2013, 02:42:12 AM

Beats my GTX 260 - maybe I should have turned off Aero too ;)

Aero is generally a good thing that actually improves system performance - and I generally recommend everyone leave it on/turn it back on if disabled (even for Bitcoin mining)... but here, the way the program works probably nets a performance gain by cutting any excess GPU activity. Weird, but it could be alpha blues ;)

FalconFour

full member

Activity: 176

Merit: 100

Quote from: rimasb on April 17, 2013, 02:03:54 AM

Quote from: cbuchner1 on April 16, 2013, 06:42:01 PM

Quote from: FalconFour on April 16, 2013, 06:34:27 PM

From 8khash to 28khash. Interesting. Now, wonder what the 8800GTX will do...

That's more like what I would have expected from these cards.

My 9800 GT looks better:

CudaMiner 13/04
GeForce 9800 GT OC
Core Clock 780
Memory Clock 1008
Windows XP 32bit
314.22 Driver
-l 14x4

I'm totally copying off your page tomorrow morning at the shop. Many thanks. <3 I didn't know XP could GPU-mine at all. Thought that was a new WinNT 6.x architecture thing...

cbuchner1

hero member

Activity: 756

Merit: 502

Beats my GTX 260 - maybe I should have turned off Aero too ;)

rimasb

newbie

Activity: 43

Merit: 0

Quote from: cbuchner1 on April 16, 2013, 06:42:01 PM

Quote from: FalconFour on April 16, 2013, 06:34:27 PM

From 8khash to 28khash. Interesting. Now, wonder what the 8800GTX will do...

That's more like what I would have expected from these cards.

My 9800 GT looks better:

CudaMiner 13/04
GeForce 9800 GT OC
Core Clock 780
Memory Clock 1008
Windows XP 32bit
314.22 Driver
-l 14x4

FalconFour

full member

Activity: 176

Merit: 100

Might have the 8800gtx's crash figured out:

Disable Desktop Window Manager, start cudaMiner -> CRASH
Enable DWM, start cudaMiner -> Runs (~30khash/sec = shit, as I've seen 40+ from this thing)
Enable DWM, start cudaMiner, disable DWM just as the autotune starts -> Runs - autotune figures improve as I disable DWM, but still finds 41.03khash/sec with 30x2 and starts mining at 27.31khash/sec.

edit: Yeah, autotune is definitely buggy. No autotune (plug my own numbers in) -> works fine without DWM.

In fact, I plugged "32x2" in just on a whim, and it initially said it was getting 9.38khash/sec with just 4096 hashes. Then, moments later, it started accepting results - and it was pumping 31.06! This definitely smells like an autotune bug to me...

Maybe you could autotune while the miner runs. For example, each time a result is accepted, adjust hashes/warps (not sure of the implications of each) up/down by small or large jumps - then save the best configuration once you find a good hot-spot that "N" adjustments haven't been able to beat - say after 100 adjustments ("14x2", "15x2", "8x2", "16x3", etc), it hasn't improved performance over "16x2" or whatever it's found best. Then save it as a .conf file and read that on startup.
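
Something like this is what I have in mind, as a rough C sketch. None of this exists in cudaMiner: struct tuner, tuner_next, tuner_report and tuner_format are names invented for the illustration, and the neighbour pattern in tuner_next is just one arbitrary way to pick the next candidate. Measuring the hashrate and writing the .conf file are left out.

```c
#include <assert.h>
#include <stdio.h>

/* A launch configuration such as "16x2". */
struct config { int blocks, warps; };

struct tuner {
    struct config best;   /* best configuration seen so far */
    double best_khash;    /* its measured hashrate */
    int stale;            /* candidates in a row that failed to beat it */
    int patience;         /* give up after this many stale candidates */
};

void tuner_init(struct tuner *t, struct config start, double khash,
                int patience)
{
    assert(patience > 0);
    t->best = start;
    t->best_khash = khash;
    t->stale = 0;
    t->patience = patience;
}

/* Propose the next candidate: a small step around the current best,
 * cycling through six neighbours as `step` counts up. */
struct config tuner_next(const struct tuner *t, unsigned step)
{
    static const int db[] = { +1, -1, +2, -2, 0, 0 };
    static const int dw[] = { 0, 0, 0, 0, +1, -1 };
    struct config c = t->best;

    c.blocks += db[step % 6];
    c.warps += dw[step % 6];
    if (c.blocks < 1) c.blocks = 1;
    if (c.warps < 1) c.warps = 1;
    return c;
}

/* Feed back the hashrate measured for `tried`.
 * Returns 1 once `patience` candidates in a row failed to improve. */
int tuner_report(struct tuner *t, struct config tried, double khash)
{
    if (khash > t->best_khash) {
        t->best = tried;
        t->best_khash = khash;
        t->stale = 0;
    } else {
        t->stale++;
    }
    return t->stale >= t->patience;
}

/* Format the best configuration as "BLOCKSxWARPS", e.g. for saving. */
int tuner_format(const struct tuner *t, char *buf, size_t len)
{
    return snprintf(buf, len, "%dx%d", t->best.blocks, t->best.warps);
}
```

The miner would call tuner_next to pick a configuration, mine with it until the next accepted result, pass the measured rate to tuner_report, and stop adjusting once that returns 1. The string from tuner_format is what would go into the saved file.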

Edvin512

full member

Activity: 167

Merit: 100

GTX 460 815/2000 - 28x4 - 116 khash/s with 14.4 release

FalconFour

full member

Activity: 176

Merit: 100

Quote from: cbuchner1 on April 16, 2013, 06:42:01 PM

Quote from: FalconFour on April 16, 2013, 06:34:27 PM

From 8khash to 28khash. Interesting. Now, wonder what the 8800GTX will do...

That's more like what I would have expected from these cards.

Strangely, when I enable texture caching the determined performance during autotune is about 10-25% higher than without cache. But the achieved performance during mining is about 30% less than without cache. So why does the performance advantage turn into a disadvantage? This discrepancy needs to be understood before I can put out another version. I've even tried to completely randomize the input data during autotune - but no change. I really want to get that measured gain into the actual mining. I hope it's not just an illusion.

This could be related to the short auto-tune duration problem. Maybe the texture cache benefits for a very short time but starts deteriorating slowly (on the order of whole seconds, not milliseconds). Maybe try a narrow set of autotune parameters (it's unlikely that a card would ever see any autotune benefit in the sequential range from 20...100 iterations) and run longer tuning for each combination? Basically, not every cell in the autotune matrix needs to be checked, I think.

Also getting super-erratic behavior right now after updating drivers on the 8800GTX. It launches and identifies "compute capability 1.0", but it cranks out all zeroes on the autotune then crashes (hard). That's with the latest driver I just installed, 314.22. :/

Topic: [ANN] cudaMiner & ccMiner CUDA based mining applications [Windows/Linux/MacOSX] - page 1119. (Read 3426996 times)