
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 897. (Read 2347659 times)

full member
Activity: 253
Merit: 100
Just compiled and tested your latest commit 1117, 16/10/15.

Card is a GTX 750 Ti @ 1400 MHz, installed directly in the motherboard.
Algo is Lyra2REv2.
OS is Win7 x64.

I see 5106 kH/s displayed locally on the miner side.

With the commit from 19/09/15, I get 5155 kH/s locally on the miner.



 
legendary
Activity: 1154
Merit: 1001
Does anyone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with an "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x


IIRC, Leo is now at nfactor 16, so:
-a scrypt-jane:16

Edit: And you will probably need to tweak those launch parameters, as is customary across nfactor changes.
sr. member
Activity: 427
Merit: 250
Does anyone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with an "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x
sr. member
Activity: 438
Merit: 250
@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.

I haven't produced the Linux values myself, so the drop there could be due to circumstances I'm not aware of. Then again, the values for the TCC driver that @allanmac showed look pretty similar.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).
Sad sad sad...

I got a report from somebody with a 4 GB 750 Ti that the limit there was at 1 GB too. On Win8/10 the limit for Maxwell 1 is 512 MB and for Maxwell 2 it is 1024 MB.
Chunking memory with linear reads and writes works. In my private cryptonight mod it's (10%) faster, and I can use the optimal launch config that only used to work on Linux.

But Windows 7 only.

8.1 and 10 don't work.

ETH uses  a random access pattern so that may explain the difference.
legendary
Activity: 1154
Merit: 1001
@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Chunking memory with linear reads and writes works. In my private cryptonight mod it's (10%) faster, and I can use the optimal launch config that only used to work on Linux.

But Windows 7 only.

8.1 and 10 don't work.
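
Roughly, the chunking idea looks like this - a minimal sketch only, not the actual private cryptonight mod; chunk count, thread count and scratchpad size are just assumed for illustration:

Code:
// Split one huge scratchpad allocation into several smaller cudaMalloc chunks.
// Each thread still reads/writes only its own scratchpad, linearly, inside one chunk.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdint.h>

#define NUM_CHUNKS     4                    // assumed: 4 chunks instead of 1 big buffer
#define THREADS_TOTAL  1024                 // assumed work items per pass
#define SCRATCH_BYTES  (2u * 1024 * 1024)   // assumed 2 MB scratchpad per thread

__global__ void touch_scratch(uint8_t *chunk, int threads_per_chunk)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= threads_per_chunk) return;
    uint8_t *pad = chunk + (size_t)tid * SCRATCH_BYTES;
    for (size_t i = 0; i < SCRATCH_BYTES; i += 4096)    // linear writes over the pad
        pad[i] = (uint8_t)i;
}

int main(void)
{
    int threads_per_chunk = THREADS_TOTAL / NUM_CHUNKS;
    size_t chunk_bytes = (size_t)threads_per_chunk * SCRATCH_BYTES;   // 512 MB per chunk here
    uint8_t *chunks[NUM_CHUNKS];

    for (int c = 0; c < NUM_CHUNKS; c++)
        if (cudaMalloc((void **)&chunks[c], chunk_bytes) != cudaSuccess) {
            printf("chunk %d: out of memory\n", c);
            return 1;
        }

    for (int c = 0; c < NUM_CHUNKS; c++)    // one launch per chunk
        touch_scratch<<<(threads_per_chunk + 255) / 256, 256>>>(chunks[c], threads_per_chunk);
    cudaDeviceSynchronize();

    for (int c = 0; c < NUM_CHUNKS; c++)
        cudaFree(chunks[c]);
    return 0;
}

Whether four 512 MB allocations behave better than one 2 GB one is exactly the part that seems to depend on the OS/driver (Windows 7 vs 8.1/10 WDDM).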
sr. member
Activity: 506
Merit: 252
[...] This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?

Linux is degrading as well! Just not as badly as Windows WDDM.
legendary
Activity: 1154
Merit: 1001
[...] This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?
hero member
Activity: 588
Merit: 520
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.

Yep, I checked that on my Axiom algo - no difference; it actually got even slower.

It looks like WDDM has a hardcoded value of 2 GB - once you load that much or more, it starts paging memory to host memory, regardless of how much memory the video card actually has. This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...
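
If anyone wants to pin down where exactly that cutoff sits on their own card/driver combo, a quick standalone test along these lines should show it (just a sketch; the 256 MB step and 4 GB ceiling are arbitrary):

Code:
// Allocate a progressively larger buffer and time a device-side memset over it.
// If the driver starts paging past some threshold, the effective GB/s drops sharply.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    for (size_t mb = 256; mb <= 4096; mb += 256) {
        size_t bytes = mb * 1024 * 1024;
        void *buf;
        if (cudaMalloc(&buf, bytes) != cudaSuccess) {
            printf("%zu MB: cudaMalloc failed\n", mb);
            break;
        }
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemset(buf, 0, bytes);              // device-side fill over the whole buffer
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4zu MB: %6.1f ms  (%.1f GB/s)\n", mb, ms, (bytes / 1e9) / (ms / 1e3));
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(buf);
    }
    return 0;
}

Watching GPU-Z's bus interface load while this runs would also show whether the slow region coincides with PCIe traffic, as suspected above.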
sr. member
Activity: 438
Merit: 250
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.
legendary
Activity: 1764
Merit: 1024
Still testing out different difficulty settings for Myr-Gr for Digibyte and nothing is working out. It would definitely be worth looking into, as it's worth mining more than Quark right now, even without the SP-enhanced version (I'm using Tpruvot's).
hero member
Activity: 588
Merit: 520
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?
hero member
Activity: 588
Merit: 520
What speeds do you get on the GTX 980 Ti and GTX 950 for Lyra2REv2?
I get
GTX 980 Ti ... 17,450 kH/s
GTX 950 ... 5,480 kH/s
Clocks? OS? Build?
Around 1400 MHz on both cards, Windows, latest SP build... but I tweaked some params; originally I was getting around 17,000 kH/s on the 980 Ti and 5,000 kH/s on the 950.

Would you mind sharing your parameters? Smiley

Sure:

Code:
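// Per-card launch tuning for the lyra2v2 kernel: "intensity" is the number of
// work items per kernel launch and "tpb" is threads per block; props is the
// card's cudaDeviceProp, so the strstr() on its name picks settings per model.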
if (strstr(props.name, "970"))
{
    intensity = 256 * 256 * 20;
}
else if (strstr(props.name, "980 Ti"))
{
    intensity = 256 * 256 * 18;
    tpb = 8;
}
else if (strstr(props.name, "980"))
{
    intensity = 256 * 256 * 16;
}
else if (strstr(props.name, "750 Ti"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "750"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "960"))
{
    intensity = 256 * 256 * 6;
}
else if (strstr(props.name, "950"))
{
    intensity = 256 * 256 * 18;
    tpb = 11;
}
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
What speeds do you get on the GTX 980 Ti and GTX 950 for Lyra2REv2?
I get
GTX 980 Ti ... 17,450 kH/s
GTX 950 ... 5,480 kH/s
Clocks? OS? Build?
Around 1400 MHz on both cards, Windows, latest SP build... but I tweaked some params; originally I was getting around 17,000 kH/s on the 980 Ti and 5,000 kH/s on the 950.

Would you mind sharing your parameters? Smiley
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Lyra2REv2 -X 12 looks good (tested on 750 Ti, GTX 960, GTX 980).

The default values are:

gtx 980 Ti: -X 16 (probably -X 24 or -X 30 is better; -X 30 uses a bit more than 3 GB of memory)
gtx 980: -X 16 (probably -X 24 or -X 30 is better)
gtx 970: -X 20 (I tried different values; -X 20 is just over 2 GB of memory and looks good)
gtx 960: -X 5 (-X 10, -X 11 / -X 12)
gtx 950: -X 8 (-X 10, -X 11 / -X 12)
gtx 750: -X 5 (-X 8)
gtx 750 Ti: -X 5 (-X 10, -X 11 / -X 12)

The default -X values are probably not optimal.
I had some problems with some of my rigs (out of memory with 6 cards and 4 GB of system RAM),
so I lowered the default intensities.
For cards that throttle, use lower values for -X.
hero member
Activity: 677
Merit: 500
Lyra2REv2 -X 12 looks good (tested on 750 Ti, GTX 960, GTX 980).
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
you could easily test it by running the same thing on the same card but thru a 1x raiser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.
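
As a little standalone illustration of the allocation difference (the 64 bytes of state per thread is just an assumed figure; the real per-algo buffers differ):

Code:
// -i 25 needs one contiguous block; -g 2 -i 24 needs two blocks of half the size,
// which the Windows driver can find easier to place.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t per_thread = 64;                        // assumed hash-state bytes per thread
    size_t one_big   = ((size_t)1 << 25) * per_thread;   // like -i 25, single buffer (2 GB here)
    size_t two_small = ((size_t)1 << 24) * per_thread;   // like -g 2 -i 24, per buffer (1 GB here)

    void *a = NULL, *b = NULL, *c = NULL;
    int big_ok = (cudaMalloc(&a, one_big) == cudaSuccess);
    printf("-i 25 style, one %zu MB buffer: %s\n", one_big >> 20, big_ok ? "ok" : "out of memory");
    if (big_ok) cudaFree(a);

    int small_ok = (cudaMalloc(&b, two_small) == cudaSuccess) &&
                   (cudaMalloc(&c, two_small) == cudaSuccess);
    printf("-g 2 -i 24 style, two %zu MB buffers: %s\n", two_small >> 20, small_ok ? "ok" : "out of memory");
    return 0;
}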




legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?

You could easily test it by running the same thing on the same card but through a 1x riser.
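
For the unified-memory suspicion specifically, there is also a direct check (a sketch against the CUDA 7.x attributes API, illustrative only):

Code:
// A plain cudaMalloc allocation should report isManaged == 0. If it ever reports 1
// without cudaMallocManaged being used, the unified-memory theory would have legs.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *buf;
    if (cudaMalloc(&buf, (size_t)256 << 20) != cudaSuccess)   // 256 MB test buffer
        return 1;

    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess)
        printf("memoryType=%d isManaged=%d\n", (int)attr.memoryType, attr.isManaged);

    cudaFree(buf);
    return 0;
}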
sr. member
Activity: 457
Merit: 273
Hey, sp_, we've added links to your modded ccminer on our software page: https://www.nicehash.com/index.jsp?p=software#nvidiagpu

Here is a tip for you: https://blockchain.info/tx/e4a5a665975202fd23f867f6f4973cb3ba6ec5d7703a9d3d62a4724b2f2f4598

Keep up the good work!