
Topic: CCminer(SP-MOD) Modded NVIDIA Maxwell / Pascal kernels. - page 897. (Read 2347659 times)

full member
Activity: 253
Merit: 100
Just compiled and tested your latest commit 1117, 16/10/15.

Card is a GTX 750 Ti @ 1400 MHz, installed directly in the motherboard.
Algo is Lyra2REv2.
OS is Win7 x64.

I see 5106 kH/s displayed locally on the miner side.

With the commit from 19/09/15, I get 5155 kH/s locally on the miner.



 
legendary
Activity: 1154
Merit: 1001
Does anyone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with an "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x


IIRC, Leo is now at nfactor 16, so:
-a scrypt-jane:16

Edit: And you will probably need to tweak those launch parameters, as is customary across nfactor changes.
sr. member
Activity: 427
Merit: 250
Does anyone have a working scrypt-jane command line? I can get my rig to hash, but all shares are rejected with an "above target" error.

Here's mine for the ACX EVGA 750ti:

Code:
./ccminer -a scrypt-jane:15 -l t5x24 -L 5 -o stratum+tcp://scryptjaneleo.eu.nicehash.com:3348 -u 1HW533b9sZ3sbhZwRKtAtZaHTbRrpcaA7q -p x
sr. member
Activity: 438
Merit: 250
@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.

I haven't produced the Linux values myself, so the drop there could be due to circumstances I'm not aware of. Then again, the values for the TCC driver that @allanmac showed look pretty similar.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).
Sad sad sad...

I got a report from somebody with a 4 GB 750 Ti that the limit there was at 1 GB too. On Win8/10 the limit for Maxwell 1 is 512 MB and for Maxwell 2 it is 1024 MB.
Chunking memory with linear reads and writes works. In my private cryptonight mod it's (10%) faster, and I can use the optimal launch config that only used to work on Linux.

But Windows 7 only.

8.1 and 10 don't work.

ETH uses  a random access pattern so that may explain the difference.
legendary
Activity: 1154
Merit: 1001
@Grim, ah, thanks for that. I had only skimmed through the thread on nvidia and the OP there is a bit misleading. I thought this was only affecting Windows users. I see Linux suffers as well, though not as badly.
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Chunking memory with linear reads and writes works. In my private cryptonight mod it's (10%) faster, and I can use the optimal launch config that only used to work on Linux.

But Windows 7 only.

8.1 and 10 don't work.
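
Roughly, the chunking idea looks like this - a minimal sketch only, not the actual private cryptonight mod; chunk count, thread count and scratchpad size are just assumed for illustration:

Code:
// Split one huge scratchpad allocation into several smaller cudaMalloc chunks.
// Each thread still reads/writes only its own scratchpad, linearly, inside one chunk.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdint.h>

#define NUM_CHUNKS     4                    // assumed: 4 chunks instead of 1 big buffer
#define THREADS_TOTAL  1024                 // assumed work items per pass
#define SCRATCH_BYTES  (2u * 1024 * 1024)   // assumed 2 MB scratchpad per thread

__global__ void touch_scratch(uint8_t *chunk, int threads_per_chunk)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= threads_per_chunk) return;
    uint8_t *pad = chunk + (size_t)tid * SCRATCH_BYTES;
    for (size_t i = 0; i < SCRATCH_BYTES; i += 4096)    // linear writes over the pad
        pad[i] = (uint8_t)i;
}

int main(void)
{
    int threads_per_chunk = THREADS_TOTAL / NUM_CHUNKS;
    size_t chunk_bytes = (size_t)threads_per_chunk * SCRATCH_BYTES;   // 512 MB per chunk here
    uint8_t *chunks[NUM_CHUNKS];

    for (int c = 0; c < NUM_CHUNKS; c++)
        if (cudaMalloc((void **)&chunks[c], chunk_bytes) != cudaSuccess) {
            printf("chunk %d: out of memory\n", c);
            return 1;
        }

    for (int c = 0; c < NUM_CHUNKS; c++)    // one launch per chunk
        touch_scratch<<<(threads_per_chunk + 255) / 256, 256>>>(chunks[c], threads_per_chunk);
    cudaDeviceSynchronize();

    for (int c = 0; c < NUM_CHUNKS; c++)
        cudaFree(chunks[c]);
    return 0;
}

Whether four 512 MB allocations behave better than one 2 GB one is exactly the part that seems to depend on the OS/driver (Windows 7 vs 8.1/10 WDDM).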
sr. member
Activity: 506
Merit: 252
[...] This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?

Linux is degrading as well! Just not as badly as Windows WDDM.
legendary
Activity: 1154
Merit: 1001
[...] This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...

Or ... you can switch to Linux?
hero member
Activity: 588
Merit: 520
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.

Yep, I checked that on my Axiom algo - no difference; it actually got even slower.

It looks like WDDM has a hardcoded value of 2 GB - once you load that much or more, it starts paging memory to host memory, regardless of how much memory the video card actually has. This seems to me like a big issue and makes all high-end NVIDIA cards useless for mining, as you can never exploit their true potential.

Also, since the GTX 750 Ti has this limit at 1 GB, it makes all 2 GB versions of the GTX 750 Ti useless for mining (memory-hard algorithms).

Sad sad sad...
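
If anyone wants to pin down where exactly that cutoff sits on their own card/driver combo, a quick standalone test along these lines should show it (just a sketch; the 256 MB step and 4 GB ceiling are arbitrary):

Code:
// Allocate a progressively larger buffer and time a device-side memset over it.
// If the driver starts paging past some threshold, the effective GB/s drops sharply.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    for (size_t mb = 256; mb <= 4096; mb += 256) {
        size_t bytes = mb * 1024 * 1024;
        void *buf;
        if (cudaMalloc(&buf, bytes) != cudaSuccess) {
            printf("%zu MB: cudaMalloc failed\n", mb);
            break;
        }
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemset(buf, 0, bytes);              // device-side fill over the whole buffer
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%4zu MB: %6.1f ms  (%.1f GB/s)\n", mb, ms, (bytes / 1e9) / (ms / 1e3));
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(buf);
    }
    return 0;
}

Watching GPU-Z's bus interface load while this runs would also show whether the slow region coincides with PCIe traffic, as suspected above.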
sr. member
Activity: 438
Merit: 250
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?

Tried chunks on ethminer kernel. No difference at all.
legendary
Activity: 1764
Merit: 1024
Still testing out different difficulty settings for Myr-Gr for Digibyte and nothing is working out. It would definitely be worth looking into, as it's worth mining more than Quark right now, even without the SP-enhanced version (I'm using Tpruvot's).
hero member
Activity: 588
Merit: 520
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
You could easily test it by running the same thing on the same card but through a 1x riser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.


I did some further analysis and there must be some serious memory syncing or copying happening over the PCIe bus when working with large buffers; check my reply here: https://devtalk.nvidia.com/default/topic/878455/cuda-programming-and-performance/gtx750ti-and-buffers-gt-1gb-on-win7/post/4696955/#4697952

Are you saying that if you allocate smaller chunks (with cudaMalloc), everything is OK then, even if total memory usage is above 2 GB?
hero member
Activity: 588
Merit: 520
What speeds do you get on the GTX 980 Ti and GTX 950 for Lyra2REv2?
I get
GTX 980 Ti ... 17,450 kH/s
GTX 950 ... 5,480 kH/s
Clocks? OS? Build?
Around 1400 MHz on both cards, Windows, latest SP build... but I tweaked some params; originally I was getting around 17,000 kH/s on the 980 Ti and 5,000 kH/s on the 950.

Would you mind sharing your parameters? Smiley

Sure:

Code:
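// Per-card launch tuning for the lyra2v2 kernel: "intensity" is the number of
// work items per kernel launch and "tpb" is threads per block; props is the
// card's cudaDeviceProp, so the strstr() on its name picks settings per model.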
if (strstr(props.name, "970"))
{
    intensity = 256 * 256 * 20;
}
else if (strstr(props.name, "980 Ti"))
{
    intensity = 256 * 256 * 18;
    tpb = 8;
}
else if (strstr(props.name, "980"))
{
    intensity = 256 * 256 * 16;
}
else if (strstr(props.name, "750 Ti"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "750"))
{
    intensity = 256 * 256 * 5;
    tpb = 16;
}
else if (strstr(props.name, "960"))
{
    intensity = 256 * 256 * 6;
}
else if (strstr(props.name, "950"))
{
    intensity = 256 * 256 * 18;
    tpb = 11;
}
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
What speeds do you get on the GTX 980 Ti and GTX 950 for Lyra2REv2?
I get
GTX 980 Ti ... 17,450 kH/s
GTX 950 ... 5,480 kH/s
Clocks? OS? Build?
Around 1400 MHz on both cards, Windows, latest SP build... but I tweaked some params; originally I was getting around 17,000 kH/s on the 980 Ti and 5,000 kH/s on the 950.

Would you mind sharing your parameters? Smiley
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
Lyra2REv2 -X 12 looks good (tested on 750 Ti, GTX 960, GTX 980).

The default values are:

gtx 980 Ti: -X 16 (probably -X 24 or -X 30 is better; -X 30 uses a bit more than 3 GB of memory)
gtx 980: -X 16 (probably -X 24 or -X 30 is better)
gtx 970: -X 20 (I tried different values; -X 20 is just over 2 GB of memory and looks good)
gtx 960: -X 5 (-X 10, -X 11 / -X 12)
gtx 950: -X 8 (-X 10, -X 11 / -X 12)
gtx 750: -X 5 (-X 8)
gtx 750 Ti: -X 5 (-X 10, -X 11 / -X 12)

The default -X values are probably not optimal.
I had some problems with some of my rigs (out of memory with 6 cards and 4 GB of system RAM),
so I lowered the default intensities.
For cards that throttle, use lower values for -X.
hero member
Activity: 677
Merit: 500
Lyra2REv2 -X 12 looks good (tested on 750 Ti, GTX 960, GTX 980).
sp_
legendary
Activity: 2954
Merit: 1087
Team Black developer
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?
you could easily test it by running the same thing on the same card but thru a 1x raiser.

That's why I made the -g switch. You get problems in Windows when allocating big buffers.

Running quark with -g 2 -i 24 uses the same amount of memory as -i 25, but the buffer is split into two intensity-24 blocks. -i 25 will cause an out-of-memory error while -g 2 -i 24 will not. But we had to add some more logic to make the -g switch work (blake-512 rewrite), so it might be slower.
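
As a little standalone illustration of the allocation difference (the 64 bytes of state per thread is just an assumed figure; the real per-algo buffers differ):

Code:
// -i 25 needs one contiguous block; -g 2 -i 24 needs two blocks of half the size,
// which the Windows driver can find easier to place.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    const size_t per_thread = 64;                        // assumed hash-state bytes per thread
    size_t one_big   = ((size_t)1 << 25) * per_thread;   // like -i 25, single buffer (2 GB here)
    size_t two_small = ((size_t)1 << 24) * per_thread;   // like -g 2 -i 24, per buffer (1 GB here)

    void *a = NULL, *b = NULL, *c = NULL;
    int big_ok = (cudaMalloc(&a, one_big) == cudaSuccess);
    printf("-i 25 style, one %zu MB buffer: %s\n", one_big >> 20, big_ok ? "ok" : "out of memory");
    if (big_ok) cudaFree(a);

    int small_ok = (cudaMalloc(&b, two_small) == cudaSuccess) &&
                   (cudaMalloc(&c, two_small) == cudaSuccess);
    printf("-g 2 -i 24 style, two %zu MB buffers: %s\n", two_small >> 20, small_ok ? "ok" : "out of memory");
    return 0;
}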




legendary
Activity: 2716
Merit: 1094
Black Belt Developer
I was experiencing the same kind of issue when I was making the Axiom CUDA algo. On a 980 Ti, which packs 6 GB of memory, whenever I set the algo to use more than about 2.5 GB there was a massive slowdown: bus interface load jumped up and TDP dropped. Since the 980 Ti is my primary GPU, it constantly has a memory load of about 400 MB even when idle - and that would explain why the actual memory cutoff is at around 2.1 GB, the same as the other v2 Maxwell cards.

I don't have an account there to post, but measure the bus interface load during these bottlenecks - maybe it can reveal another hint for tracking this down (I used GPU-Z to measure bus interface load).

Bus interface load is - to my knowledge - how much the PCIe bus gets loaded with data. And my algorithm implementation was sending very little data over this bus - nowhere near enough to load a PCIe 3.0 x16 link so heavily that it would show 30-50% load. I could not explain why the bus load was so high, googling gave no results, and I kind of gave up. But now that you've revealed this slowdown happening with other algorithms and other cards, I suspect these problems are related. My first idea: what if CUDA is automatically syncing GPU and CPU memory - as if some part of GPU memory were set to be kept in sync with CPU memory? That would explain the massive bus load, as my algo was causing a lot of changes in this huge allocated buffer. I believe CUDA even has a name for this - unified memory. And to my knowledge, it is only active when you explicitly enable it. What if it is active even in cases where you don't? Or maybe it's a bug in the CUDA software, sending data over the bus even though there is no need for a synced memory space?

You could easily test it by running the same thing on the same card but through a 1x riser.
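
For the unified-memory suspicion specifically, there is also a direct check (a sketch against the CUDA 7.x attributes API, illustrative only):

Code:
// A plain cudaMalloc allocation should report isManaged == 0. If it ever reports 1
// without cudaMallocManaged being used, the unified-memory theory would have legs.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void *buf;
    if (cudaMalloc(&buf, (size_t)256 << 20) != cudaSuccess)   // 256 MB test buffer
        return 1;

    cudaPointerAttributes attr;
    if (cudaPointerGetAttributes(&attr, buf) == cudaSuccess)
        printf("memoryType=%d isManaged=%d\n", (int)attr.memoryType, attr.isManaged);

    cudaFree(buf);
    return 0;
}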
sr. member
Activity: 457
Merit: 273
Hey, sp_, we've added links to your modded ccminer on our software page: https://www.nicehash.com/index.jsp?p=software#nvidiagpu

Here is a tip for you: https://blockchain.info/tx/e4a5a665975202fd23f867f6f4973cb3ba6ec5d7703a9d3d62a4724b2f2f4598

Keep up the good work!