Topic: Nvidia GPU Mining Problems - page 4. (Read 7002 times)

legendary
Activity: 3164
Merit: 1003
July 28, 2016, 11:37:40 AM
#68
Yes, I did have that issue... I can't add another card (a 970) without SLI-ing the 2 PSUs, 1300 watts each.
I did try taking one 980 Ti off that rail and adding it to the 2nd PSU along with the 970.
It crashed... but I should try the new 970 all by itself. That may be a bad card, not sure. I did an RMA on it and it also crashed.
I think the main problem is that the ASRock btc 81 pro can't handle that many high-end cards running Windows 8.1.
And the second biggest thing was the temp-related spin-down time when changing algos, which is fixed, I think.
member
Activity: 70
Merit: 10
July 28, 2016, 11:34:28 AM
#67
I hope you are figuring your power availability per rig @ 120-160% expected draw? If I ran a 4 card machine I would be for sure running a 1600w power supply...

I wonder if you are drawing too much +12V off the same rail that supplies the processor and are causing this all to happen.

I have seen many strange configurations once opening up power supplies and seeing what is tapping which available rail. Many PC power supplies have 3-4 independent +12V power supply circuits at roughly 50A each... give or take... As far as knowing how they are distributed... you have to open the power supply often to know the real truth...

I don't really agree with the above (per se):

1) I agree with giving about 20% headroom, but a 1600W PSU for 4 cards isn't necessary (unless they are 290/390 cards or some other variant that would be drawing 300W per card). With the RX 480 or GTX 1070 you could run 6 cards on a 1200-1300W PSU just fine.

2) Most quality power supplies have a single 12V rail, and the ones with multiple rails are normally more like 2-3 rails at 30A each (3 x 30A x 12V = 1080W). Your suggested 3 x 50A x 12V PSU would be an 1800W+ beast.

3) You DON'T need to (or want to) open up your PSU and start poking around. You'll void the warranty, risk damage, and waste your time. Any half-decent power supply will have the power rating and rail ratings marked on it and also on its packaging. If not, use Google.

Pretty much any PSU that is gold-rated and costs >$100 should have a single 12V rail that's rated at about 95% of the actual PSU specification.

For example, the Corsair AX1200 has 1202W on a single 12V rail: http://www.corsair.com/en/professional-series-gold-ax1200-80-plus-gold-certified-fully-modular-power-supply  (click on the technical specs tab)
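
For anyone who wants to redo the arithmetic in points 1) and 2) with their own numbers, here is a minimal Python sketch. The function names and the default 20% headroom are my own choices; the 300W per card figure is the 290/390-class estimate from point 1). It ignores CPU and motherboard draw, so treat the result as a rough lower bound, not a sizing rule.

```python
def rail_watts(amps, volts=12.0):
    """Power available on one rail: P = I * V."""
    return amps * volts


def psu_target_watts(cards, watts_per_card, headroom_pct=20):
    """Expected GPU draw plus a percentage of headroom.

    headroom_pct=20 mirrors the "about 20% headroom" rule of thumb;
    CPU/motherboard draw is deliberately left out (assumption).
    """
    expected = cards * watts_per_card
    return expected * (100 + headroom_pct) / 100


if __name__ == "__main__":
    print(3 * rail_watts(30))        # three 30A rails -> 1080.0
    print(3 * rail_watts(50))        # three 50A rails -> 1800.0
    print(psu_target_watts(4, 300))  # four 300W cards + 20% -> 1440.0
```

So four 300W cards plus 20% comes to 1440W for the GPUs alone, which is the one case where a 1600W unit starts to make sense.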
legendary
Activity: 3164
Merit: 1003
July 28, 2016, 11:25:12 AM
#66

Thx. This happens about once every 2 weeks, unpredictably, and I think it may be the app (ccminer), not sure.
I just wanted to post that as one of the secondary issues for now.
Right now I'm mining a straight algo with no crashes at all... very, very smooth at 93F room temp.
legendary
Activity: 1848
Merit: 1166
My AR-15 ID's itself as a toaster. Want breakfast?
July 28, 2016, 08:15:33 AM
#65
AFAIK = as far as I know.

If it's a temperature problem, there's a system-level component that's having an issue, I would think.

Run a motherboard monitor and CPU monitor... but a voltage issue could still be the case. I hope you are figuring your power availability per rig at 120-160% of expected draw? If I ran a 4-card machine I would for sure be running a 1600W power supply.

I wonder if you are drawing too much +12V off the same rail that supplies the processor, and that is causing all this to happen.

I have seen many strange configurations after opening up power supplies and seeing what is tapping which available rail. Many PC power supplies have 3-4 independent +12V circuits at roughly 50A each, give or take. As for knowing how they are distributed, you often have to open the power supply to know the real truth.

You don't happen to be CPU mining at the same time, are you?
legendary
Activity: 1470
Merit: 1114
July 27, 2016, 05:45:07 PM
#64

http://www.urbandictionary.com/define.php?term=afaik

This is the first time you've posted this symptom. Has it always crashed this way?
If you think it's the miner, try a different one.
legendary
Activity: 3164
Merit: 1003
July 27, 2016, 02:46:31 PM
#63

What is AFAIK, joblo? And it's on neoscrypt or lyra2v2 that it crashes, once in a great while, showing this sign. Some people have reported that, I think... but it also happens at room temp too. There may be too many cards for Windows memory to handle. But I will look into what you said.
Someone did say they were having problems with those private releases.
Thx
legendary
Activity: 1470
Merit: 1114
July 27, 2016, 11:11:16 AM
#62

If it was a bug, others would likely also see it, but AFAIK no one else has seen this crash. You're getting corruption in the CPU domain (core, cache, RAM), either due to a HW fault or heat induced. If there is a pattern to the fault addresses, it's probably a HW fault. If they're random, it's probably heat induced.

Due to the extremely high ambient temperature, your sensors may not detect overheating in places where it isn't expected.
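
One way to check for a pattern is to write down the fault address from each crash dialog and tally them. Here is a rough Python sketch of that bookkeeping. The addresses are made up, and the "seen at least twice" threshold is an arbitrary assumption for illustration, not a diagnostic standard.

```python
from collections import Counter


def classify_faults(addresses, repeat_threshold=2):
    """Label a list of fault addresses as 'pattern' or 'random'.

    'pattern' means some address recurs at least repeat_threshold times,
    which would point toward a HW fault; otherwise 'random', which
    would point toward heat. Purely a tallying heuristic.
    """
    # Normalise case so 0xABC and 0xabc count as the same address.
    counts = Counter(a.strip().lower() for a in addresses)
    if not counts:
        return "no data"
    _, hits = counts.most_common(1)[0]
    return "pattern" if hits >= repeat_threshold else "random"


if __name__ == "__main__":
    # Hypothetical addresses copied by hand from crash dialogs.
    log = ["0x00007FF6A1B2C3D4", "0x00007ff6a1b2c3d4", "0x0000000000000010"]
    print(classify_faults(log))  # pattern
```

A handful of crashes is a small sample, so this only becomes meaningful after several have been logged.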
legendary
Activity: 3164
Merit: 1003
July 27, 2016, 07:13:00 AM
#61

It could be a bug in ccminer for neoscrypt, lyra2v2, etc., not sure.
I have a fan at high speed, but the temperatures (room temp) were so high that I sometimes turned off the rig from 1pm to 7pm... cooler temps are on the way.
There's not much I can do until temps drop... it broke records.
legendary
Activity: 3164
Merit: 1003
July 27, 2016, 07:08:10 AM
#60

Thx. That is my 2nd PSU, and I have a third to SLI for my extra 970, but it is hard for Windows to recognize it and it crashes within 3 minutes. Since I made some changes, I will try again soon.
I have a fan at high speed on the rig.
This only comes up once in a great while.
legendary
Activity: 1848
Merit: 1166
My AR-15 ID's itself as a toaster. Want breakfast?
July 26, 2016, 11:04:29 PM
#59

Try a different power supply.

You'd be amazed how many are bad and have you point the finger somewhere else.
legendary
Activity: 1470
Merit: 1114
July 26, 2016, 04:28:33 PM
#58

Looks like a null pointer dereference. That's usually software, but in your case it could be excess heat in the CPU or RAM.
How is the ventilation around the mobo? Maybe heat from the GPUs is destabilizing the CPU.

Edit: It could also be bad RAM. Make a note of whether the fault addresses are always the same, especially the instruction address.
legendary
Activity: 3164
Merit: 1003
July 26, 2016, 05:48:19 AM
#57
Thx JK, see my post on zpool plz.
https://bitcointalksearch.org/topic/m.15712991

But I may still give this a try. :)
legendary
Activity: 1848
Merit: 1166
My AR-15 ID's itself as a toaster. Want breakfast?
July 25, 2016, 07:26:26 PM
#56
And where is that command, plz, for keeping the rig in administrative mode on reboot? Thx
"c:\progra~1\nvidia~1\NVSMI\nvidia-smi.exe -acp UNRESTRICTED" removes the admin requirement. Run it as administrator in a command prompt.

I haven't tried rebooting to see if this needs to be repeated on boot. If so, it will need to be automated at boot-up somehow.

But one P-state issue I've tracked to the neoscrypt algo. Try removing it for a time and see how much it helps.
legendary
Activity: 3164
Merit: 1003
July 25, 2016, 06:08:09 AM
#55
And where is that command, plz, for keeping the rig in administrative mode on reboot? Thx
legendary
Activity: 3164
Merit: 1003
July 25, 2016, 05:38:20 AM
#54
Thx JaredKaragen, looking into it. And where is that command, plz, for keeping the rig in administrative mode on reboot? Thx
legendary
Activity: 3164
Merit: 1003
July 25, 2016, 05:32:12 AM
#53
Thx joblo. Yes, still in a severe heat wave... never seen this before. One day I turned off the miners until the evening. Today same thing, I think 100F.

EDIT: I do have the temperatures of the cards set at a max, not to exceed 79C, etc.
And I used to get this, but was going to talk about it later.
Going to myr-gr, also on neoscrypt, so not related to algo but memory or ccminer? Maybe the intensity setting.
legendary
Activity: 1848
Merit: 1166
My AR-15 ID's itself as a toaster. Want breakfast?
July 24, 2016, 07:51:43 PM
#52
Thx JK, I have done this. Will that keep them in the P0 state on reboot?
EDIT: I'm going to give this a try again asap. Thx

No. Admin rights for this will first have to be deactivated (check the zpool thread for the command); then reboot, and it shouldn't need admin rights anymore.

Then add the path to the NVSMI folder to the system PATH and you can just call it from any old command prompt/provision allotment =)

Issue the memory and clock set command on every reboot, or better, right before the miner app launches.
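
To illustrate the "right before the miner app launches" ordering, here is a small Python sketch of a launcher. It assumes nvidia-smi is already on the PATH and that admin rights have been lifted as described. The miner command (start_miner.bat) is a hypothetical placeholder, and the clock values are only the GTX 970 examples quoted in this thread, so substitute your own.

```python
import subprocess


def launch_sequence(mem_mhz, gfx_mhz, miner_cmd, nvidia_smi="nvidia-smi"):
    """Return the commands to run, in order: set clocks, then start the miner."""
    set_clocks = [nvidia_smi, "-ac", f"{mem_mhz},{gfx_mhz}"]
    return [set_clocks, list(miner_cmd)]


def run_sequence(commands, runner=subprocess.run):
    """Run each command in turn; stop if one fails (check=True raises)."""
    for cmd in commands:
        runner(cmd, check=True)


if __name__ == "__main__":
    # "start_miner.bat" is a made-up name standing in for your miner launcher.
    for cmd in launch_sequence(3505, 1531, ["start_miner.bat"]):
        print(" ".join(cmd))  # dry run: only prints the commands
```

Calling run_sequence(...) instead of printing would actually execute them; the dry run is the default here so nothing touches the cards by accident.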
newbie
Activity: 39
Merit: 0
July 24, 2016, 10:30:48 AM
#51
So damn...
Great idea. This is what I wanted at first, but until now I have not found a good example.
My GPU needs a big upgrade for this.
legendary
Activity: 1470
Merit: 1114
July 24, 2016, 06:48:14 AM
#50
A note, 7-21-2016:
My rig mined all day with all cards with MC. :)
Room temp 92F, humidity 35%.
For the four days before that, the rig crashed at only 87F, humidity 85%.
Possibly the humidity played some role in the crashing with the GTX 970 that mines in P2 (default clocks, 1413 core).
Maybe micro dust.
This week's temps are supposed to be a record-breaking 100F... humidity?
Slowly, one thing at a time, to pin down the 3 things contributing to the crashing.
Due to the heat I will have to shut down the rig for some time during the day for the next 5 days.

EDIT 7-22-16: The rig mined all day today with MC, room temp max 95F, humidity 45%. The only change I made was to turn off OC'ing and change "delay": 15 to "delay": 30.
So one problem is heat related: cards changing algos, going from the P2 state to the P8 state in order to mine, or what some call spin-down time.
The hotter it is, the longer spin-down takes.

Interesting observation. I never really considered that it would be more susceptible to crashing when switching algos, or that spin-down time could mitigate it. Good to see you're making progress with your "heat chamber" testing.
legendary
Activity: 3164
Merit: 1003
July 24, 2016, 05:35:14 AM
#49
Here's another useful cut and paste. I was able to kick my live mining machine from P2 into P0, no interruptions.

I am thinking of building this command into my batch file so that every time it re-launches the miner apps it makes sure the P-state is zero.

Code:
Run this in cmd in admin mode:
cd c:\progra~1\nvidia~1\NVSMI\
nvidia-smi -q -d SUPPORTED_CLOCKS | more
(that's a pipe before more)

Take the top number for memory, something like 3505 (GTX 970), and the first number for graphics, something like 1531. Again in admin mode, enter your numbers in this format:

nvidia-smi -ac 3505,1531

Your card's memory will now run in compute mode (P0).

For my ASUS GTX 980: nvidia-smi.exe -ac 3505,1443
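
The "take the top number" step can also be scripted. This is a hedged Python sketch that pulls the first Memory and first Graphics values out of saved SUPPORTED_CLOCKS output and builds the -ac command. The sample text is an assumption about the output layout ("Memory : N MHz" / "Graphics : N MHz" lines); only 3505 and 1531 come from the post above, the other values are invented. Check it against what your own driver prints.

```python
import re

# Assumed layout of `nvidia-smi -q -d SUPPORTED_CLOCKS`; illustrative values.
SAMPLE = """
    Supported Clocks
        Memory                      : 3505 MHz
            Graphics                : 1531 MHz
            Graphics                : 1518 MHz
        Memory                      : 3004 MHz
            Graphics                : 1531 MHz
"""


def top_clocks(smi_text):
    """Return (memory_mhz, graphics_mhz) from the first matching lines, or None."""
    mem = re.search(r"Memory\s*:\s*(\d+)\s*MHz", smi_text)
    gfx = re.search(r"Graphics\s*:\s*(\d+)\s*MHz", smi_text)
    if mem is None or gfx is None:
        return None
    return int(mem.group(1)), int(gfx.group(1))


def set_clocks_command(mem_mhz, gfx_mhz):
    """Build the command string in the memory,graphics format shown above."""
    return f"nvidia-smi -ac {mem_mhz},{gfx_mhz}"


if __name__ == "__main__":
    clocks = top_clocks(SAMPLE)
    if clocks is not None:
        print(set_clocks_command(*clocks))  # nvidia-smi -ac 3505,1531
```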

I've also noticed X11EVO is a power-hungry algo. One machine used to sit at 80-84°C when running only X11EVO. Now it's running Lyra2 full-time and sits at 65°C.

Thx JK, I have done this. Will that keep them in the P0 state on reboot?
EDIT: I'm going to give this a try again asap. Thx