BAMT version 0.5 - Easy USB based mining Linux with farm wide management tools - page 67.

boozer

sr. member

Activity: 309

Merit: 250

Quote from: BitMinerN8 on March 05, 2012, 11:02:38 PM

Quote from: lodcrappo on March 05, 2012, 10:58:05 PM

you can put

detect_defunct: 0

in settings area of bamt.conf. that will stop mother from rebooting. i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.

You say you have 3 GPU on this rig. Is the GPU in the middle or on an end? Maybe swap them around and see if the problem follows the card. I had 1 of the 5 on one of my rigs that was acting up. I moved it to the end with nothing blocking the fans and it's been stable now for 5 days.

Mother just put noGPU0 back into the ACTIVE directory, so it's not even running now; however, it just rebooted again... now it's the second core on that same card that got the OC removed... I had it at 800/300... sigh....

BitMinerN8:
It's at the top and currently running about 10C higher than the middle and bottom cards, with one core disabled and the other at stock. I'll try moving it to the "bottom" slot.. I have an open air rig, so the "bottom" slot runs the coolest as nothing is in its way... I assume there is no way to determine which GPU number it will get if I move it? I thought I remember reading somewhere that it was just based on the motherboard.

If I still have issues, I'll try reseating the heatsink and redoing the thermal paste.. and if there are still issues... I'll finish beating my head on a wall and return it, as I just bought it on ebay, lol.

BitMinerN8

hero member

Activity: 626

Merit: 500

Mining since May 2011.

Quote from: lodcrappo on March 05, 2012, 10:58:05 PM

Quote from: boozer on March 05, 2012, 10:55:29 PM

Quote from: lodcrappo on March 05, 2012, 10:17:46 PM

Quote from: boozer on March 05, 2012, 09:35:09 PM

What does it mean when mother puts noGPUx in the ACTIVE directory? I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory. I assume this is going badly for me and this core.... Especially since it happened at stock speeds.

yeah that is not good. actually never heard of it happening at stock speeds. your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know. when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card. the temp sensor is not all knowing. it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well, glad I could be the first to get it to happen at stock.

... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board. Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol. Thanks for all your help lodcrappo! I'll go try.... .... something, lol.

you can put

detect_defunct: 0

in settings area of bamt.conf. that will stop mother from rebooting. i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.

You say you have 3 GPU on this rig. Is the GPU in the middle or on an end? Maybe swap them around and see if the problem follows the card. I had 1 of the 5 on one of my rigs that was acting up. I moved it to the end with nothing blocking the fans and it's been stable now for 5 days.

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: boozer on March 05, 2012, 10:55:29 PM

Quote from: lodcrappo on March 05, 2012, 10:17:46 PM

Quote from: boozer on March 05, 2012, 09:35:09 PM

What does it mean when mother puts noGPUx in the ACTIVE directory? I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory. I assume this is going badly for me and this core.... Especially since it happened at stock speeds.

yeah that is not good. actually never heard of it happening at stock speeds. your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know. when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card. the temp sensor is not all knowing. it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well, glad I could be the first to get it to happen at stock.

... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board. Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol. Thanks for all your help lodcrappo! I'll go try.... .... something, lol.

you can put

detect_defunct: 0

in settings area of bamt.conf. that will stop mother from rebooting. i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.

one possibility: reseat the heatsink/apply thermal paste better (not more, just better, usually)/stuff like that
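To make the effect of that flag concrete, here is a minimal Python sketch of how a watchdog might gate its reboot on detect_defunct. This is not BAMT's actual code: the function name, the dict standing in for the settings area of bamt.conf, and the assumption that detection is on when the key is absent are all hypothetical.

```python
def reboot_on_defunct(settings, miner_is_defunct):
    """Decide whether a hung miner should trigger a reboot.

    settings: dict standing in for the parsed settings area (hypothetical).
    miner_is_defunct: True when the miner process has gone defunct.
    """
    # Assumption: detection stays on unless detect_defunct is explicitly 0.
    enabled = int(settings.get("detect_defunct", 1)) != 0
    return enabled and miner_is_defunct
```

With detect_defunct set to 0 the function returns False even for a hung miner, which lines up with the guess above that the rig would just sit there not mining instead of rebooting.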

boozer

sr. member

Activity: 309

Merit: 250

Quote from: lodcrappo on March 05, 2012, 10:17:46 PM

Quote from: boozer on March 05, 2012, 09:35:09 PM

What does it mean when mother puts noGPUx in the ACTIVE directory? I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory. I assume this is going badly for me and this core.... Especially since it happened at stock speeds.

yeah that is not good. actually never heard of it happening at stock speeds. your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know. when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card. the temp sensor is not all knowing. it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well, glad I could be the first to get it to happen at stock.

... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board. Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol. Thanks for all your help lodcrappo! I'll go try.... .... something, lol.

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: dmcurser on March 05, 2012, 10:37:45 PM

OK, I know this is probably a noob question, but the configuration file that comes up at startup: is that the cgminer file, or do I have to set it up to use cgminer?

if you are seeing a config file opening at startup, you are using an outdated version that probably doesn't even support cgminer.

please use a current (0.5c) image.

you will find an example config with all options detailed in /opt/bamt/examples, or on our Wiki in the Examples section. These documents apply only to the current version.

dmcurser

hero member

Activity: 502

Merit: 500

OK, I know this is probably a noob question, but the configuration file that comes up at startup: is that the cgminer file, or do I have to set it up to use cgminer?

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: boozer on March 05, 2012, 09:35:09 PM

What does it mean when mother puts noGPUx in the ACTIVE directory? I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory. I assume this is going badly for me and this core.... Especially since it happened at stock speeds.

yeah that is not good. actually never heard of it happening at stock speeds. your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know. when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card. the temp sensor is not all knowing. it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: gnar1ta$ on March 05, 2012, 09:48:42 PM

Quote from: lodcrappo on March 04, 2012, 10:31:13 PM

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1. This version is reported to be more stable than previous versions.

Second, I reduced the number of calls BAMT makes to the cgminer API. It may be simply that too many calls to the API cause cgminer to crash, and that is why some people had problems with stability. This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API. It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform. The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.

I've been running 2.3.1 on a 6 card rig with BAMT .4 for 12 days with no issues, so if the number of API calls changed from .4 to .5 that may have been the issue; otherwise it may just be that 2.3.1 is more stable.

yeah it might just be 2.3.1 that has brought peace to the realm. the reduction in api calls I made only yesterday, or maybe it was the day before. anyway long past any 0.4 version.

gnar1ta$

donator

Activity: 798

Merit: 500

Quote from: lodcrappo on March 04, 2012, 10:31:13 PM

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1. This version is reported to be more stable than previous versions.

Second, I reduced the number of calls BAMT makes to the cgminer API. It may be simply that too many calls to the API cause cgminer to crash, and that is why some people had problems with stability. This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API. It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform. The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.

I've been running 2.3.1 on a 6 card rig with BAMT .4 for 12 days with no issues, so if the number of API calls changed from .4 to .5 that may have been the issue; otherwise it may just be that 2.3.1 is more stable.

boozer

sr. member

Activity: 309

Merit: 250

What does it mean when mother puts noGPUx in the ACTIVE directory? I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory. I assume this is going badly for me and this core.... Especially since it happened at stock speeds.

boozer

sr. member

Activity: 309

Merit: 250

Quote from: lodcrappo on March 05, 2012, 05:13:10 PM

Quote from: malevolent on March 05, 2012, 04:08:30 PM

Quote from: boozer on March 05, 2012, 02:25:24 PM

Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time). it comes back up with the new config and it works fine... Any ideas?

Doesn't happen to me unless the OC is unstable.

that would make sense.

there is only one condition which will cause an automatic system reboot: a phoenix process stuck in the DEFUNCT state.

the only way I know of to make phoenix enter the defunct state is to lock up your GPU.

when overclocking is right on the edge of stability, sometimes the very process of stopping phoenix causes the GPU to lock up.
i was helping another user with exactly that problem yesterday (GPUs that hang when phoenix exits). reducing overclocking made the problem go away.

maybe it's the sudden drop in temperature, maybe it's something with how phoenix exits, maybe something else.. but it seems to be enough to push a card teetering on the edge of a crash right on over into crazy town.

if you're seeing this only occasionally, try backing off 5 MHz or so. it will probably go away, and you might avoid having a system that locks up "for no reason" every few weeks.

Cool, thx! Just thought it was weird that it did it on the restart of the process. I'll adjust the clocks and see if it keeps it from going into "crazy town"

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: malevolent on March 05, 2012, 04:08:30 PM

Quote from: boozer on March 05, 2012, 02:25:24 PM

Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time). it comes back up with the new config and it works fine... Any ideas?

Doesn't happen to me unless the OC is unstable.

that would make sense.

there is only one condition which will cause an automatic system reboot: a phoenix process stuck in the DEFUNCT state.

the only way I know of to make phoenix enter the defunct state is to lock up your GPU.

when overclocking is right on the edge of stability, sometimes the very process of stopping phoenix causes the GPU to lock up.
i was helping another user with exactly that problem yesterday (GPUs that hang when phoenix exits). reducing overclocking made the problem go away.

maybe it's the sudden drop in temperature, maybe it's something with how phoenix exits, maybe something else.. but it seems to be enough to push a card teetering on the edge of a crash right on over into crazy town.

if you're seeing this only occasionally, try backing off 5 MHz or so. it will probably go away, and you might avoid having a system that locks up "for no reason" every few weeks.
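For anyone curious what "stuck in the DEFUNCT state" looks like from the outside: ps marks such processes with a state starting with Z. Below is a rough Python sketch that picks those out of ps-style output. It is only an illustration, not how mother does it; the function name and the sample listing are made up, and how the miner shows up in the command column depends on how it was launched, so the name filter is an assumption too.

```python
from typing import List


def find_defunct(ps_output: str, name: str = "phoenix") -> List[int]:
    """Return PIDs whose state starts with Z (defunct) and whose command contains name."""
    pids = []
    # Skip the header row, then split each row into PID, STAT and the rest.
    for line in ps_output.strip().splitlines()[1:]:
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, stat, cmd = fields
        if stat.startswith("Z") and name in cmd:
            pids.append(int(pid))
    return pids


# Hypothetical listing in the shape of `ps -eo pid,stat,args`.
SAMPLE = """\
  PID STAT COMMAND
 1201 Ssl  python phoenix.py
 1342 Zl   [python] <defunct>
 1377 Z    [phoenix] <defunct>
"""
```

Calling find_defunct(SAMPLE) picks out only PID 1377: the first row matches the name but is not in state Z, and the second is defunct but does not match the name.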

malevolent

legendary

Activity: 3472

Merit: 1724

Quote from: boozer on March 05, 2012, 02:25:24 PM

Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time). it comes back up with the new config and it works fine... Any ideas?

Doesn't happen to me unless the OC is unstable.

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: Intention on March 05, 2012, 03:42:44 PM

Quote from: lodcrappo on March 05, 2012, 01:00:03 AM

The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage? Especially, why set it in all 3 profiles? This may be the problem.
You also don't need to set the GPU clock in profiles 0 and 1.

Try removing the voltage entirely and the GPU clock except for profile 2.

Alright so (each line is two spaces then command):
gpu0:
  # remove disabled: or set it to 0 to actually use this card..
  disabled: 0
  debug_oc: 1
  #core_speed_0: 800
  #core_speed_1: 850
  core_speed_2: 900

  #mem_speed_0: 300
  #mem_speed_1: 300
  mem_speed_2: 300

  #core_voltage_0: 1.125
  #core_voltage_1: 1.125
  #core_voltage_2: 1.125

Restarting gets:

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz
ADL_Overdrive5_ODPerformanceLevels_Set failed.

turn mem_speed back on for 0 and 1. your card probably has a default speed higher than 300 in profile 1, if not 0. that will prevent you from setting 2 to 300. a higher profile cannot have lower values than a lower profile, basic rule of overclocking (some cards don't care, some do).
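The ordering rule is easy to check before restarting. Here is a small Python sketch that flags the first profile whose value drops below the one before it. The function name is made up for illustration, and the 1000 MHz figure stands in for whatever default the card really has in profile 1, which this thread does not state.

```python
def first_violation(values):
    """Return the index of the first profile lower than its predecessor, else None.

    values: one setting (e.g. memory clock in MHz) listed for profiles 0, 1, 2.
    """
    for i in range(1, len(values)):
        if values[i] < values[i - 1]:
            return i
    return None


# Profile 1 left at a hypothetical card default of 1000, profile 2 forced to 300.
mem_defaults_kept = [300, 1000, 300]
# mem_speed set to 300 in all three profiles, as suggested above.
mem_all_set = [300, 300, 300]
```

first_violation(mem_defaults_kept) returns 2, the profile that would be rejected on a card that enforces the rule, while first_violation(mem_all_set) returns None.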

Intention

full member

Activity: 128

Merit: 100

Quote from: lodcrappo on March 05, 2012, 01:00:03 AM

The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage? Especially, why set it in all 3 profiles? This may be the problem.
You also don't need to set the GPU clock in profiles 0 and 1.

Try removing the voltage entirely and the GPU clock except for profile 2.

Alright so (each line is two spaces then command):
gpu0:
  # remove disabled: or set it to 0 to actually use this card..
  disabled: 0
  debug_oc: 1
  #core_speed_0: 800
  #core_speed_1: 850
  core_speed_2: 900

  #mem_speed_0: 300
  #mem_speed_1: 300
  mem_speed_2: 300

  #core_voltage_0: 1.125
  #core_voltage_1: 1.125
  #core_voltage_2: 1.125

Restarting gets:

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz
ADL_Overdrive5_ODPerformanceLevels_Set failed.

boozer

sr. member

Activity: 309

Merit: 250

Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time). it comes back up with the new config and it works fine... Any ideas?

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: Intention on March 05, 2012, 12:55:48 AM

I'm not sure if this was ever resolved; aside from my being a Linux noob, I think the discussion kind of trailed off as far as I could follow.

None of the overclocking configurations in the bamt.conf work on my HIS 5850. They tweak my Sapphire 5830 fine, and when I used to run only the 5850 under an older version of BAMT (I think it was 0.3?) the settings worked correctly.

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 0: DISPLAY=:0.0 /usr/local/bin/atitweak -P 0 -A 0 -e 800 -m 300 -v 1.125

Results:
Setting performance level 0 on adapter 0: engine clock 800MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

OC command - profile 1: DISPLAY=:0.0 /usr/local/bin/atitweak -P 1 -A 0 -e 850 -m 300 -v 1.125

Results:
Setting performance level 1 on adapter 0: engine clock 850MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300 -v 1.125

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage? Especially, why set it in all 3 profiles? This may be the problem.
You also don't need to set the GPU clock in profiles 0 and 1.

Try removing the voltage entirely and the GPU clock except for profile 2.
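In terms of the atitweak command line, dropping a setting just means dropping its flag. The Python sketch below rebuilds the commands shown in the debug output from optional values. The flags (-P, -A, -e, -m, -v) and the binary path are taken from that output; the function itself is hypothetical and is not how BAMT assembles the call.

```python
def atitweak_args(adapter, profile, core=None, mem=None, voltage=None):
    """Build an atitweak argument list, leaving out any setting that is None."""
    args = ["/usr/local/bin/atitweak", "-P", str(profile), "-A", str(adapter)]
    if core is not None:
        args += ["-e", str(core)]      # engine clock in MHz
    if mem is not None:
        args += ["-m", str(mem)]       # memory clock in MHz
    if voltage is not None:
        args += ["-v", str(voltage)]   # core voltage in VDC
    return args


# Profile 2 with the voltage removed, as suggested above.
no_voltage = " ".join(atitweak_args(0, 2, core=900, mem=300))
# Profile 2 as originally configured, voltage included.
with_voltage = " ".join(atitweak_args(0, 2, core=900, mem=300, voltage=1.125))
```

Here no_voltage comes out as "/usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300", and with_voltage is the same command with "-v 1.125" appended, matching the profile 2 line in the debug output apart from the DISPLAY=:0.0 prefix.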

Intention

full member

Activity: 128

Merit: 100

I'm not sure if this was ever resolved; aside from my being a Linux noob, I think the discussion kind of trailed off as far as I could follow.

None of the overclocking configurations in the bamt.conf work on my HIS 5850. They tweak my Sapphire 5830 fine, and when I used to run only the 5850 under an older version of BAMT (I think it was 0.3?) the settings worked correctly.

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 0: DISPLAY=:0.0 /usr/local/bin/atitweak -P 0 -A 0 -e 800 -m 300 -v 1.125

Results:
Setting performance level 0 on adapter 0: engine clock 800MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

OC command - profile 1: DISPLAY=:0.0 /usr/local/bin/atitweak -P 1 -A 0 -e 850 -m 300 -v 1.125

Results:
Setting performance level 1 on adapter 0: engine clock 850MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300 -v 1.125

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.

lodcrappo

hero member

Activity: 616

Merit: 506

Quote from: jamesg on March 04, 2012, 10:58:19 AM

I have just upgraded my entire farm after testing 0.5 on 4 rigs and it seems to solve most issues cgminer was having!

Thanks lodcrappo!

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1. This version is reported to be more stable than previous versions.

Second, I reduced the number of calls BAMT makes to the cgminer API. It may be simply that too many calls to the API cause cgminer to crash, and that is why some people had problems with stability. This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API. It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform. The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.

jamesg

vip

Activity: 1358

Merit: 1000

AKA: gigavps

I have just upgraded my entire farm after testing 0.5 on 4 rigs and it seems to solve most issues cgminer was having!

Thanks lodcrappo!

Topic: BAMT version 0.5 - Easy USB based mining Linux with farm wide management tools - page 67. (Read 324186 times)