Topic: BAMT version 0.5 - Easy USB based mining Linux with farm wide management tools

sr. member
Activity: 309
Merit: 250

you can put

  detect_defunct: 0

in settings area of bamt.conf.  that will stop mother from rebooting.  i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.


You say you have 3 GPU on this rig. Is the GPU in the middle or on an end? Maybe swap them around and see if the problem follows the card. I had 1 of the 5 on one of my rigs that was acting up. I moved it to the end with nothing blocking the fans and it's been stable now for 5 days.

Mother just put noGPU0 back into the ACTIVE directory, so it's not even running now. However, it just rebooted again... now it's the second core on that same card that got the OC removed... I had it at 800/300... sigh....


BitMinerN8:
It's at the top and currently running about 10C higher than the middle and bottom cards, with one core disabled and the other at stock.  I'll try moving it to the "bottom" slot.. I have an open air rig, so the "bottom" slot runs the coolest as nothing is in its way...  I assume there is no way to determine what GPU number it will be if I move it?  I thought I remembered reading somewhere that it was just based on the motherboard.


If I still have issues, I'll try redoing the thermal paste and reseating the heatsink.. and if there are still issues... I'll finish beating my head on a wall and return it, as I just bought it on ebay, lol.
hero member
Activity: 626
Merit: 500
Mining since May 2011.
What does it mean when mother puts noGPUx in the ACTIVE directory, I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory.  I assume this is going badly for me and this core....  Especially since it happened at stock speeds.

yeah that is not good.  actually never heard of it happening at stock speeds.  your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know.  when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card.  the temp sensor is not all knowing.  it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well glad I could be a first to get it to happen at stock... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board.  Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol.  Thanks for all your help lodcrappo!  I'll go try....  .... something, lol.

you can put

  detect_defunct: 0

in settings area of bamt.conf.  that will stop mother from rebooting.  i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.


You say you have 3 GPU on this rig. Is the GPU in the middle or on an end? Maybe swap them around and see if the problem follows the card. I had 1 of the 5 on one of my rigs that was acting up. I moved it to the end with nothing blocking the fans and it's been stable now for 5 days.
hero member
Activity: 616
Merit: 506
What does it mean when mother puts noGPUx in the ACTIVE directory, I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory.  I assume this is going badly for me and this core....  Especially since it happened at stock speeds.

yeah that is not good.  actually never heard of it happening at stock speeds.  your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know.  when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card.  the temp sensor is not all knowing.  it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well glad I could be a first to get it to happen at stock... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board.  Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol.  Thanks for all your help lodcrappo!  I'll go try....  .... something, lol.

you can put

  detect_defunct: 0

in settings area of bamt.conf.  that will stop mother from rebooting.  i'd guess that would just leave the thing hung up and not mining every 20 minutes instead of rebooting, but who knows.

if temps are jumping all over the place... well doesn't sound good to me.

one possibility: reseat the heatsink/apply thermal paste better (not more, just better, usually)/stuff like that
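
For anyone who wants to double check what their config actually says before blaming mother: a rough Python sketch that pulls one key out of one section of a bamt.conf-style file. It assumes the file is laid out like the snippets in this thread (a top-level `settings:` or `gpu0:` line, then two-space-indented `key: value` lines). The function name `read_setting` and the example text are made up for illustration; this is not BAMT code, and a real YAML parser would be the safer choice.

```python
def read_setting(conf_text, section, key):
    """Return the raw string value of `key` inside top-level `section`, or None."""
    in_section = False
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing spaces
        if not line.strip():
            continue                          # skip blank / comment-only lines
        if not line.startswith(" "):          # unindented line starts a new section
            in_section = line.strip() == section + ":"
            continue
        if in_section:
            name, sep, value = line.strip().partition(":")
            if sep and name.strip() == key:
                return value.strip()
    return None


# Hypothetical fragment; only detect_defunct and disabled appear in the thread.
example = """\
settings:
  # stop mother from rebooting on a defunct miner
  detect_defunct: 0

gpu0:
  disabled: 0
"""

print(read_setting(example, "settings", "detect_defunct"))
```

If this returns None, the key is simply not set in that section, and whatever default BAMT uses applies.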
sr. member
Activity: 309
Merit: 250
What does it mean when mother puts noGPUx in the ACTIVE directory, I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory.  I assume this is going badly for me and this core....  Especially since it happened at stock speeds.

yeah that is not good.  actually never heard of it happening at stock speeds.  your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know.  when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card.  the temp sensor is not all knowing.  it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

lol... well glad I could be a first to get it to happen at stock... At stock BAMT reboots every 20ish minutes due to this card... I tried multiple times removing it from ACTIVE and it just keeps coming back.... I tried removing it from ACTIVE, leaving it at stock and clocking the mem to 300, but then it reboots and sets it back to stock mem..... think I might have got a bad card off ebay... I don't have any cards venting into it... just 3 cards on the board.  Fan is still running, although sometimes temps jump all over the place, so maybe the fan is cutting in and out... This card has been a pain in my ass, lol.  Thanks for all your help lodcrappo!  I'll go try....  .... something, lol.
hero member
Activity: 616
Merit: 506
OK, I know this is probably a noob question, but the configuration file that comes up at startup, is that the cgminer file, or do I have to set it up to use cgminer?


if you are seeing a config file opening at startup, you are using an outdated version that probably doesn't even support cgminer.

please use a current (0.5c) image.

you will find an example config with all options detailed in /opt/bamt/examples, or on our Wiki in the Examples section.  These documents apply only to the current version.


hero member
Activity: 502
Merit: 500
OK, I know this is probably a noob question, but the configuration file that comes up at startup, is that the cgminer file, or do I have to set it up to use cgminer?
hero member
Activity: 616
Merit: 506
What does it mean when mother puts noGPUx in the ACTIVE directory, I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory.  I assume this is going badly for me and this core....  Especially since it happened at stock speeds.

yeah that is not good.  actually never heard of it happening at stock speeds.  your fan still turning and whatnot?

the presence of that file means phoenix went defunct, which means your GPU went insane as far as I know.  when the GPU stops responding, phoenix tends to become very upset.

might want to inspect that something isn't burning up on the card.  the temp sensor is not all knowing.  it only knows temp in one place.
if you have another card exhausting onto that one, it could be crazy hot in a place the sensor doesn't really read, stuff like that.

or.. could just be borken

hero member
Activity: 616
Merit: 506

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1.  This version is reported to be more stable than previous versions.  

Second, I reduced the number of calls BAMT makes to the cgminer API.  It may be simply that too many calls to the API cause cgminer crashes, and that is why some people had problems with stability.   This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API.  It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform.  The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.


I've been running 2.3.1 on a 6 card rig with BAMT .4 for 12 days with no issues, so if the number of API calls changed from .4 to .5 that may have been the issue, otherwise it may just be that 2.3.1 is more stable.

yeah it might just be 2.3.1 that has brought peace to the realm.  the reduction in api calls I made only yesterday, or maybe it was the day before.  anyway long past any 0.4 version.

donator
Activity: 798
Merit: 500

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1.  This version is reported to be more stable than previous versions.  

Second, I reduced the number of calls BAMT makes to the cgminer API.  It may be simply that too many calls to the API cause cgminer crashes, and that is why some people had problems with stability.   This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API.  It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform.  The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.


I've been running 2.3.1 on a 6 card rig with BAMT .4 for 12 days with no issues, so if the number of API calls changed from .4 to .5 that may have been the issue, otherwise it may just be that 2.3.1 is more stable.
sr. member
Activity: 309
Merit: 250
What does it mean when mother puts noGPUx in the ACTIVE directory, I found out that was the card I was having issues with... I reset it to stock speeds and a short time later it rebooted and put noGPUx in the ACTIVE directory.  I assume this is going badly for me and this core....  Especially since it happened at stock speeds.
sr. member
Activity: 309
Merit: 250
Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time).  it comes back up with the new config and it works fine... Any ideas?


Doesn't happen to me unless the OC is unstable.

that would make sense.

there is only one condition which will cause an automatic system reboot: a phoenix process stuck in the DEFUNCT state.

the only way I know of to make phoenix enter the defunct state is to lock up your GPU.

when overclocking is right on the edge of stability, sometimes the very process of stopping phoenix causes the GPU to lock up.  
i was helping another user with exactly that problem yesterday (GPUs that hang when phoenix exits).  reducing overclocking made the problem go away.

maybe it's the sudden drop in temperature, maybe it's something with how phoenix exits, maybe something else.. but it seems to be enough to push a card teetering on the edge of a crash right on over into crazy town.

if you're seeing this only occasionally, try backing off 5 MHz or so.  it will probably go away, and you might avoid having a system that locks up "for no reason" every few weeks.


Cool, thx!  Just thought it was weird that it did it on the restart of the process.  I'll adjust the clocks and see if it keeps it from going into "crazy town".
hero member
Activity: 616
Merit: 506
Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time).  it comes back up with the new config and it works fine... Any ideas?


Doesn't happen to me unless the OC is unstable.

that would make sense.

there is only one condition which will cause an automatic system reboot: a phoenix process stuck in the DEFUNCT state.

the only way I know of to make phoenix enter the defunct state is to lock up your GPU.

when overclocking is right on the edge of stability, sometimes the very process of stopping phoenix causes the GPU to lock up.  
i was helping another user with exactly that problem yesterday (GPUs that hang when phoenix exits).  reducing overclocking made the problem go away.

maybe it's the sudden drop in temperature, maybe it's something with how phoenix exits, maybe something else.. but it seems to be enough to push a card teetering on the edge of a crash right on over into crazy town.

if you're seeing this only occasionally, try backing off 5 MHz or so.  it will probably go away, and you might avoid having a system that locks up "for no reason" every few weeks.
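
For the curious, here is roughly what "stuck in the DEFUNCT state" looks like from the Linux side. On Linux a defunct (zombie) process shows state `Z` in `/proc/<pid>/stat`, which is the same thing `ps` prints as `<defunct>`. The sketch below only parses that one field. It is not how mother does its check (I am not showing BAMT code here), and the two sample lines are made up.

```python
def proc_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so take everything after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def is_defunct(stat_line):
    """True if the process is a zombie, i.e. what ps shows as <defunct>."""
    return proc_state(stat_line) == "Z"


# Made-up sample lines, truncated to the first few fields.
sample_sleeping = "1234 (miner) S 1 1234 1234 0 -1 4202496"
sample_defunct = "1235 (odd name) x) Z 1 1235 1235 0 -1 4227084"

print(is_defunct(sample_sleeping), is_defunct(sample_defunct))
```

On a live rig you would feed it the contents of `/proc/<pid>/stat` for the miner's pid.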
legendary
Activity: 3472
Merit: 1722
Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time).  it comes back up with the new config and it works fine... Any ideas?


Doesn't happen to me unless the OC is unstable.
hero member
Activity: 616
Merit: 506
The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage?  Especially why set it in all 3 profiles?  This may be the problem
You also don't need to set the GPU clock in profiles 0 and 1.  

Try removing the voltage entirely and the GPU clock except for profile 2.

Alright so (each line is two spaces then command):
gpu0:

  # remove disabled: or set it to 0 to actually use this card..

  disabled: 0
  debug_oc: 1
  #core_speed_0: 800
  #core_speed_1: 850
  core_speed_2: 900

  #mem_speed_0: 300
  #mem_speed_1: 300
  mem_speed_2: 300

  #core_voltage_0: 1.125
  #core_voltage_1: 1.125
  #core_voltage_2: 1.125

Restarting gets:

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz
ADL_Overdrive5_ODPerformanceLevels_Set failed.



turn mem_speed back on for 0 and 1.   your card probably has a default speed higher than 300 in profile 1, if not 0.  that will prevent you from setting 2 to 300.  higher profile cannot have lower values than lower profile, basic rule of overclocking (some cards don't care, some do).
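
To make that rule concrete, a small Python sketch of the check: take the card's default clocks per profile, overlay whatever the config sets, and flag any profile that ends up lower than the one below it. The default numbers here are invented for illustration (I don't know this card's real defaults), and the helper names are hypothetical.

```python
def effective_clocks(defaults, overrides):
    """Overlay config values on card defaults; None means 'not set in config'."""
    return [d if o is None else o for d, o in zip(defaults, overrides)]


def rule_violations(clocks):
    """Return (lower_profile, higher_profile) pairs where the higher one is slower."""
    return [(i, i + 1) for i in range(len(clocks) - 1) if clocks[i + 1] < clocks[i]]


# Hypothetical default memory clocks for profiles 0, 1, 2 (MHz).
default_mem = [300, 900, 1000]

# Only mem_speed_2 set, as in the config above: profile 1 keeps its default.
only_profile_2 = effective_clocks(default_mem, [None, None, 300])
print(only_profile_2, rule_violations(only_profile_2))

# mem_speed set to 300 in all three profiles.
all_profiles = effective_clocks(default_mem, [300, 300, 300])
print(all_profiles, rule_violations(all_profiles))
```

With only profile 2 set, profile 1 is still at its (assumed) higher default, so the pair (1, 2) breaks the rule. Setting all three to 300 clears it.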
full member
Activity: 128
Merit: 100
The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage?  Especially why set it in all 3 profiles?  This may be the problem
You also don't need to set the GPU clock in profiles 0 and 1. 

Try removing the voltage entirely and the GPU clock except for profile 2.

Alright so (each line is two spaces then command):
gpu0:

  # remove disabled: or set it to 0 to actually use this card..

  disabled: 0
  debug_oc: 1
  #core_speed_0: 800
  #core_speed_1: 850
  core_speed_2: 900

  #mem_speed_0: 300
  #mem_speed_1: 300
  mem_speed_2: 300

  #core_voltage_0: 1.125
  #core_voltage_1: 1.125
  #core_voltage_2: 1.125

Restarting gets:

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz
ADL_Overdrive5_ODPerformanceLevels_Set failed.

sr. member
Activity: 309
Merit: 250
Not a huge deal, but something I have noticed is if I make config changes through gpumon, then hit Shift-R to restart the processes... randomly it does a reboot instead (I hit shift-r, it stops the processes and says "the system is going down for reboot now" during that time).  it comes back up with the new config and it works fine... Any ideas?
hero member
Activity: 616
Merit: 506
I'm not sure if this was ever resolved; aside from my being a Linux noob, I think the discussion kind of trailed off as far as I could follow.

None of the overclocking configurations in bamt.conf work on my HIS 5850.  They tweak my Sapphire 5830 fine, and when I used to run only the 5850 under an older version of BAMT (I think it was 0.3?) the settings worked correctly.

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 0: DISPLAY=:0.0 /usr/local/bin/atitweak -P 0 -A 0 -e 800 -m 300 -v 1.125

Results:
Setting performance level 0 on adapter 0: engine clock 800MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.


OC command - profile 1: DISPLAY=:0.0 /usr/local/bin/atitweak -P 1 -A 0 -e 850 -m 300 -v 1.125

Results:
Setting performance level 1 on adapter 0: engine clock 850MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.


OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300 -v 1.125

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.


The other person with a problem like this just had bad parameters in their config.

Is there a reason you are setting voltage?  Especially why set it in all 3 profiles?  This may be the problem
You also don't need to set the GPU clock in profiles 0 and 1. 

Try removing the voltage entirely and the GPU clock except for profile 2.
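
To show what that change does to the commands in the debug output, here is a Python sketch that builds the atitweak argument list for one profile from the config keys. It only uses the flags visible in the output above (-P, -A, -e, -m, -v) and skips anything not set. The function `oc_command` is made up for illustration and is only a guess at the shape of the logic, not BAMT's actual code.

```python
def oc_command(adapter, profile, conf):
    """Build atitweak arguments for one profile, or None if nothing is set for it."""
    base = ["/usr/local/bin/atitweak", "-P", str(profile), "-A", str(adapter)]
    flags = (("-e", "core_speed"), ("-m", "mem_speed"), ("-v", "core_voltage"))
    extra = []
    for flag, key in flags:
        value = conf.get("%s_%d" % (key, profile))  # e.g. core_speed_2
        if value is not None:
            extra += [flag, str(value)]
    return base + extra if extra else None


# Config with voltage removed and clocks set only for profile 2.
trimmed = {"core_speed_2": 900, "mem_speed_2": 300}

for p in range(3):
    cmd = oc_command(0, p, trimmed)
    print(p, " ".join(cmd) if cmd else "nothing to set")
```

With the trimmed config only profile 2 produces a command, and it carries no -v flag.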
full member
Activity: 128
Merit: 100
I'm not sure if this was ever resolved; aside from my being a Linux noob, I think the discussion kind of trailed off as far as I could follow.

None of the overclocking configurations in bamt.conf work on my HIS 5850.  They tweak my Sapphire 5830 fine, and when I used to run only the 5850 under an older version of BAMT (I think it was 0.3?) the settings worked correctly.

--[ Debug info for O/C on GPU 0 ]------------------------------------------------

GPU is enabled, overclocking is enabled

OC command - profile 0: DISPLAY=:0.0 /usr/local/bin/atitweak -P 0 -A 0 -e 800 -m 300 -v 1.125

Results:
Setting performance level 0 on adapter 0: engine clock 800MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.


OC command - profile 1: DISPLAY=:0.0 /usr/local/bin/atitweak -P 1 -A 0 -e 850 -m 300 -v 1.125

Results:
Setting performance level 1 on adapter 0: engine clock 850MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.


OC command - profile 2: DISPLAY=:0.0 /usr/local/bin/atitweak -P 2 -A 0 -e 900 -m 300 -v 1.125

Results:
Setting performance level 2 on adapter 0: engine clock 900MHz, memory clock 300MHz, core voltage 1.125VDC
ADL_Overdrive5_ODPerformanceLevels_Set failed.
hero member
Activity: 616
Merit: 506
I have just upgraded my entire farm after testing 0.5 on 4 rigs and it seems to solve most issues cgminer was having!

Thanks lodcrappo!

That is good to hear.

Although I am not 100% positive about this, it seems the issues some people reported with cgminer are now resolved.

Two changes have been made:

First, we've brought cgminer up to version 2.3.1.  This version is reported to be more stable than previous versions.  

Second, I reduced the number of calls BAMT makes to the cgminer API.  It may be simply that too many calls to the API cause cgminer crashes, and that is why some people had problems with stability.   This would explain why people with more GPUs saw more trouble, since more GPUs = more calls to the API.  It would also explain why disabling the BAMT monitoring allowed cgminer to run better on an otherwise unchanged platform.  The theory seems possible, but I cannot be sure of this.

In any case, it seems BAMT 0.5c (or an earlier image with fixes through 14) is providing a stable cgminer platform.
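
For anyone writing their own monitoring against a miner API, one generic way to keep the call count from growing with the number of GPUs is to cache a single poll and let every per-GPU reader share it. The Python sketch below shows that idea only. It is not a description of what was actually changed in BAMT, and `CachedPoll`, the fake fetch function and the 5 second interval are all made up for illustration.

```python
import time


class CachedPoll:
    """Call `fetch` at most once per `min_interval` seconds; reuse the result otherwise."""

    def __init__(self, fetch, min_interval, clock=time.monotonic):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._last_time = None
        self._last_value = None
        self.calls = 0  # how many times the real fetch ran

    def get(self):
        now = self._clock()
        if self._last_time is None or now - self._last_time >= self._min_interval:
            self._last_value = self._fetch()
            self._last_time = now
            self.calls += 1
        return self._last_value


# Stand-in for one API request that returns stats for every GPU at once.
def fake_fetch():
    return {"gpu%d" % i: {"temp": 70 + i} for i in range(6)}


fake_now = [0.0]  # controllable clock so the example does not need to sleep
poll = CachedPoll(fake_fetch, 5.0, clock=lambda: fake_now[0])

temps = [poll.get()["gpu%d" % i]["temp"] for i in range(6)]
print(temps, poll.calls)
```

Six GPU reads inside one interval cost one fetch instead of six; the next fetch happens only once the interval has passed.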
vip
Activity: 1358
Merit: 1000
AKA: gigavps
I have just upgraded my entire farm after testing 0.5 on 4 rigs and it seems to solve most issues cgminer was having!

Thanks lodcrappo!