Topic: BAMT - Easy persistent USB key based linux for dedicated miners/mining farms - page 52. (Read 167481 times)

hero member
Activity: 616
Merit: 506
have you installed all the fixes?  your symptoms sound very much like the screensaver is running, which can happen in some situations (never figured out why, but pretty much forced it dead in one of the updates).

edit - even if you have installed fixes, might also want to:

cp /opt/bamt/.xscreensaver /home/user/

and reboot.  worth a shot.
newbie
Activity: 56
Merit: 0
Wanted to thank you for BAMT (already did on the bitclockers forum the other day) and crosspost this from there.

Quote
I've noticed a curious phenomenon. After running at 98-99% load for about 15 minutes, the second card in my rig (a 5870) drops to 75-80% load and stays there, and my hash rate drops the same. It's not a matter of overheating as it has a big aftermarket cooler, and temps stay around 58 degrees at 99% load (drop to 55 degrees at 75-80% - for comparison the other card is around 65 degrees at load, that's a 5850). I've played with various kernels, various aggression and workload settings but none seem to prevent this behavior. If I restart the mine then again the 2nd card is at 99% load for 10-15 minutes then drops to 80%.

I have no idea how to troubleshoot this any further.

48h later, the problem still persists. Occasionally I've seen the load on that card go even lower, down to 40-45%. It will occasionally recover by itself, but often needs a mine restart. And sometimes even that doesn't fix it. I tried several other things, like -q 2 but even that doesn't fix it.

I thought maybe it's overhead due to having X running on a headless box, bit of a waste if you ask me. I don't have much experience with Debian (I'm more of a Slackware guy) so I killed X as best I could but that brought the whole mine down unfortunately. I tried various kernels, including the latest svn of phatk (which I'm still running right now) with again no change in this behavior.
hero member
Activity: 616
Merit: 506
Ok fix #10 is ready.  Pretty soon going to need to just release an updated .img 


This fix mainly adds the 'atitweak' utility from:
http://forum.bitcoin.org/index.php?topic=25750.0
which is now used in the overclocking and status gathering routines.
Also, there are some new per GPU config options:

  core_voltage: 
  pre_oc_cmd:
  post_oc_cmd:

core_voltage should be pretty obvious.  This feature has been requested several times and atitweak provides a good way to implement it.  Be careful with this one; it's not hard to crash things when playing with voltages.

pre_oc_cmd and post_oc_cmd let you specify a command (or script) to be run before or after BAMT's own calls to atitweak for O/C.  If you want to do something BAMT doesn't do for you, this is your hook.  You can also comment out the BAMT O/C settings and do it all yourself if you prefer.

Warning:

atitweak allows a greater range of values for mem and core clocks than aticonfig does.  It will quite happily let you crash your rig.  Please test new settings by running the atitweak command manually before committing them to bamt.conf; otherwise you might find you've created a system that simply locks itself up at boot.  Bonus points for locking it up in a way that causes a reboot, enabling the eternal boot cycle (yes, I did this earlier tonight). :)

So, you ignored the warnings and committed the settings of death to bamt.conf... now what??
Recovery is pretty simple.  Boot as normal, but when X starts (when you first see a mouse cursor) press Ctrl-Alt-F1.  This will return you to the text boot console.  Wait a bit; the O/C routines will run but mining will not start.  Since the O/C stuff modifies only profile 2 (and you'll be in profile 0, since there is no load on the GPU), things don't crash.  You'll get a shell prompt where you can su, edit bamt.conf, and try again.

hero member
Activity: 616
Merit: 506
I've had some situations where the miner processes go zombie on me, and no amount of killing will make them die.  Always when I o/c a bit too much :)  I guess I will experiment with crashing a miner on purpose and see if coldreboot or the /proc/sys stuff helps at all.  If there is value in something like this I will add it.

I like atitweak quite a bit, thanks for the tip.  I'm working on adding it in the next fix, and also adding its voltage manipulation capabilities as settings in bamt.conf.  I'm also adding a user hook so you can just run any script/command you like per GPU during the O/C phase, which should allow easier customization for people who want to do really crazy stuff.

hero member
Activity: 616
Merit: 506
Ok.. ask and ye shall receive, sometimes...

Fix #9 is experimental until I get some feedback that (1) it works OK and (2) it is sufficient to solve the need for failing back.  You can get it using the fixer.

This fix adds "failback" support for pool files.  You can enable this by
adding one or two numbers to the end of the URLs in your pools file,
separated with commas.  For example:

http://u:[email protected]:8332/,10,60         

The first number is the maximum shares to complete before exiting. The
second is an optional override for the default pool_timeout in bamt.conf
You can omit the max shares if you only want to override pool_timeout by
leaving the number blank, ie: url,,number

So basically, if you want to mostly mine at the first pool in your file, but fail over to another pool if #1 is down, then try to go back to #1 every so often... put this in your pools file:

http://u:[email protected]
http://u:[email protected],10,60

This would make the miner try to switch back to the first pool every 10 shares, or if the backup didn't generate a share in 60 seconds.

It's a simple solution that didn't require any big changes in phoenix or complicated logic, which I like.  If this is an acceptable approach and it tests OK, I'll switch it from testing to a mainline fix.
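
For anyone scripting around the pools file, here is a rough Python sketch of how a line in the format described above could be split into its parts. It is only an illustration under assumptions: parse_pool_line and PoolEntry are made-up names (not BAMT or phoenix code), the URLs in the usage note are placeholders, and it assumes the URL itself contains no commas.

```python
from typing import NamedTuple, Optional


class PoolEntry(NamedTuple):
    url: str
    max_shares: Optional[int]    # None: no share limit was given
    pool_timeout: Optional[int]  # None: use the default pool_timeout from bamt.conf


def parse_pool_line(line: str) -> PoolEntry:
    """Split 'url[,max_shares[,pool_timeout]]' into its parts (hypothetical helper)."""
    fields = [field.strip() for field in line.strip().split(",")]
    if len(fields) > 3:
        raise ValueError(f"too many comma-separated fields: {line!r}")
    # Pad with blanks so that missing trailing fields behave like empty ones.
    url, shares, timeout = fields + [""] * (3 - len(fields))
    return PoolEntry(
        url=url,
        max_shares=int(shares) if shares else None,
        pool_timeout=int(timeout) if timeout else None,
    )
```

With this sketch, parse_pool_line("http://user:pass@backup.example.com:8332/,10,60") gives a share limit of 10 and a timeout of 60, while the url,,number form leaves max_shares as None and only sets the timeout.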

donator
Activity: 798
Merit: 500
I've almost always been able to reboot remotely with a hung card by using the force reboot commands in Linuxcoin.  As root, issue:
Code:
echo 1 > /proc/sys/kernel/sysrq
echo b > /proc/sysrq-trigger

sudo coldreboot works too.  I think you just have to be sure to kill the process with killall -9 python or kill -9 and wait for the crash.

Also, I know it's easy enough for a user to add, but maybe atitweak would be a good addition to future versions of BAMT.  It's much easier to use than aticonfig and AMDoverdriveCtrl for core, mem, volt, and fan settings.

http://forum.bitcoin.org/index.php?topic=25750.0
hero member
Activity: 728
Merit: 501
I've had that happen, was due to 1 of the cards being overclocked too much. Once I found the right overclocks, I've run BAMT for days without issue. Only time I take it down is when I'm messing around with it. If P
sr. member
Activity: 322
Merit: 250
I really do find that sometimes shutdown -r 0 doesn't seem to work; the system seems to hang.
hero member
Activity: 616
Merit: 506
yes, this is coming soon.  I have been reluctant to make large changes to Phoenix, but doing this at that level would require it.  I might implement some sort of intelligent proxy instead, but again that's a lot of changes.  i am open to suggestions on how people would like to see this, or whether it is important.
hero member
Activity: 616
Merit: 506
you've probably overclocked your cards a bit too much.  if it was just phoenix having issues, they would die and move to the next pool (or wrap around to first pool).  I would slow them down a little, chances are you won't have this problem any more.

at least for me, when a card locks up, a reboot doesn't help anyway, in fact it's worse because the hung card will prevent the system from coming back online.  have to reset via reset switch or even power off to get them alive again.  I might add some kind of auto reboot at some point, but I think you'll find that in many cases that would just lead to all your cards being down in the machine as it can't come back up, instead of the one being hung.
sr. member
Activity: 322
Merit: 250
Another suggestion:

When the primary miner is not working, it should switch to a backup miner; after some shares, it should try to connect to the primary miner again.
sr. member
Activity: 322
Merit: 250
Can you add some new settings? For example, when a miner has not been working for some minutes, reboot the system.

I find that some cards stop working for a long time.
hero member
Activity: 616
Merit: 506
Yes, there is no harm at all in re-applying them, although i'd recommend doing X through the current end of the series rather than just picking one out of the middle by itself, since some later updates fix things in the same files as earlier updates. 
sr. member
Activity: 435
Merit: 250
So, one can just reapply the patch without breaking anything because it was already applied?
That's awesome.
I guess I have a trauma with ' Reversed (or previously applied) patch detected!  Assume -R? [n] '.. ;)
hero member
Activity: 616
Merit: 506
good deal.   FYI (and anyone else reading who wonders about applying updates) you can edit or just delete /opt/bamt/fix.history to get the fixer to apply them again.  Delete the line for any individual fix you want to redo (order of lines in the file doesn't matter at all), or delete whole file to do them all again.
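
If you'd rather script that than edit by hand, something like the Python sketch below could do it. This is only an illustration under assumptions: forget_fixes is a made-up helper (not part of BAMT), it assumes one fix per line, and because the exact text of each line isn't shown here, the caller supplies the test that picks which lines to drop.

```python
from typing import Callable


def forget_fixes(history_path: str, should_forget: Callable[[str], bool]) -> int:
    """Rewrite the history file without the lines selected by should_forget.

    Hypothetical helper: returns the number of lines that were dropped.
    """
    with open(history_path, "r", encoding="utf-8") as fh:
        lines = fh.readlines()
    # Keep every line the caller did not ask to forget; order is preserved,
    # although the order of lines in the file does not matter.
    kept = [ln for ln in lines if not should_forget(ln.rstrip("\n"))]
    with open(history_path, "w", encoding="utf-8") as fh:
        fh.writelines(kept)
    return len(lines) - len(kept)
```

Look at the file first to see what its lines actually contain, then pass a test that matches the fix you want to redo. Deleting the whole file is still the simplest way to redo all of them.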
sr. member
Activity: 435
Merit: 250
Indeed, i must have at some point in time foobared the updates.
What i did now - fresh start, applied patches /after/ coffee this time, which means i applied them from 1 to 8 (no idea what lack of sleep made me do last time, heh) restarted everything.

Anddddd... what a nice Munin it is now.

Thank you for your fast reply and sorry for the false alarm.
Cheers, keep up the awesome (!) work.
hero member
Activity: 616
Merit: 506
Any chance you haven't applied the current fixes yet?  there were some boneheaded mistakes in the munin config that should have been fixed... let me know if the issue remains after (or if) you've applied them, and a restart to munin-node (/etc/init.d/munin-node restart).  Takes up to 5 minutes for the cron job to fire and recollect data.  
sr. member
Activity: 435
Merit: 250
Answering myself: gputemp was returning 'temp' and not 'temp.value', although I have no idea why it was wrong. :)
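
For anyone hitting the same warning: Munin expects each data line from a plugin's fetch to look like `<field>.value <number>`, so a bare `temp 72.50` is read as no data for the label temp. A minimal Python sketch of producing that format is below. The function name munin_fetch_lines is made up for illustration, and this is not the actual gputemp plugin from BAMT.

```python
from typing import Mapping


def munin_fetch_lines(readings: Mapping[str, float]) -> str:
    """Render readings as the data lines Munin expects from a plugin's fetch.

    Hypothetical helper: each field becomes '<field>.value <number>'.
    """
    return "\n".join(
        f"{field}.value {value:.2f}" for field, value in readings.items()
    )
```

Calling munin_fetch_lines({"temp": 72.5}) yields the line temp.value 72.50, which is the form the fixed plugin output should take.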
sr. member
Activity: 435
Merit: 250
Amazing stuff.

As far as I can see, the only thing that is not working properly (for me, at least) is part of the munin update (and the graph is empty, too, of course):

Code:
2011/07/25 11:47:08 [WARNING] Service gputemp0 on localhost.localdomain/127.0.0.1:4949 returned no data for label temp
2011/07/25 11:47:08 [WARNING] 1 lines had errors while 0 lines were correct in data from 'fetch gputemp1' on localhost.localdomain/127.0.0.1:4949

Which is weird, because telnetting to 4949 works fine:

Code:
fetch gputemp0
temp 72.50

Any ideas?

Also: donated. ;)
hero member
Activity: 616
Merit: 506
Fix #8 makes gpumon even better...



You can add pool info for bitclockers, btcguild, deepbit or slush by adding your API key to bamt.conf in the settings section, like this:

Code:
settings:
  #purely cosmetic, used in alerts, etc
  miner_id: myminer
  miner_loc: my mom's basement

  apikey_bitclockers: XXXXXXXXXXXXXXXXXXXXXXXXXXX
  apikey_deepbit: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  apikey_btcguild: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  apikey_slush: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  ...

Any or all are optional.  If there are other pools you would like to have, just let me know.

Also, this version of gpumon will let you press a number to be attached to that #'s miner session (press Ctrl-A then d to return to gpumon).  Press R to restart all your miners, f to run the auto fixer, c to edit bamt.conf, p to edit pools.

enjoy.