Topic: Nagios plugin for monitoring GPU temperature and fan speed? (Read 7700 times)

newbie
Activity: 34
Merit: 0
I may throw some BTC at a person who could write a nagios plugin that would monitor the temperature using aticonfig utility. Anyone else interested in getting such a thing?

Send me a PM. I've already got the GPU temps into snmpd on Linux, and getting that into OpenNMS or Nagios would be pretty straightforward.
sr. member
Activity: 360
Merit: 250
This is working for me for the GPU temps. You'll need to set up sudoers correctly and change instances of "syadasti" to whatever userid generally runs your miners.

Haven't yet seen if I can graph off the perfdata, but that's next.

Code:
#!/bin/bash
# Nagios check: lists every GPU temperature and warns above 94 C.

export DISPLAY=:0
export LD_LIBRARY_PATH=/opt/ati-stream-sdk-v2.3-lnx64/lib/x86_64/
export ATISTREAMSDKROOT=/opt/ati-stream-sdk-v2.3-lnx64
export GPU_USE_SYNC_OBJECTS=1

exit_status=0
serviceoutput=
serviceperfdata=
longserviceoutput=
templist=

# The adapter number is whatever precedes the first "." on each adapter line.
for f in $(sudo -u syadasti aticonfig --list-adapters | grep : | tr '*' ' ' | sed 's/\..*//')
do
    out=$(sudo -u syadasti aticonfig --adapter="$f" --odgt)
    # $out is left unquoted on purpose: echo joins it into one line for sed.
    temp=$(echo $out | grep Temp | sed -e 's/^.*- .* - //' | sed -e 's/ C.*//')
    templist="$templist $temp"
    longserviceoutput="${longserviceoutput}#${f} $temp C;"
    serviceperfdata="${serviceperfdata}#${f}=${temp};"
    # Integer part only, since [ -gt ] cannot compare decimals.
    tempint=$(echo "$temp" | sed 's/\..*//')
    # Default to 0 so a missing reading does not break the integer test.
    if [ "${tempint:-0}" -gt 94 ]
    then
        exit_status=1
    fi
done

if [ $exit_status -ne 0 ]
then
    serviceoutput="GPU TEMP WARNING"
else
    serviceoutput="GPU TEMP OK"
fi
serviceoutput="$serviceoutput - Temps: / $templist;"

# First line: summary. Then one line per adapter, a pipe, and the perfdata.
# The "#" markers become newlines (\n in the replacement needs GNU sed).
echo $serviceoutput \|
echo -n $longserviceoutput \| | sed 's/^#//' | sed 's/#/\n/g'
echo $serviceperfdata | sed 's/;$//' | sed 's/^#/ /' | sed 's/#/\n/g'

exit $exit_status
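
The script above only ever exits with 0 or 1. Nagios plugins conventionally return 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN, so a two-threshold variant might look roughly like the sketch below. The function name check_temp, the label gpu0 and the 85/95 thresholds are made up for illustration, and it takes an already-parsed temperature instead of calling aticonfig itself.

```shell
#!/bin/bash
# Sketch only: check_temp, the label and the thresholds are illustrative.

# check_temp LABEL TEMP WARN CRIT
# Prints one Nagios status line with perfdata and returns the matching
# plugin exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_temp() {
    local label="$1" temp="$2" warn="$3" crit="$4"
    local state=OK code=0 whole

    # Reject anything that is not a plain decimal number such as 75 or 75.40.
    case "$temp" in
        ''|*[!0-9.]*|*.*.*|.*|*.)
            echo "GPU TEMP UNKNOWN - ${label}: no usable reading"
            return 3
            ;;
    esac

    whole=${temp%%.*}    # [ -ge ] compares integers, so drop the fraction

    if [ "$whole" -ge "$crit" ]; then
        state=CRITICAL
        code=2
    elif [ "$whole" -ge "$warn" ]; then
        state=WARNING
        code=1
    fi

    # Text before the pipe is for humans; perfdata is label=value;warn;crit.
    echo "GPU TEMP ${state} - ${label}: ${temp} C | ${label}=${temp};${warn};${crit}"
    return "$code"
}

# Example call with a made-up reading.
check_temp gpu0 75.40 85 95
```

In a real plugin the call would be followed directly by `exit $?` so that Nagios sees the return code.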
sr. member
Activity: 434
Merit: 251
Every saint has a past. Every sinner has a future.
I will give munin a try, however I would prefer nagios since I use it to monitor my whole network (which is pretty big).
I've only just this minute looked at Nagios (I've used Munin before, however), but it looks to me as if you could fairly easily hack comboy's script. If I read the documentation correctly, Nagios wants a plugin that simply spits out something like "GPU Temperature: 75.4". You could do that with comboy's script by removing everything except the else-block. (And maybe a bit of ruby-hackery to pretty-print the output for Nagios/human-consumption).

...but I know nothing about Nagios, YMMV, IANAL, consult a qualified physician before commencing exercise, do not taunt Happy Fun Ball, etc.


Thanks for that, LMGTFY. I never looked at writing Nagios plugins, I may give it a shot!
hero member
Activity: 644
Merit: 503
I will give munin a try, however I would prefer nagios since I use it to monitor my whole network (which is pretty big).
I've only just this minute looked at Nagios (I've used Munin before, however), but it looks to me as if you could fairly easily hack comboy's script. If I read the documentation correctly, Nagios wants a plugin that simply spits out something like "GPU Temperature: 75.4". You could do that with comboy's script by removing everything except the else-block. (And maybe a bit of ruby-hackery to pretty-print the output for Nagios/human-consumption).

...but I know nothing about Nagios, YMMV, IANAL, consult a qualified physician before commencing exercise, do not taunt Happy Fun Ball, etc.
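
For what it's worth, the "spit out one line" idea could be sketched in shell along these lines. The function name parse_odgt is made up, and the sample text is only an assumed layout of the aticonfig --odgt output, inferred from the patterns used in the scripts in this thread, so check it against what your driver actually prints. Nagios also looks at the plugin's exit status, which this sketch does not deal with.

```shell
#!/bin/bash
# Sketch only: parse_odgt is a made-up name, and the sample text is an
# assumed layout inferred from the patterns used in this thread.

# Reads `aticonfig --odgt` text on stdin and prints one line per sensor.
parse_odgt() {
    sed -n 's/^.*Temperature - \([0-9.][0-9.]*\) C.*$/GPU Temperature: \1/p'
}

# Illustrative input; "Example Card" is a placeholder, not a real model name.
sample='Adapter 0 - Example Card
            Sensor 0: Temperature - 75.40 C'

printf '%s\n' "$sample" | parse_odgt

# On a real machine the input would come from the tool itself, for example:
#   DISPLAY=:0 aticonfig --odgt --adapter=all | parse_odgt
```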
newbie
Activity: 20
Merit: 0
What's your average temperature while mining? (GPU and CPU)
sr. member
Activity: 434
Merit: 251
Every saint has a past. Every sinner has a future.
I will give munin a try, however I would prefer nagios since I use it to monitor my whole network (which is pretty big).
sr. member
Activity: 247
Merit: 252
I prefer Munin as it works more out of the box, but I'll give it a try :D

You would need to tune it for yourself but here's what I have for munin:


Code: (temperature)
#!/usr/bin/ruby

if ARGV.first == "config"
  puts "graph_title GPU temperatures"
  #puts "graph_args --base 1000 -r --lower-limit 0"
  puts "graph_category gpu"
  puts "graph_period second"
  puts "graph_vlabel temperature"
  puts "card0.label card0"
  puts "card1.label card1"
  puts "card2.label card2"
else
  out = `DISPLAY=":3" aticonfig --odgt --adapter=all`
  #puts out

  out.split("\n\n").each do |card|
    adapter = card[/Adapter ([0-9.]+)/i, 1]
    temp = card[/Temperature - ([0-9.]+) C/i, 1]
    # Skip chunks that hold no reading instead of failing on a nil match.
    next unless adapter && temp
    puts "card#{adapter}.value #{temp}"
  end
end


Code: (fan speeds)
#!/usr/bin/ruby

if ARGV.first == "config"
  puts "graph_title GPU fan speeds"
  #puts "graph_args --base 1000 -r --lower-limit 0"
  puts "graph_category gpu"
  puts "graph_period second"
  puts "graph_vlabel speed"
  puts "fan0.label card0"
  puts "fan2.label card2"
else
  %w{0 2}.each do |id|
    out = `DISPLAY=":3.#{id}" aticonfig --pplib-cmd 'get fanspeed 0'`
    speed = out[/Speed: (\d+)%/, 1]
    # Report "U" (unknown to Munin) when the speed cannot be read.
    puts "fan#{id}.value #{speed || 'U'}"
  end
end

As you may have guessed, this one is from a machine that has 1x5970 and 1x5870. I was too lazy to do it using plugin configuration.
hero member
Activity: 489
Merit: 505
I prefer Munin as it works more out of the box, but I'll give it a try :D
sr. member
Activity: 434
Merit: 251
Every saint has a past. Every sinner has a future.
I may throw some BTC at a person who could write a nagios plugin that would monitor the temperature using aticonfig utility. Anyone else interested in getting such a thing?