Topic: Linux Nvidia Monitoring Script for checkmk nagios

I have been searching for a while for a simple way to monitor Nvidia GPU temps (and possibly more) with Checkmk/Nagios.
I found lots of different methods, but they were either way too complicated, not working at all, or too hard for me to adapt (I am primarily a Windows sysadmin with medium Linux skills: good at PowerShell, medium or less at everything else).
I am sure there are better, simpler and more elegant ways to achieve what I wanted, but this is what I came up with.
The agent actually does detect the core temps by itself, but for some reason they are not picked up on a host rescan.

Checkmk is a free and complete monitoring solution that I use both professionally and privately.
https://checkmk.com/cms_install_packages.html

This script can be put in the Checkmk agent's local folder (by default /usr/lib/check_mk_agent/local), where the agent will execute it. The file must be executable.
The only thing that needs to be adapted is your preferred temp range.
The output is in the format Checkmk requires; each line starts with a status code:
0 normal
1 warning
2 crit
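
As a minimal sketch of how one of those status codes gets picked for a given temperature: the function name temp_state and the WARN/CRIT variables are made up for illustration (they are not part of Checkmk), and the threshold values are just the ones used in the script further down.

```shell
#!/bin/sh
# Hypothetical thresholds in degrees C; adjust to your preferred range.
WARN=71
CRIT=73

# temp_state TEMP -> prints the status code for a temperature:
# 0 = normal, 1 = warning, 2 = crit
temp_state() {
    if [ "$1" -ge "$CRIT" ]; then
        echo 2
    elif [ "$1" -ge "$WARN" ]; then
        echo 1
    else
        echo 0
    fi
}

# One local-check line: <status> <service name> <metrics or "-"> <text>
echo "$(temp_state 65) GPU0 - TEMP 65C"
```

Checking the highest threshold first means a single comparison per level is enough, and readings above the last range cannot fall through unnoticed.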

The result looks like this (everything besides the GPU services is built in, of course; the Miner process comes from a custom discovery rule)
https://i.imgur.com/noqwFz9.png

Maybe someone finds this useful.
Please take, copy, improve, whatever...

# Code
#!/bin/bash
# Checkmk local check: reports one service per Nvidia GPU.
# Adapt these temperature thresholds (degrees C) to your preferred range.
ok_min=10
warn=71
crit=73

# nvidia-smi prints one GPU index per line
for index in $(nvidia-smi --query-gpu=index --format=csv,noheader)
do
    gpu_temp=$(nvidia-smi -i "$index" --query-gpu=temperature.gpu --format=csv,noheader)
    gpu_fan=$(nvidia-smi -i "$index" --query-gpu=fan.speed --format=csv,noheader)
    gpu_name=$(nvidia-smi -i "$index" --query-gpu=gpu_name --format=csv,noheader)
    gpu_power=$(nvidia-smi -i "$index" --query-gpu=power.draw --format=csv,noheader)

    # Line format: <status> <service name> <metrics or "-"> <detail text>
    details="$gpu_name TEMP ${gpu_temp}C - FAN $gpu_fan - $gpu_power"

    if ! [[ $gpu_temp =~ ^[0-9]+$ ]] || (( gpu_temp < ok_min )); then
        # No plausible reading (sensor error or value below ok_min): treat as crit
        echo "2 GPU$index - UNKNOWN"
    elif (( gpu_temp >= crit )); then
        # Also covers temperatures above 80, so the reading is still shown
        echo "2 GPU$index - $details"
    elif (( gpu_temp >= warn )); then
        echo "1 GPU$index - $details"
    else
        echo "0 GPU$index - $details"
    fi
done

#Code
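
A possible variation on the script above, as a rough sketch rather than a tested replacement: query all fields with a single nvidia-smi call per run and pass the temperature to Checkmk as a metric so it can be graphed. The function name gpu_line, the metric name temp and the thresholds are made up for illustration; the name=value;warn;crit metric syntax is the one Checkmk uses for local checks. It assumes GPU names contain no commas.

```shell
#!/bin/sh
WARN=71   # hypothetical thresholds in degrees C
CRIT=73

# gpu_line "index, name, temp, fan, power" -> prints one local-check line.
# The argument is one CSV row as printed by nvidia-smi with
# --format=csv,noheader,nounits (fields separated by ", ").
gpu_line() {
    IFS=, read -r index name temp fan power <<EOF
$1
EOF
    # Drop the space that follows each comma
    name=${name# }; temp=${temp# }; fan=${fan# }; power=${power# }

    # Anything that is not a plain number (e.g. "[N/A]") is reported as crit
    case $temp in
        ''|*[!0-9]*) echo "2 GPU$index - UNKNOWN"; return ;;
    esac

    if [ "$temp" -ge "$CRIT" ]; then state=2
    elif [ "$temp" -ge "$WARN" ]; then state=1
    else state=0
    fi

    # Third field carries the metric instead of "-"
    echo "$state GPU$index temp=$temp;$WARN;$CRIT $name TEMP ${temp}C - FAN $fan % - $power W"
}

# Only query the driver if nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,temperature.gpu,fan.speed,power.draw \
               --format=csv,noheader,nounits |
        while IFS= read -r row; do gpu_line "$row"; done
fi
```

Keeping the parsing in a function makes it easy to try with a hand-written row, for example gpu_line "0, GeForce GTX 1080, 65, 45, 120.50", on a machine without a GPU.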
