Nvidia + Zabbix
What is it
Zabbix is an open-source monitoring tool for diverse IT components, including networks, servers, virtual machines (VMs) and cloud services. It collects metrics such as network utilization, CPU load and disk space consumption. Monitoring can be configured with XML-based templates that define the elements to monitor. Zabbix monitors Linux, Hewlett Packard Unix (HP-UX), Mac OS X, Solaris and other operating systems (OSes); Windows monitoring, however, is only possible through agents.
Requirements
We have a server with two Nvidia GPUs that we use for neural networks. We want to monitor how heavily the GPUs are loaded.
How to do it
First of all, we have to install the packages and configure how the statistics are collected.
Installing gpustat
The stock Nvidia software is awful to work with, so we install gpustat. There are two ways to install it; choose either one.
- First way:
sudo apt install gpustat
- Second way:
sudo pip install gpustat
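Configure the cron task
Zabbix enforces timeouts on item checks, so we collect the data with a cron job that writes it to a file, and then schedule that job to run every minute. The entry below assumes gpustat is at /usr/local/bin/gpustat; if it is installed elsewhere, adjust the path (check with `which gpustat`).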
crontab -l
* * * * * /usr/local/bin/gpustat --json > /storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts/log/gpu_all.log
After that, we have a file with JSON output:
{
"hostname": "serverwithgpu",
"query_time": "2020-08-25T11:30:01.756647",
"gpus": [
{
"index": 0,
"uuid": "GPU-b25d4db2-6730-ed49-394d-27e72110a700",
"name": "GeForce RTX 2080 Ti",
"temperature.gpu": 46,
"fan.speed": 37,
"utilization.gpu": 0,
"power.draw": 56,
"enforced.power.limit": 250,
"memory.used": 9368,
"memory.total": 11019,
"processes": [
]
},
{
"index": 1,
"uuid": "GPU-e7f907fc-4d00-4f4f-dca3-1663ff9616d8",
"name": "GeForce RTX 2080 Ti",
"temperature.gpu": 45,
"fan.speed": 35,
"utilization.gpu": 0,
"power.draw": 64,
"enforced.power.limit": 250,
"memory.used": 8058,
"memory.total": 11019,
"processes": [
]
}
]
}
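To illustrate how this report can be consumed, here is a minimal Python sketch that reduces each GPU entry to a few values. The function name `summarize_gpus` and the derived `memory_pct` field are assumptions made for this example; they are not part of gpustat or Zabbix. The sample is a trimmed copy of the output above.

```python
import json


def summarize_gpus(report: dict) -> list:
    """Return one small summary dict per GPU from a parsed gpustat --json report."""
    summary = []
    for gpu in report["gpus"]:
        total = gpu["memory.total"]
        summary.append({
            "index": gpu["index"],
            "name": gpu["name"],
            "utilization": gpu["utilization.gpu"],
            "temperature": gpu["temperature.gpu"],
            # Memory usage as a percentage of total, rounded to one decimal place.
            "memory_pct": round(100 * gpu["memory.used"] / total, 1) if total else 0.0,
        })
    return summary


# Trimmed copy of the sample report shown above (first GPU only).
sample = json.loads("""
{"hostname": "serverwithgpu", "query_time": "2020-08-25T11:30:01.756647",
 "gpus": [{"index": 0, "name": "GeForce RTX 2080 Ti", "temperature.gpu": 46,
           "utilization.gpu": 0, "memory.used": 9368, "memory.total": 11019}]}
""")

for row in summarize_gpus(sample):
    print(row)
```

Note that `utilization.gpu` is 0 in the sample while a large share of the memory is in use, so it may be worth tracking both values.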
Some changes
But I wanted to change one thing: the time format in the "query_time" key. I wrote a Python script which converts the ISO time to a UNIX timestamp.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import sys
from datetime import datetime

if __name__ == "__main__":
    # Read the gpustat JSON report from stdin.
    data = json.load(sys.stdin)
    d = data['query_time']
    # Replace the ISO 8601 string with a UNIX timestamp in whole seconds.
    data['query_time'] = round(datetime.strptime(d, '%Y-%m-%dT%H:%M:%S.%f').timestamp())
    print(json.dumps(data))
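One possible variation on the conversion, shown as a sketch rather than a replacement for the script above: Python's `datetime.isoformat()` omits the fractional part when the microseconds are exactly zero, and a format string containing `.%f` would not match such a value. `datetime.fromisoformat` (Python 3.7+) accepts both forms. The helper name `iso_to_unix` is an assumption made for this example.

```python
from datetime import datetime


def iso_to_unix(value: str) -> int:
    """Convert an ISO 8601 string, such as gpustat's query_time, to a UNIX timestamp."""
    # fromisoformat accepts values with or without fractional seconds.
    # A value without a UTC offset is treated as local time by .timestamp(),
    # which matches the behaviour of the strptime-based conversion.
    return round(datetime.fromisoformat(value).timestamp())


print(iso_to_unix("2020-08-25T11:30:01.756647"))
print(iso_to_unix("2020-08-25T11:30:01"))
```

The printed numbers depend on the time zone of the machine, because the input carries no UTC offset.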
I call the Python script and pass it the data through a pipe. I put this into a shell script because the command line in cron was getting too long.
#!/bin/sh
export ZSPATH="/storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts"
/usr/local/bin/gpustat --json|${ZSPATH}/gpu.py > ${ZSPATH}/log/gpu_all.log
And I call the shell script from cron.
crontab -l
* * * * * /storage/docker-zabbix-agent/zbx_env/var/lib/zabbix/scripts/crongpy.sh > /dev/null 2>&1
After that, I have the JSON with a numeric query_time value (a UNIX timestamp).
[
{
"hostname": "serverwithgpu",
"query_time": 1599202023,
"gpus": [
{
"index": 0,
"uuid": "GPU-b25d4db2-6730-ed49-394d-27e72110a700",
"name": "GeForce RTX 2080 Ti",
"temperature.gpu": 29,
"fan.speed": 32,
"utilization.gpu": 0,
"power.draw": 51,
"enforced.power.limit": 250,
"memory.used": 0,
"memory.total": 11019,
"processes": []
},
{
"index": 1,
"uuid": "GPU-e7f907fc-4d00-4f4f-dca3-1663ff9616d8",
"name": "GeForce RTX 2080 Ti",
"temperature.gpu": 28,
"fan.speed": 35,
"utilization.gpu": 0,
"power.draw": 31,
"enforced.power.limit": 250,
"memory.used": 0,
"memory.total": 11019,
"processes": []
}
]
}
]
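A side note on how the log file is written: with `>`, the shell truncates gpu_all.log before gpustat has produced any output, so a read that happens in that short window might see an empty or partial file. The sketch below shows one way to avoid this from Python, by writing to a temporary file in the same directory and then renaming it over the target. The helper name `write_atomically` and the 0644 file mode are assumptions made for this example, not part of the original setup.

```python
import os
import tempfile


def write_atomically(path: str, text: str) -> None:
    """Write text to path so that readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must be in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".gpu_all.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text)
        # mkstemp creates the file with mode 0600; make it readable by other
        # users (assumption: the agent runs as a different user).
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        os.unlink(tmp_path)
        raise
```

In gpu.py this could take the place of the final `print`, for example `write_atomically(target_path, json.dumps(data))`, where `target_path` is a hypothetical variable holding the log file location.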
Configure Zabbix Agent
cat /storage/docker-zabbix-agent/zbx_env/etc/zabbix/zabbix_agentd.d/gpusetj.conf
UserParameter=gpuset[*],cat /var/lib/zabbix/scripts/log/gpu_all.log
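The user parameter returns the whole file, so single values still have to be picked out of the JSON afterwards. As a rough illustration of that step, the Python sketch below extracts one metric for one GPU and checks whether the report is older than a chosen age. The function names and the 180-second threshold are assumptions made for this example; how the values are actually extracted on the Zabbix side is not covered here.

```python
import json
import time


def load_report(text: str) -> dict:
    """Parse the log contents; accept a bare object or a one-element list."""
    doc = json.loads(text)
    return doc[0] if isinstance(doc, list) else doc


def get_metric(report: dict, gpu_index: int, key: str):
    """Return the value of key (e.g. "temperature.gpu") for the GPU with gpu_index."""
    for gpu in report["gpus"]:
        if gpu["index"] == gpu_index:
            return gpu[key]
    raise KeyError(f"no GPU with index {gpu_index}")


def is_stale(report: dict, max_age: int = 180, now=None) -> bool:
    """True if query_time (a UNIX timestamp) is more than max_age seconds old."""
    now = time.time() if now is None else now
    return now - report["query_time"] > max_age


# Shortened version of the converted report shown earlier.
text = ('[{"hostname": "serverwithgpu", "query_time": 1599202023, "gpus": ['
        '{"index": 0, "temperature.gpu": 29}, {"index": 1, "temperature.gpu": 28}]}]')
report = load_report(text)
print(get_metric(report, 1, "temperature.gpu"))
print(is_stale(report, now=1599202023 + 60))
```

A staleness check like this is one reason a numeric query_time is convenient: if cron stops running, the file keeps its last contents and only the timestamp reveals that the data is old.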