Zabbix unreachable hosts
The problem
Every now and then, a host in your Zabbix system will become unavailable. This results in notifications like this:
Trigger: Zabbix agent on myhost is unreachable for 5 minutes
Trigger status: PROBLEM
And logs like this:
1461:20140808:074300.517 Zabbix agent item "vfs.fs.size[/home/deploy/sites/mc/shared/voip,free]" on host "myhost" failed: first network error, wait for 15 seconds
1466:20140808:074345.134 temporarily disabling Zabbix agent checks on host "myhost": host unavailable
What could be the cause? No, it’s not that the host could not be reached. That would be too easy.
A quote from Zabbix documentation:
A host is treated as unreachable after a failed agent check (network error, timeout). … After the UnreachablePeriod ends and the host has not reappeared, the host is treated as unavailable.
Let me decipher this for you: when a single check keeps failing, the whole host ends up being considered “unavailable”, and its agent checks are disabled.
An example of such a check could be vfs.fs.size of a network share that has gone stale. You lose all data and all monitoring of the host until you fix that single check.
Bad design. Bad. Really bad. (I hope some Zabbix developer will read this)
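Since one item can take the whole host down, it helps to know which item it was. Here is a minimal sketch that pulls the host and item key out of "first network error" lines. It assumes the server log uses exactly the format of the sample above; the function name failing_items is made up for illustration.

```shell
#!/bin/sh
# failing_items [LOGFILE...]: print "host item-key" for every
# "first network error" line. Reads stdin when no file is given.
# Assumes the log line format shown in the sample above.
failing_items() {
    sed -n 's/.*Zabbix agent item "\(.*\)" on host "\([^"]*\)" failed: first network error.*/\2 \1/p' "$@"
}

# Example (the log path is an assumption, adjust to your setup):
# failing_items /var/log/zabbix/zabbix_server.log | sort | uniq -c | sort -rn
```

Piping the output through sort and uniq -c, as in the commented example, should show which items trip the host most often.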
The solution
There’s none, actually. A workaround is to track down such checks and replace them with more reliable UserParameters. In the network share example, something like this could be used instead of vfs.fs.size:
set +m -o pipefail; timeout -s 9 3 df -h | grep /my/mount/point | awk '{print $5}' | grep '%' | tr -d '%' || echo 100
It outputs the “Used” value as a percentage within 3 seconds (or reports a full partition, so you’ll get an alert anyway). Not a real solution, but at least it’s something.
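The one-liner relies on pipefail, which plain sh does not have. Below is a sketch of the same idea as a small function that works without it. It is a variation, not a drop-in equivalent: the name pused is hypothetical, it assumes GNU timeout is installed, and df -P with a path reports the filesystem containing that path, so an unmounted share would show the parent filesystem instead of 100.

```shell
#!/bin/sh
# pused MOUNTPOINT: print used space in percent (0-100),
# or 100 if df fails, prints nothing usable, or is killed after 3 seconds.
pused() {
    # -P forces the POSIX format: one header line, one line per filesystem,
    # with the capacity (e.g. "42%") in column 5.
    out=$(timeout -s KILL 3 df -P "$1" 2>/dev/null |
          awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    case "$out" in
        ''|*[!0-9]*) echo 100 ;;    # empty or non-numeric: report "full"
        *)           echo "$out" ;;
    esac
}
```

To use it from the agent, the function could live in a script that ends with pused "$1" and is referenced from a UserParameter entry; the key name and script path are yours to choose.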