Zabbix unreachable hosts

The problem

Every now and then, a host in your Zabbix system will become unavailable. This results in notifications like this:

Trigger: Zabbix agent on myhost is unreachable for 5 minutes
Trigger status: PROBLEM

And logs like this:

1461:20140808:074300.517 Zabbix agent item "vfs.fs.size[/home/deploy/sites/mc/shared/voip,free]" on host "myhost" failed: first network error, wait for 15 seconds
1466:20140808:074345.134 temporarily disabling Zabbix agent checks on host "myhost": host unavailable

What could be the cause? No, it’s not because the host could not be reached. That would be too easy.

A quote from the Zabbix documentation:

A host is treated as unreachable after a failed agent check (network error, timeout).

After the UnreachablePeriod ends and the host has not reappeared, the host is treated as unavailable.

Let me decipher this for you: when a single agent check keeps failing with a network error or timeout, the whole host is considered “unavailable”, and its agent checks are disabled.

An example of such a check is vfs.fs.size on a network share that has gone stale. You lose all data and all monitoring of the host until you fix that single check.
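Finding that single check is the first step. The "first network error" log line shown earlier names both the item and the host, so a small shell helper can tally them. This is only a sketch: the function name failing_items is made up for illustration, and it assumes the server log keeps the line format shown above.

```shell
# Hypothetical helper: count "first network error" events per host and item.
# Reads the log files given as arguments, or stdin when none are given.
failing_items() {
    grep 'first network error' "$@" \
        | sed -E 's/.*item "(.*)" on host "([^"]+)" failed.*/\2 \1/' \
        | sort | uniq -c | sort -rn
}
```

It could be called as failing_items /var/log/zabbix/zabbix_server.log, where the path is an assumption; use whatever LogFile points to in your server configuration. The items at the top of the list are the ones worth replacing.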

Bad design. Bad. Really bad. (I hope some Zabbix developer will read this)

The solution

There’s none, actually. A workaround is to track down such checks and replace them with more reliable UserParameters. In the network share example, something like this could be used instead of vfs.fs.size:

set +m -o pipefail; timeout -s 9 3 df -h | grep /my/mount/point | awk '{print $5}' | grep '%' | tr -d '%' || echo 100

It should print the “Used” percentage within 3 seconds, or report a full partition (100) if df does not answer in time, so you’ll get an alert anyway. Not a real solution, but at least it’s something.
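If several mount points need the same treatment, the one-liner could be wrapped in a small function. A minimal sketch under a few assumptions: the name used_pct is hypothetical, it uses df -P instead of df -h so that every filesystem stays on one line, and it matches the mount point exactly instead of by substring.

```shell
# Hypothetical helper: print the used percentage of a mount point as a bare
# integer, or 100 if df gives no usable answer within the time limit.
# Usage: used_pct /my/mount/point [seconds]
used_pct() {
    mount=$1
    limit=${2:-3}
    # Kill df with SIGKILL after $limit seconds (e.g. on a stale share).
    # With -P, column 5 is the capacity and the last column is the mount point.
    out=$(timeout -s KILL "$limit" df -P 2>/dev/null \
        | awk -v m="$mount" '$NF == m { sub(/%/, "", $5); print $5; exit }')
    # Anything empty or non-numeric is reported as a full partition.
    case $out in
        ''|*[!0-9]*) echo 100 ;;
        *) echo "$out" ;;
    esac
}
```

As with the original command, a mount point that is missing or hanging shows up as 100, so the alert still fires while the rest of the host keeps being monitored.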