Invalid performance data from check_disk on CentOS results in empty monitoring graphs

Written by - 0 comments

Published on - Listed in Linux Monitoring Icinga Nagios


Today I was stumbling across a very interesting problem: CentOS 7 machines would not show any disk usage graphs in our Grafana dashboard (which uses InfluxDB as data source and Icinga 2 as monitoring system).

Let's activate troubleshooting mode!

0 value shown in Icingaweb's performance data

The first thought was that this particular server may have a problem reporting back performance data. Could it be a buggy NRPE configuration? Or maybe even a bug in check_disk on this particular server?

But by looking at other CentOS 7 machines in Icingaweb 2, another discovery was made: Actually every CentOS 7 machine did not have any disk graphs!

Looking closer in Icingaweb, the value of the disk usage seems to be 0. On every partition. On every CentOS 7 server.

When looking at an Ubuntu Linux server, the values clearly show up:

Where does this 0.00 value on the CentOS machines come from? Is it the plugin check_disk itself? Or is NRPE the reason the data is not correctly landing in Icinga?

Manually running check_disk

By executing the nagios-plugin check_disk on the server, did not seem to trigger any errors. Certainly no 0.00 values, neither in the output nor in the performance data:

[nagios@centos7 ~]# /usr/lib/nagios/plugins/check_disk -w 15% -c 5% -W 10% -K 5% -p /
DISK OK - free space: / 8640 MiB (76.21% inode=95%);| /=2696MiB;10171;11368;0;11967

Even by executing this exact check command through NRPE seemed to work:

root@icinga:~# /usr/lib/nagios/plugins/check_nrpe -H nzz-wwapp01-d.nzzmg.local -c check_disk -a "15%" "5%" "/" "15%" "5%"
DISK OK - free space: / 8640 MiB (76.21% inode=95%);| /=2696MiB;10171;11368;0;11967

So both ways to execute the plugin (locally and via NRPE) seem to work. Maybe Icinga has a problem parsing the performance data?

Invalid perfdata!

A look into the Icinga logs quickly turned up a WARNING entry:

[2021-02-17 06:27:56 +0100] warning/InfluxdbWriter: Ignoring invalid perfdata for checkable 'centos7!Diskspace /' and command 'nrpe' with value: /=2696MiB;10171;11368;0;11967
Context:
        (0) Processing check result for 'centos7!Diskspace /'

So obviously Icinga somehow thinks that the received performance data is invalid and refuses to write the values into InfluxDB. But why?

Note: At this point I still didn't see the issue. Even for my trained monitoring eye using 15+ years of nagios/monitoring-plugins. ;-)

A research on the Internet led to a discussion in the Icinga forums and from there to the actual source of the problem: A bug in nagios-plugins!

Issue #518 in the nagios-plugins repositories describes a regression bug introduced by nagios-plugins 2.3.0. In the release notes of 2.3.0 the following change was applied to check_disk:

2.3.0 2019-12-04
Note: nagios-plugins 2.3.0 introduced a backward-compatibility issue for performance data graphs produced by check_icmp.
Anyone upgrading from an older version of nagios-plugins should upgrade directly to >=2.3.1.
    ENHANCEMENTS
    Added a plugin "remove_perfdata" to remove perfdata from another plugin's output
    Added a plugin "check_ssl_validity" which verifies a certificate through CRL or OCSP
    Added directory plugins-python containing three Python plugins
    check_apt: Add --only-critical switch
    check_apt: Add --packages-warning (Marc Sánchez)
    check_disk: Add --combined-thresholds to alert on a percentage AND raw units.
    check_disk: Add --inode-perfdata to get extra performance data from inode statistics
    check_disk: Add support for base-10 units kB, MB, GB, TB; rename base-2 units to KiB, MiB, GiB, TiB

[...]

This change caused the performance output units to show MiB instead of MB. However "MiB" is not a valid UOM (Unit of measurement), even described in the Nagios plugin developer guidelines:

UOM (unit of measurement) is a string of zero or more characters, NOT including numbers, semicolons, or quotes. Some examples:
    no unit specified - assume a number (int or float) of things (eg, users, processes, load averages)
    s - seconds (also us, ms)
    % - percentage
    B - bytes (also KB, MB, TB)
    c - a continous counter (such as bytes transmitted on an interface)

The regression was fixed 3 months later in nagios-plugins 2.3.2:

2.3.2 2020-02-20
    FIXES
    build: Fix broken builds on some systems, including Homebrew (#508)
    check_disk: Change unit calculations to always use binary units for backward compatibility (#518)
[...]

It's fixed, all good, right?

Even though the bug was fixed in the upstream project "nagios-plugins" in 2.3.2, the bug is still around in the current nagios-plugins-disk package, installed from the EPEL 7 repositories. Interesting however is the fact that the package version is showing up as 2.3.3:

[root@centos7 ~]# yum info nagios-plugins-disk
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.init7.net
 * epel: mirror.init7.net
 * extras: mirror.init7.net
 * updates: mirror.init7.net
Installed Packages
Name        : nagios-plugins-disk
Arch        : x86_64
Version     : 2.3.3
Release     : 2.el7
Size        : 86 k
Repo        : installed
From repo   : epel
Summary     : Nagios Plugin - check_disk
URL         : https://www.nagios-plugins.org/
License     : GPLv2+
Description : Provides check_disk support for Nagios.

A bug report was opened in the Fedora EPEL project: Bugzilla 1929568 mentions these findings documented here in this article. Hopefully this bug will shortly be addressed and a new package in the EPEL repositories released.

However further analysis with the latest available upstream release nagios-plugins-2.3.3 reveals that this regression is still happening, although marked as fixed in the changelog:

# ./check_disk -V
check_disk v2.3.3 (nagios-plugins 2.3.3)

# ./check_disk -w 15% -c 5% -W 10% -K 5% -p /
DISK OK - free space: / 8447 MiB (89.03% inode=95%);| /=1040MiB;8512;9514;0;10015

Therefore it's actually still an upstream bug and issue #518 should be re-opened again.

A workaround, please?

Fortunately an immediate workaround is available: By using the -m parameter, the values will be shown in MegaBytes:

[root@centos7 ~]# /usr/lib/nagios/plugins/check_disk -w 15% -c 5% -W 10% -K 5% -p / -m
DISK OK - free space: / 8640 MB (76.21% inode=95%);| /=2696MB;10171;11368;0;1196

This corrects the UOM in the performance data and data can be parsed and written into InfluxDB again.

For NRPE setups, the check command needs to be adjusted accordingly:

[root@centos7 ~]# grep disk /etc/nagios/nrpe_local.cfg
command[check_disk]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -W $ARG4$ -K $ARG5$ -p $ARG3$ -m

And what about monitoring-plugins?

A couple of years ago, the nagios-plugins project split into two projects: nagios-plugins and monitoring-plugins. Some distributions switched their package sources to monitoring-plugins (Debian and Ubuntu for example) but others remained at the "original" nagios-plugins (such as CentOS). The regression bug which caused the change in the UOM was only introduced in nagios-plugins, not in monitoring-plugins. However in EPEL, there are no monitoring-plugins packages available for installation.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder