check_smart 6.18.0: sudo command path fix, added NVME I/O error detection


Listed in: Monitoring, Perl, Coding, Hardware, Icinga, Nagios


A new version of check_smart, a monitoring plugin to monitor hard drives, solid state drives and NVMe drives, is available.

The newest release, 6.18.0, contains a fix and an enhancement, each reported by a different user.

Fix sudo command path

The bug likely only affected FreeBSD (and possibly other BSD derivatives). Inside the Perl code, the smartctl command is executed with an absolute path, which is looked up in the directories listed in the @sys_path array:

my @sys_path = qw(/usr/bin /bin /usr/sbin /sbin /usr/local/bin /usr/local/sbin);
my $smart_command = undef;
foreach my $path (@sys_path) {
    if (-x "$path/smartctl") {
        $smart_command = "sudo $path/smartctl";
        last;
    }
}

As the code shows, the smartctl command is prefixed with its path (once found), but the sudo command is not. The plugin therefore relied on the PATH environment variable to locate sudo, which could fail with a sudo command not found error.

The new approach is to prefix both commands with the path and merge both into one command ($smart_command).
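The idea can be sketched as follows. This is a minimal illustration, not the code that ships in 6.18.0: the helpers find_executable and build_smart_command are hypothetical names, and looking up sudo and smartctl independently (so they may live in different directories) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: return the full path of the first executable called
# $name found in the directories @$dirs, or undef if there is none.
sub find_executable {
    my ($name, $dirs) = @_;
    foreach my $dir (@$dirs) {
        return "$dir/$name" if -x "$dir/$name";
    }
    return;
}

# Hypothetical helper: build one command string in which both sudo and
# smartctl carry an absolute path. Returns undef if either one is missing.
sub build_smart_command {
    my ($dirs) = @_;
    my $smartctl = find_executable('smartctl', $dirs) or return;
    my $sudo     = find_executable('sudo', $dirs)     or return;
    return "$sudo $smartctl";
}

my @sys_path = qw(/usr/bin /bin /usr/sbin /sbin /usr/local/bin /usr/local/sbin);
my $smart_command = build_smart_command(\@sys_path);
print defined $smart_command ? "$smart_command\n" : "smartctl or sudo not found\n";
```

With this approach the plugin no longer depends on the PATH of the user running the check, which is often minimal when a monitoring agent executes the plugin.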

This bug was reported by Alexey Zonov.

Detect NVME Input/Output Errors

The check_smart.pl plugin relies on the output of the smartctl command, which it runs in the background. This is nothing new. When smartctl is unable to communicate with the block device, strange errors show up in the output.

The plugin checks the output for specific lines, including the "SMART overall-health self-assessment test result" line (for ATA compatible devices). When the relevant health line is not found, the plugin exits with the status UNKNOWN (exit code 3) and with the following output:

$ /usr/lib/nagios/plugins/check_smart.pl -d /dev/nvme1n1 -i nvme
UNKNOWN: Drive  S/N :  No health status line found, |

This does not necessarily mean that the selected drive is dead. It just means that smartctl's output did not contain any health line. This can also happen if a wrong megaraid,N number was selected, pointing to the RAID controller itself instead of a drive. Hence the decision to exit with an UNKNOWN state.

NVMe drives, however, produce an additional line of output when smartctl is executed on a defective drive:
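The health line logic described above can be sketched roughly like this. It is a simplified illustration and not the plugin's actual code: parse_health_status and health_to_state are hypothetical names, and treating PASSED as OK and any other value as CRITICAL is an assumption. Only the UNKNOWN case for a missing health line is taken from the behaviour described here.

```perl
use strict;
use warnings;

# Standard monitoring plugin exit codes.
my %EXIT = (OK => 0, WARNING => 1, CRITICAL => 2, UNKNOWN => 3);

# Hypothetical helper: return the value of the overall-health line,
# or undef when smartctl's output contains no such line.
sub parse_health_status {
    my ($output) = @_;
    return $1 if $output =~ /^SMART overall-health self-assessment test result:\s*(\S+)/m;
    return;
}

# Hypothetical mapping from the health value to a plugin state.
sub health_to_state {
    my ($health) = @_;
    return 'UNKNOWN' unless defined $health;          # no health line found
    return $health eq 'PASSED' ? 'OK' : 'CRITICAL';   # assumption, see above
}

# Hand-written sample input, not captured from a real drive.
my $sample = "SMART overall-health self-assessment test result: PASSED\n";
my $health = parse_health_status($sample);
my $state  = health_to_state($health);
print "$state (exit code $EXIT{$state})\n";
```

Returning UNKNOWN rather than CRITICAL for a missing line keeps a misconfigured check (for example a wrong device or interface argument) from being reported as a failed drive.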

$ sudo smartctl -a /dev/nvme1n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.14.0-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Input/output error

This output is actually very helpful. It is very likely that this block device (/dev/nvme1n1) indeed has a major problem and is dead.

We can therefore assume with high confidence that check_smart.pl should alert with a CRITICAL message when this line is detected in smartctl's output.
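A minimal sketch of such a detection is shown below. The helper names nvme_io_error_line and check_nvme_io_error as well as the regular expression are assumptions made for illustration, so the pattern used by the plugin itself may differ. The sample input is the smartctl output shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the smartctl line reporting that the NVMe
# controller could not be read due to an I/O error, or undef if absent.
sub nvme_io_error_line {
    my ($output) = @_;
    foreach my $line (split /\n/, $output) {
        return $line
            if $line =~ /^Read NVMe Identify Controller failed:.*Input\/output error/;
    }
    return;
}

# Hypothetical helper: return (state, exit code, message) when the I/O error
# line is present, or an empty list so the caller can continue other checks.
sub check_nvme_io_error {
    my ($output) = @_;
    my $line = nvme_io_error_line($output);
    return unless defined $line;
    return ('CRITICAL', 2, $line);
}

my $output = <<'END';
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.14.0-2-pve] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

Read NVMe Identify Controller failed: NVME_IOCTL_ADMIN_CMD: Input/output error
END

if (my ($state, $code, $line) = check_nvme_io_error($output)) {
    print "$state: $line\n";
}
```

Running this check before the health line lookup means a dead NVMe drive is reported as CRITICAL instead of falling through to the generic UNKNOWN result.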

This kind-of-bug-but-also-feature-request was reported by Robert Scheck in GitHub issue 110.  

Enjoy

And with that: Enjoy the new release. As always, if you encounter bugs, please report them on the GitHub repo.

