Monitoring plugin check_smart 6.11 released: NVMe attributes with dots and order output by priority/criticality

Published on October 4th 2021 - Listed in Monitoring Hardware


A new version of check_smart, an open source monitoring plugin to monitor the health of hard drives, solid state drives and NVMe drives, is now available!

Release 6.11 adds two improvements to the plugin.

Handle dots in NVMe attribute names

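This problem was reported in issue 62 on GitHub. Certain NVMe drives report attributes with a dot in the name, such as Warning__Comp._Temperature_Time and Critical_Comp._Temperature_Time in the following output: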

Nvme_0 0 OK: Drive SAMSUNG MZVLB512HAJQ-00000 S/N XXX: no SMART errors detected. |Temperature=25 Available_Spare=100 Available_Spare_Threshold=10 Percentage_Used=22 Data_Units_Read=35774467 Data_Units_Written=280451586 Host_Read_Commands=637677302 Host_Write_Commands=2270597693 Controller_Busy_Time=5846 Power_Cycles=22 Power_On_Hours=1268 Unsafe_Shutdowns=7 Media_and_Data_Integrity_Errors=0 Error_Information_Log_Entries=10 Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0 Temperature_Sensor_1=25 Temperature_Sensor_2=34

This may cause problems when automatically creating graphs from the performance data.

Version 6.11 now removes the dots from the attribute names internally. Warning__Comp._Temperature_Time therefore becomes Warning__Comp_Temperature_Time.
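The renaming can be illustrated with a short sketch. check_smart itself is written in Perl; the Python below is not the plugin's code, and the function names sanitize_label and sanitize_perfdata are hypothetical. It assumes performance data in the usual space-separated label=value form and strips dots from the label only, so a value containing a decimal point would stay untouched.

```python
def sanitize_label(name: str) -> str:
    """Remove every dot from a performance data label."""
    return name.replace(".", "")


def sanitize_perfdata(perfdata: str) -> str:
    """Strip dots from the label of each space-separated label=value pair."""
    cleaned = []
    for pair in perfdata.split():
        # Split at the first "=" only; the value part is left unchanged.
        label, sep, value = pair.partition("=")
        cleaned.append(sanitize_label(label) + sep + value)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(sanitize_perfdata(
        "Warning__Comp._Temperature_Time=0 Critical_Comp._Temperature_Time=0"
    ))
```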

Prioritize output by criticality

When running check_smart with the -g parameter (to check multiple drives at the same time), the plugin would simply return all drives with a "non-ok" state in the order they were parsed. This also means that the plugin did not distinguish between drives with a warning state and drives with a critical state.

As discussed in issue 70 with reporter Peter Newman, the best behaviour would be to show all the "critical drives" first, followed by the "warning drives" and finally the "ok drives".

Version 6.11 internally handles the drives differently now. Instead of using "non-ok drives" and "ok drives", the non-ok drives are now split into "critical drives" and "warning drives". This allows a different priority and different sorting of these drives.
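A minimal sketch of this ordering, written in Python rather than the plugin's Perl and not taken from the check_smart source: the PRIORITY table, the order_by_criticality function and the device names are all assumptions made for illustration. Because Python's sort is stable, drives with the same state keep the order in which they were parsed.

```python
# Lower number means the drive is listed earlier in the output.
PRIORITY = {"CRITICAL": 0, "WARNING": 1, "OK": 2}


def order_by_criticality(drives):
    """Return (device, state) pairs: critical first, then warning, then ok."""
    return sorted(drives, key=lambda drive: PRIORITY[drive[1]])


if __name__ == "__main__":
    # Hypothetical drives, listed in the order they were parsed.
    parsed = [
        ("/dev/sda", "OK"),
        ("/dev/sdb", "WARNING"),
        ("/dev/sdc", "CRITICAL"),
        ("/dev/sdd", "WARNING"),
    ]
    for device, state in order_by_criticality(parsed):
        print(f"{state}: {device}")
```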

Another change was made for attributes which use a warning threshold (set with the -w parameter). If the threshold is not yet reached, the affected attribute is now handled as "notice". An example can be seen in the following case.

Before version 6.11.0, attributes would show up in their lookup order, even when different thresholds were given:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500), Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47246 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2

Note that the attribute Reallocated_Sector_Ct has a threshold of 500, which is not yet reached. Yet this attribute shows up at the beginning of the plugin's output - because the output is in the same order as the list of attributes (use --debug parameter to see the list of attributes of the relevant drive).

Starting with 6.11.0, the output is sorted. Reallocated_Sector_Ct now shows up last, as it is considered a "notice" only:

root@server:~# ./check_smart.pl -d /dev/sg3 -i sat -w "Reallocated_Sector_Ct=500"
WARNING: Drive  WD2000FYYZ-23UL S/N XXX:  Reallocated_Event_Count is non-zero (36), Current_Pending_Sector is non-zero (2), Offline_Uncorrectable is non-zero (2), Reallocated_Sector_Ct is non-zero (372) (but less than threshold 500)|Raw_Read_Error_Rate=0 Spin_Up_Time=4500 Start_Stop_Count=79 Reallocated_Sector_Ct=372 Seek_Error_Rate=0 Power_On_Hours=47247 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=35 G-Sense_Error_Rate=1 Power-Off_Retract_Count=30 Load_Cycle_Count=48 Temperature_Celsius=36 Hardware_ECC_Recovered=0 Reallocated_Event_Count=36 Current_Pending_Sector=2 Offline_Uncorrectable=2 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=2
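The reordering of the attribute messages can be sketched as follows. This is a simplified Python illustration, not the plugin's Perl implementation: the classify and build_message functions are hypothetical, only the four attributes from the example are passed in, and the case of a value at or above its threshold is reduced to a plain warning message.

```python
# Warnings are printed before notices.
RANK = {"warning": 0, "notice": 1}


def classify(name, value, thresholds):
    """Return (level, message) for a non-zero attribute, or None if it is zero."""
    if value == 0:
        return None
    limit = thresholds.get(name)
    if limit is not None and value < limit:
        # Non-zero, but still below the threshold given with -w.
        return ("notice", f"{name} is non-zero ({value}) (but less than threshold {limit})")
    return ("warning", f"{name} is non-zero ({value})")


def build_message(attributes, thresholds):
    """Join the messages with warnings first; lookup order is kept per level."""
    found = []
    for name, value in attributes:
        result = classify(name, value, thresholds)
        if result is not None:
            found.append(result)
    found.sort(key=lambda item: RANK[item[0]])  # stable sort
    return ", ".join(message for _, message in found)


if __name__ == "__main__":
    # Values taken from the example output, in lookup order.
    attributes = [
        ("Reallocated_Sector_Ct", 372),
        ("Reallocated_Event_Count", 36),
        ("Current_Pending_Sector", 2),
        ("Offline_Uncorrectable", 2),
    ]
    print(build_message(attributes, {"Reallocated_Sector_Ct": 500}))
```

With these inputs, the sketch lists the three attributes without a threshold first and Reallocated_Sector_Ct last, matching the order of the 6.11.0 output shown above.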

