Monitoring plugin check_smart 5.11 available, introducing exclude list

Listed in: Monitoring, Hardware, Linux, BSD, Icinga, Nagios


A new version of check_smart, the monitoring plugin that checks the SMART attributes of hard drives and solid state drives, is out.

Version 5.11 introduces a new parameter "-e" or "--exclude", which defines an exclude list (aka ignore list).

The exclude list is a comma-separated list of strings. It tells the plugin which SMART attributes to ignore, even if they are in a failing or failed state.
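
To make the idea of a comma-separated exclude list more concrete, here is a minimal sketch in Python. It is not the plugin's actual (Perl) code: the function names are hypothetical, and the exact, case-sensitive matching as well as the whitespace stripping are assumptions made for illustration.

```python
def parse_exclude_list(raw):
    """Split a comma-separated exclude string into a set of entries."""
    # Assumption: surrounding whitespace is ignored and empty entries
    # (for example from a trailing comma) are dropped.
    return {entry.strip() for entry in raw.split(",") if entry.strip()}


def is_excluded(attribute_name, exclude):
    """Return True if the attribute name appears verbatim in the exclude set."""
    # Assumption: matching is exact and case-sensitive.
    return attribute_name in exclude


exclude = parse_exclude_list("Temperature_Celsius,Current_Pending_Sector")
print(is_excluded("Temperature_Celsius", exclude))    # True
print(is_excluded("Reallocated_Sector_Ct", exclude))  # False
```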

Let's take a temperature attribute that failed in the past as an example:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
194 Temperature_Celsius     0x0002   113   113   000    Old_age   Always  In_the_past  53 (Lifetime Min/Max 25/62)

Without the exclude list, the plugin returns a WARNING when the temperature SMART attribute has failed at some point in the past:

# ./check_smart.pl -d /dev/sda -i sat
WARNING: Attribute Temperature_Celsius failed at In_the_past|Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

It's useful to know that the attribute once failed in the past. But once we know, we move on and want the warning to disappear. With the exclude list, the plugin can be told to ignore the attribute "Temperature_Celsius":

# ./check_smart.pl -d /dev/sda -i sat -e Temperature_Celsius
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

And hurray, no alert anymore. 

But this can be a bit dangerous. What if the drive raises a new (live!) temperature alert? You'd certainly want to know about it. That's why, besides excluding a SMART attribute by name, it is also possible to exclude certain values of the "WHEN_FAILED" column. In the following example, the "WHEN_FAILED" value "In_the_past" (as seen above) is used in the exclude list:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0

As you can see, the plugin no longer alerts on "Temperature_Celsius", because it found the "In_the_past" value in the "WHEN_FAILED" column and ignored it.
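
The difference between excluding an attribute name and excluding a "WHEN_FAILED" value can be sketched in a few lines of Python. This is an illustration only, not the plugin's Perl implementation: the function names are hypothetical, and the rule that an entry matches either the attribute name or the WHEN_FAILED value is an assumption based on the behaviour described above. "FAILING_NOW" is the value smartctl prints for an attribute that is currently failing.

```python
def parse_attribute_row(line):
    """Parse one row of the smartctl attribute table into a dict."""
    # The first nine columns are whitespace-separated; the raw value is
    # the remainder of the line and may itself contain spaces.
    fields = line.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "when_failed": fields[8],
        "raw_value": fields[9],
    }


def should_alert(row, exclude):
    """Decide whether a row should raise an alert, honouring the exclude set."""
    if row["when_failed"] == "-":
        return False  # the attribute never failed
    # Assumption: an entry may name either the attribute or a WHEN_FAILED value.
    return row["name"] not in exclude and row["when_failed"] not in exclude


past = parse_attribute_row(
    "194 Temperature_Celsius     0x0002   113   113   000    Old_age   "
    "Always  In_the_past  53 (Lifetime Min/Max 25/62)"
)
live = dict(past, when_failed="FAILING_NOW")  # hypothetical live failure

print(should_alert(past, {"In_the_past"}))          # False: old failure is ignored
print(should_alert(live, {"In_the_past"}))          # True: live failure still alerts
print(should_alert(live, {"Temperature_Celsius"}))  # False: attribute is silenced entirely
```

The last line shows why excluding by value is the safer choice here: excluding the attribute by name would also hide a live temperature failure.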

To ignore multiple entries, simply separate them with a comma:

# ./check_smart.pl -d /dev/sda -i sat -e "In_the_past","Current_Pending_Sector"
OK: no SMART errors detected. |Raw_Read_Error_Rate=0 Throughput_Performance=67 Spin_Up_Time=0 Start_Stop_Count=3 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Seek_Time_Performance=34 Power_On_Hours=10617 Spin_Retry_Count=0 Power_Cycle_Count=3 Power-Off_Retract_Count=3 Load_Cycle_Count=3 Temperature_Celsius=53 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0
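
Note that the double quotes in the command above are consumed by the shell, so the plugin receives a single argument, In_the_past,Current_Pending_Sector. If the check is launched from a script instead of a shell, the same argument can be built by joining the entries with commas. A minimal Python sketch follows; the helper name build_check_command is hypothetical, and only the options shown in this article (-d, -i, -e) are used.

```python
import shlex


def build_check_command(device, interface, exclude_entries,
                        plugin="./check_smart.pl"):
    """Build the argument list for a check_smart call with an exclude list."""
    args = [plugin, "-d", device, "-i", interface]
    if exclude_entries:
        # All entries go into one comma-separated argument after -e.
        args += ["-e", ",".join(exclude_entries)]
    return args


cmd = build_check_command("/dev/sda", "sat",
                          ["In_the_past", "Current_Pending_Sector"])
print(shlex.join(cmd))
# ./check_smart.pl -d /dev/sda -i sat -e In_the_past,Current_Pending_Sector
```

The resulting list can be passed as-is to subprocess.run(), which avoids any shell quoting questions.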

But make sure you're not shooting yourself in the foot with this. The main reason the exclude list was created in the first place is clearly the temperature attribute.

