Leaving a party without saying bye: Western Digital Green SSD dead without pre-fail indications


Listed in: Hardware, Monitoring, Linux


When a hard drive or solid state drive dies, it usually happens after defective sectors have already been re-allocated (a pre-fail indication). But not always.

Yesterday, our Icinga monitoring (using the check_smart monitoring plugin) reported a defective SSD in our test and build server.

This is a Western Digital Green SSD with 240GB capacity.

The server's kernel log confirmed:

root@irbwsrvp01 ~ # cat /var/log/kern.log.1
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160797] ata5.00: exception Emask 0x0 SAct 0x1c00000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160879] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160949] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.161129] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.163257] ata5: EH complete
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183671] ata5.00: exception Emask 0x0 SAct 0x3800000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183754] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183824] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183903] ata5.00: cmd 61/20:b8:a0:cc:be/00:00:1a:00:00/40 tag 23 ncq dma 16384 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183903]          res 41/04:00:00:cd:be/00:00:1a:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.184009] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.184077] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.185015] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.185037] ata5: EH complete
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235684] ata5.00: exception Emask 0x0 SAct 0x7400000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235777] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235859] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235946] ata5.00: cmd 61/10:b0:a8:7b:89/00:00:00:00:00/40 tag 22 ncq dma 8192 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235946]          res 41/04:00:08:7c:89/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.236097] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.236177] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.237111] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263746] ata5.00: NCQ disabled due to excessive errors
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263751] ata5.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263855] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263937] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264025] ata5.00: cmd 61/20:f0:a0:cc:be/00:00:1a:00:00/40 tag 30 ncq dma 16384 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264025]          res 41/04:00:00:cd:be/00:00:1a:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264176] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264256] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.265189] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.283631] ata5.00: irq_stat 0x40000001
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.283947] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.319704] ata5.00: irq_stat 0x40000001
[...]
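With this many repeated entries, a per-device count of failed commands gives a quicker overview than reading the raw log. The following is a minimal sketch; the function name count_failed_ata_commands is made up for this example, and it assumes the "failed command:" line format shown in the log above.

```shell
#!/bin/sh
# Hypothetical helper: count failed ATA commands per device and command type.
# Reads kernel log text from stdin and prints "<count> <device> <command>".
count_failed_ata_commands() {
    # Matches lines such as "... ata5.00: failed command: WRITE FPDMA QUEUED"
    sed -n 's/.* \(ata[0-9.]*\): failed command: \(.*\)$/\1 \2/p' |
        sort | uniq -c |
        awk '{ $1 = $1; print }'   # strip the padding that uniq -c adds
}

# Usage (log path as used in this article):
#   count_failed_ata_commands < /var/log/kern.log.1
```

A single device with only failed write commands, as seen here, points at the drive itself rather than at a controller-wide problem, although that remains an interpretation and not a proof.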

The software RAID, managed by mdadm, had already marked the drive as failed in the RAID-1 array.
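A failed member shows up in /proc/mdstat with an (F) flag behind its name. The sketch below lists such members; the function name failed_md_members is hypothetical, and the array name md0 in the usage comment is an assumption, since the article does not name the array.

```shell
#!/bin/sh
# Hypothetical helper: list RAID members flagged as failed.
# Reads /proc/mdstat content from stdin and prints "<array> <member>".
failed_md_members() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++) {
            if ($i ~ /\(F\)$/) {
                member = $i
                # Turn "sdb1[1](F)" into "sdb1"
                sub(/\[[0-9]+\]\(F\)$/, "", member)
                print $1, member
            }
        }
    }'
}

# Usage:
#   failed_md_members < /proc/mdstat
# More detail on one array (md0 is an assumed name):
#   mdadm --detail /dev/md0
```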

The drive's SMART information confirmed the failure: the overall-health self-assessment reported the drive as FAILED.

=== START OF INFORMATION SECTION ===
Device Model:     WDC WDS240G2G0A-00JH30
Serial Number:    1XXXXXXXXXX1
LU WWN Device Id: 5 001b44 8b8882bf8
Firmware Version: UF500000
User Capacity:    240,065,183,744 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul 21 13:03:30 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.
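For scripting, the verdict line can be pulled out of this output. This is a rough sketch only: the function names smart_health and is_smart_healthy are invented for this example, and /dev/sdX in the usage comment is a placeholder, as the article does not say which device node the drive had.

```shell
#!/bin/sh
# Hypothetical helper: print the overall-health verdict ("PASSED", "FAILED!").
# Reads `smartctl -H` output from stdin.
smart_health() {
    sed -n 's/^SMART overall-health self-assessment test result: *//p'
}

# Returns success only if the verdict is exactly PASSED.
is_smart_healthy() {
    [ "$(smart_health)" = "PASSED" ]
}

# Usage (/dev/sdX is a placeholder device):
#   smartctl -i -H /dev/sdX | smart_health
#   smartctl -H /dev/sdX | is_smart_healthy || echo "drive needs attention"
```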

One fact remains very interesting, though: no attribute (such as Current_Pending_Sector or Reallocated_Sector_Ct) showed any sign of an impending failure. The monitoring graphs for Reallocated_Sector_Ct and Reported_Uncorrect confirm this.
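The same check can be done by hand against the attribute table. The sketch below prints any of the named attributes whose raw value is above zero; the function name nonzero_smart_attributes is hypothetical, and it assumes the usual ten-column layout of `smartctl -A` for ATA drives, with the raw value in the tenth column. Empty output means no pre-fail indication, which matches what happened with this drive.

```shell
#!/bin/sh
# Hypothetical helper: report watched SMART attributes with a raw value > 0.
# $1: space-separated attribute names; stdin: `smartctl -A` output.
# Assumes the layout: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
nonzero_smart_attributes() {
    awk -v names="$1" '
        BEGIN {
            n = split(names, list, " ")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        ($2 in want) && ($10 + 0) > 0 { print $2, $10 }
    '
}

# Usage (/dev/sdX is a placeholder device):
#   smartctl -A /dev/sdX |
#       nonzero_smart_attributes "Reallocated_Sector_Ct Current_Pending_Sector Reported_Uncorrect"
```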

So this drive just left the party out of nowhere, after a runtime of around 9833 hours (~ 409 days).
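The conversion from hours to days is plain integer arithmetic. A tiny sketch, with hours_to_days being a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: convert a runtime in hours to whole days plus remainder.
hours_to_days() {
    echo "$(( $1 / 24 )) days $(( $1 % 24 )) hours"
}

# Usage:
#   hours_to_days 9833    # prints "409 days 17 hours"
```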

