Leaving a party without saying bye: Western Digital Green SSD dead without pre-fail indications


Listed in: Hardware, Monitoring, Linux


When a hard drive or solid state drive dies, it usually happens after defective sectors have already been re-allocated, which serves as a pre-fail indication. But not always.

Yesterday, our Icinga monitoring (using the check_smart monitoring plugin) reported a defective SSD on our test and build server.

This is a Western Digital Green SSD with 240GB capacity.

The server's kernel log confirmed:

root@irbwsrvp01 ~ # cat /var/log/kern.log.1
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160797] ata5.00: exception Emask 0x0 SAct 0x1c00000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160879] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.160949] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.161129] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.163257] ata5: EH complete
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183671] ata5.00: exception Emask 0x0 SAct 0x3800000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183754] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183824] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183903] ata5.00: cmd 61/20:b8:a0:cc:be/00:00:1a:00:00/40 tag 23 ncq dma 16384 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.183903]          res 41/04:00:00:cd:be/00:00:1a:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.184009] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.184077] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.185015] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.185037] ata5: EH complete
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235684] ata5.00: exception Emask 0x0 SAct 0x7400000 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235777] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235859] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235946] ata5.00: cmd 61/10:b0:a8:7b:89/00:00:00:00:00/40 tag 22 ncq dma 8192 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.235946]          res 41/04:00:08:7c:89/00:00:00:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.236097] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.236177] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.237111] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263746] ata5.00: NCQ disabled due to excessive errors
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263751] ata5.00: exception Emask 0x0 SAct 0x40000007 SErr 0x0 action 0x0
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263855] ata5.00: irq_stat 0x40000008
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.263937] ata5.00: failed command: WRITE FPDMA QUEUED
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264025] ata5.00: cmd 61/20:f0:a0:cc:be/00:00:1a:00:00/40 tag 30 ncq dma 16384 out
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264025]          res 41/04:00:00:cd:be/00:00:1a:00:00/00 Emask 0x401 (device error) <F>
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264176] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.264256] ata5.00: error: { ABRT }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.265189] ata5.00: configured for UDMA/133 (device error ignored)
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.283631] ata5.00: irq_stat 0x40000001
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.283947] ata5.00: status: { DRDY ERR }
Jul 20 20:23:13 irbwsrvp01 kernel: [5642310.319704] ata5.00: irq_stat 0x40000001
[...]
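
Errors like these can be tallied per device to see how often a command failed. The following is a minimal Python sketch, assuming the libata message format shown in the log above; the function name count_failed_commands is made up for illustration and is not part of any tool mentioned here.

```python
import re
from collections import Counter

# Matches libata lines such as:
#   "... kernel: [5642310.160949] ata5.00: failed command: WRITE FPDMA QUEUED"
FAILED_CMD = re.compile(r"\b(ata\d+\.\d+): failed command: (.+?)\s*$")


def count_failed_commands(lines):
    """Count failed ATA commands, keyed by (device, command)."""
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle on /var/log/kern.log.1 can be passed in directly.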

The software RAID, managed by mdadm, had already marked the drive as failed in the RAID-1 array.
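
Failed array members also show up in /proc/mdstat, where the kernel appends (F) to the device name. The following Python sketch pulls those members out of the mdstat text. It assumes the usual line layout such as "md0 : active raid1 sdb1[1](F) sda1[0]"; the array and device names are hypothetical, and failed_members is a name made up for illustration.

```python
import re

# An array line looks like: "md0 : active raid1 sdb1[1](F) sda1[0]"
ARRAY_LINE = re.compile(r"^(md\w+) : (\S+) (.*)$")
# A member looks like "sdb1[1]" optionally followed by flags such as "(F)" or "(S)"
MEMBER = re.compile(r"(\S+?)\[\d+\]((?:\([A-Z]\))*)")


def failed_members(mdstat_text):
    """Map each md array to the list of its members flagged (F) as failed."""
    result = {}
    for line in mdstat_text.splitlines():
        match = ARRAY_LINE.match(line)
        if not match:
            continue
        members = MEMBER.findall(match.group(3))
        result[match.group(1)] = [dev for dev, flags in members if "(F)" in flags]
    return result
```

Feeding it the contents of /proc/mdstat returns a dictionary in which an empty list means that no member of that array is currently flagged as failed.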

The drive's SMART information showed that the overall-health self-assessment had FAILED:

=== START OF INFORMATION SECTION ===
Device Model:     WDC WDS240G2G0A-00JH30
Serial Number:    1XXXXXXXXXX1
LU WWN Device Id: 5 001b44 8b8882bf8
Firmware Version: UF500000
User Capacity:    240,065,183,744 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jul 21 13:03:30 2020 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
No failed Attributes found.

One fact remains very interesting though: no attributes (such as Current_Pending_Sector or Reallocated_Sector_Ct) showed any signs of a pre-failure. The monitoring graphs for Reallocated_Sector_Ct and Reported_Uncorrect confirm this.
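
This combination, a FAILED self-assessment with clean attributes, can be detected programmatically. The following Python sketch assumes the text output of "smartctl -H -A" with the classic ten-column attribute table. The names parse_smartctl and is_silent_failure are made up for illustration and are not part of check_smart or smartmontools.

```python
import re

HEALTH = re.compile(r"self-assessment test result:\s*(\S+)")
# Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)$"
)


def parse_smartctl(text):
    """Return (health, attrs) parsed from smartctl -H -A text output."""
    health = None
    attrs = {}
    for line in text.splitlines():
        match = HEALTH.search(line)
        if match:
            health = match.group(1).rstrip("!")  # "FAILED!" -> "FAILED"
            continue
        match = ATTR_ROW.match(line)
        if match:
            attrs[match.group(2)] = {
                "value": int(match.group(3)),
                "thresh": int(match.group(5)),
                "when_failed": match.group(8),
                "raw": match.group(9).strip(),
            }
    return health, attrs


def is_silent_failure(health, attrs,
                      watch=("Reallocated_Sector_Ct", "Reported_Uncorrect")):
    """True if health is FAILED although no attribute hints at a problem."""
    if health != "FAILED":
        return False
    none_flagged = all(a["when_failed"] == "-" for a in attrs.values())
    counters_zero = all(
        attrs[name]["raw"].split()[0] == "0" for name in watch if name in attrs
    )
    return none_flagged and counters_zero
```

The watch list defaults to the two attributes graphed here; which counters are meaningful depends on the drive model, so treat it as a starting point.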

So this drive simply left the party without warning, after a runtime of around 9833 hours (roughly 410 days).

