Disk failures are not always seen by S.M.A.R.T.



This weekend I had a disk failure in a server. The problem: the disk was only partially failing.
If the disk had failed completely, the RAID would simply have disabled it and everything would have continued running on the second disk. But both SMART and mdraid still considered the disk healthy. So as soon as the first disk problems hit the system, mdraid kept trying to resynchronize the disks, causing enormous load (up to 100!) whenever a bad sector was hit. This made the server completely unusable.
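
For reference, this is roughly how the RAID and load situation can be checked while it happens (a minimal sketch; the array name /dev/md0 is an assumption and may differ on your system):

# Show the state of all md arrays and any running resync
cat /proc/mdstat
# Detailed per-member status of the array (assumed here to be /dev/md0)
mdadm --detail /dev/md0
# Watch the load average climb while the resync hits the bad sectors
uptime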

After some research I found the following entries in kernel.log which clearly indicated a problem on the first disk:

Nov  2 21:59:06 server kernel: [7087493.950091] ata1.00: exception Emask 0x0 SAct 0xf SErr 0x0 action 0x6 frozen
Nov  2 21:59:06 server kernel: [7087493.950141] ata1.00: failed command: WRITE FPDMA QUEUED
Nov  2 21:59:06 server kernel: [7087493.950173] ata1.00: cmd 61/08:00:a8:5b:94/00:00:36:00:00/40 tag 0 ncq 4096 out
Nov  2 21:59:06 server kernel: [7087493.950174]          res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
Nov  2 21:59:06 server kernel: [7087493.950260] ata1.00: status: { DRDY }
Nov  2 21:59:06 server kernel: [7087493.950283] ata1.00: failed command: WRITE FPDMA QUEUED
Nov  2 21:59:06 server kernel: [7087493.950313] ata1.00: cmd 61/08:08:30:47:4a/00:00:04:00:00/40 tag 1 ncq 4096 out
Nov  2 21:59:06 server kernel: [7087493.950315]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov  2 21:59:06 server kernel: [7087493.950400] ata1.00: status: { DRDY }
Nov  2 21:59:06 server kernel: [7087493.950423] ata1.00: failed command: WRITE FPDMA QUEUED
Nov  2 21:59:07 server kernel: [7087493.950453] ata1.00: cmd 61/08:10:b0:5b:94/00:00:36:00:00/40 tag 2 ncq 4096 out
Nov  2 21:59:07 server kernel: [7087493.950455]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Nov  2 21:59:07 server kernel: [7087493.950540] ata1.00: status: { DRDY }
Nov  2 21:59:07 server kernel: [7087493.950564] ata1.00: failed command: WRITE FPDMA QUEUED
Nov  2 21:59:07 server kernel: [7087493.950593] ata1.00: cmd 61/08:18:a0:a7:f9/00:00:35:00:00/40 tag 3 ncq 4096 out
Nov  2 21:59:07 server kernel: [7087493.950595]          res 40/00:00:40:25:10/00:04:36:00:00/00 Emask 0x4 (timeout)
Nov  2 21:59:07 server kernel: [7087493.950680] ata1.00: status: { DRDY }
Nov  2 21:59:07 server kernel: [7087493.950709] ata1: hard resetting link
Nov  2 21:59:07 server kernel: [7087494.493817] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov  2 21:59:07 server kernel: [7087494.960980] ata1.00: configured for UDMA/133
Nov  2 21:59:07 server kernel: [7087494.961009] ata1.00: device reported invalid CHS sector 0
Nov  2 21:59:07 server kernel: [7087494.961036] ata1.00: device reported invalid CHS sector 0
Nov  2 21:59:07 server kernel: [7087494.961063] ata1.00: device reported invalid CHS sector 0
Nov  2 21:59:07 server kernel: [7087494.961097] ata1: EH complete
Nov  2 23:44:04 server kernel: [7093789.503985] INFO: task jbd2/dm-2-8:970 blocked for more than 120 seconds.
Nov  2 23:44:04 server kernel: [7093789.504018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  2 23:44:04 server kernel: [7093789.504063] jbd2/dm-2-8   D 0000000000000002     0   970      2 0x00000000
Nov  2 23:44:04 server kernel: [7093789.504097]  ffff88043b3df810 0000000000000046 000000039d742300 ffff880439ec4f10
Nov  2 23:44:04 server kernel: [7093789.504145]  ffff88043adb6900 0000000000000000 000000000000f9e0 ffff88043a4c9fd8
Nov  2 23:44:04 server kernel: [7093789.504193]  0000000000015780 0000000000015780 ffff88043b3dcdb0 ffff88043b3dd0a8
Nov  2 23:44:04 server kernel: [7093789.504240] Call Trace:
Nov  2 23:44:04 server kernel: [7093789.504273]  [] ? dm_table_unplug_all+0x4b/0xb4 [dm_mod]
Nov  2 23:44:04 server kernel: [7093789.504308]  [] ? read_tsc+0xa/0x20
Nov  2 23:44:04 server kernel: [7093789.504337]  [] ? sync_buffer+0x0/0x40
Nov  2 23:44:04 server kernel: [7093789.504366]  [] ? io_schedule+0x73/0xb7
Nov  2 23:44:04 server kernel: [7093789.504394]  [] ? sync_buffer+0x3b/0x40
Nov  2 23:44:04 server kernel: [7093789.504422]  [] ? __wait_on_bit+0x41/0x70
Nov  2 23:44:04 server kernel: [7093789.504450]  [] ? sync_buffer+0x0/0x40
Nov  2 23:44:04 server kernel: [7093789.504478]  [] ? out_of_line_wait_on_bit+0x6b/0x77
Nov  2 23:44:04 server kernel: [7093789.504509]  [] ? wake_bit_function+0x0/0x23
Nov  2 23:44:04 server kernel: [7093789.504541]  [] ? jbd2_journal_commit_transaction+0xbe5/0x115b [jbd2]
Nov  2 23:44:04 server kernel: [7093789.504591]  [] ? kjournald2+0xbe/0x206 [jbd2]
Nov  2 23:44:04 server kernel: [7093789.504620]  [] ? autoremove_wake_function+0x0/0x2e
Nov  2 23:44:04 server kernel: [7093789.504652]  [] ? kjournald2+0x0/0x206 [jbd2]
Nov  2 23:44:04 server kernel: [7093789.504681]  [] ? kthread+0x79/0x81
Nov  2 23:44:04 server kernel: [7093789.504709]  [] ? child_rip+0xa/0x20
Nov  2 23:44:04 server kernel: [7093789.504735]  [] ? kthread+0x0/0x81
Nov  2 23:44:04 server kernel: [7093789.504762]  [] ? child_rip+0x0/0x20

From then on, entries with "INFO: task XYZ blocked for more than 120 seconds" appeared hundreds of times in the log.
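
To get an idea of how often these hung-task messages piled up, a simple grep over the kernel log is enough (the path /var/log/kern.log is an assumption; it depends on the distribution's syslog configuration):

# Count the hung-task warnings in the kernel log
grep -c "blocked for more than 120 seconds" /var/log/kern.log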

At the same time I also checked the SMART status of the affected disk ata1.00 (that's /dev/sda), but no problems were reported. However, I knew that I had already received SMART errors on exactly that disk in the past, and they disappeared by themselves after some time (?? WTF ??).
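
The check itself was nothing special, just the usual smartmontools commands (a sketch; the device name is of course specific to this server):

# Overall SMART health verdict (still reported PASSED on the failing disk)
smartctl -H /dev/sda
# Full attribute table, self-test results and error log for a closer look
smartctl -a /dev/sda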

As soon as I took /dev/sda out of the RAID, the system went back to normal and was usable again. After the disk was replaced, the RAID synchronization started and after that the system ran smoothly again.
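
For the record, taking the member out of the array and adding the replacement back in boils down to the usual mdadm steps (a sketch; /dev/md0 and the partition /dev/sda1 are assumptions, adapt them to your layout):

# Mark the failing member as faulty and remove it from the array
mdadm /dev/md0 --fail /dev/sda1
mdadm /dev/md0 --remove /dev/sda1
# After swapping and partitioning the new disk, add it back; the resync starts automatically
mdadm /dev/md0 --add /dev/sda1
# Follow the resynchronization progress
watch cat /proc/mdstat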

Another hint of such a disk error shows up when one compares the disk IO of all disks in the same RAID. The IO graphs of the two disks clearly differed, although in a mirrored RAID both disks should show exactly the same graph.
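
If no such graphing is in place, the same comparison can be made on the command line with iostat from the sysstat package (a sketch; it prints extended per-device statistics every 5 seconds):

# Compare throughput and utilization of both RAID members side by side
iostat -x sda sdb 5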

Today's lesson: Don't rely only on SMART or raid status to determine if a disk has a failure.



Comments (newest first)

Tim from Edmonton wrote on Nov 4th, 2012:

Huh, thanks for the info. If nothing else, it reminds me to change the monitoring mode in check_mk to monitor each drive individually instead of taking an overall IO summary.



