And another hard drive has said farewell: A Western Digital (WD) Green 2 TB, device model WDC WD20EARX-00PASB0.
This particular drive was one of the original data drives in my NAS (running Debian Linux on an HP N40L micro server). It was part of a RAID-1 array with another 2 TB drive and part of an LVM volume group spanning 4 drives.
Because the NAS is not used that often (primarily for backups), the drives were never under constant load. This is probably why this particular drive reached the very high age of 63791 power-on hours (roughly 7.3 years of runtime).
Note: The second drive in this RAID-1 array is still alive, and its power-on hours keep counting!
I noticed the NAS was hanging during a file copy from my Linux desktop - I even had to cancel the copy process. A couple of minutes later, my monitoring (Icinga 2, checking via NRPE on the NAS) alerted me:
Disk Raid Status on bw-nas is CRITICAL!
Info: CRITICAL: mdstat:[md1(2.73 TiB raid1):UU, md0(1.82 TiB raid1):F::_U]
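As an illustration of where such a status can come from (this is a sketch, not the actual check_raid code): on Linux, /proc/mdstat lists a member map per array, such as [UU] or [_U], where an underscore marks a missing member. The function name degraded_arrays is made up for this example.

```shell
#!/bin/sh
# Print "<array> <member map>" for every md array that has a missing member.
# Reads /proc/mdstat-formatted text from stdin, so it can be fed a saved copy.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }          # array header line, e.g. "md0 : active raid1 ..."
        name != "" && /\[[U_]+\]/ {          # status line ending in the member map
            map = $NF
            if (map ~ /_/) print name, map   # "_" means a member is missing
            name = ""
        }
    '
}

# Example call on a live system:
#   degraded_arrays < /proc/mdstat
```

An empty output means every array found in the input reports all members as up.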
The monitoring plugin check_raid detected that one drive had been taken out of the array (md0). The kernel logs reveal that communication with this drive (sda) kept hanging until the kernel's md driver marked the drive as failed and removed it from the array:
root@nas:~# cat /var/log/kern.log
Oct 1 10:35:08 nas kernel: [11560951.464470] ata1.00: exception Emask 0x0 SAct 0x80000 SErr 0x0 action 0x6 frozen
Oct 1 10:35:08 nas kernel: [11560951.464633] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 1 10:35:08 nas kernel: [11560951.464749] ata1.00: cmd 61/02:98:2a:00:00/00:00:00:00:00/40 tag 19 ncq dma 1024 out
Oct 1 10:35:08 nas kernel: [11560951.464749] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 10:35:08 nas kernel: [11560951.465033] ata1.00: status: { DRDY }
Oct 1 10:35:08 nas kernel: [11560951.465112] ata1: hard resetting link
Oct 1 10:35:18 nas kernel: [11560961.477221] ata1: softreset failed (device not ready)
Oct 1 10:35:18 nas kernel: [11560961.477341] ata1: hard resetting link
Oct 1 10:35:28 nas kernel: [11560971.489938] ata1: softreset failed (device not ready)
Oct 1 10:35:28 nas kernel: [11560971.490055] ata1: hard resetting link
Oct 1 10:35:32 nas kernel: [11560975.502262] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Oct 1 10:35:32 nas kernel: [11560975.517863] ata1.00: configured for UDMA/133
Oct 1 10:35:32 nas kernel: [11560975.517893] ata1: EH complete
Oct 1 10:36:03 nas kernel: [11561006.764714] ata1.00: exception Emask 0x0 SAct 0x100 SErr 0x0 action 0x6 frozen
Oct 1 10:36:03 nas kernel: [11561006.764871] ata1.00: failed command: WRITE FPDMA QUEUED
Oct 1 10:36:03 nas kernel: [11561006.764987] ata1.00: cmd 61/08:40:58:1c:51/00:00:1d:00:00/40 tag 8 ncq dma 4096 out
Oct 1 10:36:03 nas kernel: [11561006.764987] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 10:36:03 nas kernel: [11561006.765270] ata1.00: status: { DRDY }
Oct 1 10:36:03 nas kernel: [11561006.765351] ata1: hard resetting link
Oct 1 10:36:13 nas kernel: [11561016.781456] ata1: softreset failed (device not ready)
Oct 1 10:36:13 nas kernel: [11561016.781573] ata1: hard resetting link
Oct 1 10:36:23 nas kernel: [11561026.794230] ata1: softreset failed (device not ready)
Oct 1 10:36:23 nas kernel: [11561026.794348] ata1: hard resetting link
Oct 1 10:36:34 nas kernel: [11561037.347044] ata1: link is slow to respond, please be patient (ready=0)
Oct 1 10:36:58 nas kernel: [11561061.836907] ata1: softreset failed (device not ready)
Oct 1 10:36:58 nas kernel: [11561061.837033] ata1: limiting SATA link speed to 1.5 Gbps
Oct 1 10:36:58 nas kernel: [11561061.837037] ata1: hard resetting link
Oct 1 10:37:03 nas kernel: [11561067.049328] ata1: softreset failed (device not ready)
Oct 1 10:37:03 nas kernel: [11561067.049452] ata1: reset failed, giving up
Oct 1 10:37:03 nas kernel: [11561067.049537] ata1.00: disabled
Oct 1 10:37:03 nas kernel: [11561067.049569] ata1: EH complete
Oct 1 10:37:03 nas kernel: [11561067.049620] sd 0:0:0:0: [sda] tag#9 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 1 10:37:03 nas kernel: [11561067.049628] sd 0:0:0:0: [sda] tag#9 CDB: Write(10) 2a 00 1d 51 1c 58 00 00 08 00
Oct 1 10:37:03 nas kernel: [11561067.049632] blk_update_request: I/O error, dev sda, sector 491854936
Oct 1 10:37:03 nas kernel: [11561067.049829] md/raid1:md0: Disk failure on sda1, disabling device.
Oct 1 10:37:03 nas kernel: [11561067.049829] md/raid1:md0: Operation continuing on 1 devices.
Oct 1 10:37:04 nas kernel: [11561067.473249] RAID1 conf printout:
Oct 1 10:37:04 nas kernel: [11561067.473254] --- wd:1 rd:2
Oct 1 10:37:04 nas kernel: [11561067.473258] disk 0, wo:1, o:0, dev:sda1
Oct 1 10:37:04 nas kernel: [11561067.473261] disk 1, wo:0, o:1, dev:sdb1
Oct 1 10:37:04 nas kernel: [11561067.473263] RAID1 conf printout:
Oct 1 10:37:04 nas kernel: [11561067.473265] --- wd:1 rd:2
Oct 1 10:37:04 nas kernel: [11561067.473267] disk 1, wo:0, o:1, dev:sdb1
Oct 1 10:38:02 nas kernel: [11561125.390084] ata1: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Oct 1 10:38:02 nas kernel: [11561125.390233] ata1: irq_stat 0x00400000, PHY RDY changed
Oct 1 10:38:02 nas kernel: [11561125.390333] ata1: SError: { Persist PHYRdyChg }
Oct 1 10:38:02 nas kernel: [11561125.390429] ata1: hard resetting link
Oct 1 10:38:12 nas kernel: [11561135.394420] ata1: softreset failed (device not ready)
Oct 1 10:38:12 nas kernel: [11561135.394534] ata1: hard resetting link
Oct 1 10:38:22 nas kernel: [11561145.403213] ata1: softreset failed (device not ready)
Oct 1 10:38:22 nas kernel: [11561145.403326] ata1: hard resetting link
Oct 1 10:38:33 nas kernel: [11561156.384023] ata1: link is slow to respond, please be patient (ready=0)
Oct 1 10:38:57 nas kernel: [11561180.417874] ata1: softreset failed (device not ready)
Oct 1 10:38:57 nas kernel: [11561180.417992] ata1: limiting SATA link speed to 1.5 Gbps
Oct 1 10:38:57 nas kernel: [11561180.417996] ata1: hard resetting link
Oct 1 10:39:02 nas kernel: [11561185.610277] ata1: softreset failed (device not ready)
Oct 1 10:39:02 nas kernel: [11561185.610438] ata1: reset failed, giving up
Oct 1 10:39:02 nas kernel: [11561185.610563] ata1: EH complete
Oct 1 10:39:02 nas kernel: [11561185.610584] ata1.00: detaching (SCSI 0:0:0:0)
Oct 1 10:39:02 nas kernel: [11561185.615666] sd 0:0:0:0: [sda] Stopping disk
Oct 1 10:39:02 nas kernel: [11561185.615718] sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Oct 1 10:39:02 nas kernel: [11561185.623057] md: unbind
Oct 1 10:39:02 nas kernel: [11561185.638348] md: export_rdev(sda1)
Oct 1 10:39:05 nas kernel: [11561188.628822] ata1: exception Emask 0x10 SAct 0x0 SErr 0x10200 action 0xe frozen
Oct 1 10:39:05 nas kernel: [11561188.628974] ata1: irq_stat 0x00400000, PHY RDY changed
Oct 1 10:39:05 nas kernel: [11561188.629073] ata1: SError: { Persist PHYRdyChg }
Oct 1 10:39:05 nas kernel: [11561188.629167] ata1: hard resetting link
Oct 1 10:39:15 nas kernel: [11561198.639291] ata1: softreset failed (device not ready)
Oct 1 10:39:15 nas kernel: [11561198.639460] ata1: hard resetting link
Oct 1 10:39:25 nas kernel: [11561208.644055] ata1: softreset failed (device not ready)
Oct 1 10:39:25 nas kernel: [11561208.644166] ata1: hard resetting link
Oct 1 10:39:36 nas kernel: [11561219.612892] ata1: link is slow to respond, please be patient (ready=0)
Oct 1 10:40:00 nas kernel: [11561243.674826] ata1: softreset failed (device not ready)
Oct 1 10:40:00 nas kernel: [11561243.674949] ata1: limiting SATA link speed to 1.5 Gbps
Oct 1 10:40:00 nas kernel: [11561243.674953] ata1: hard resetting link
Oct 1 10:40:04 nas kernel: [11561247.967172] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 1 10:40:04 nas kernel: [11561247.998848] ata1.00: ATA-8: WDC WD20EARX-00PASB0, 51.0AB51, max UDMA/133
Oct 1 10:40:04 nas kernel: [11561247.998855] ata1.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32), AA
Oct 1 10:40:04 nas kernel: [11561248.006775] ata1.00: configured for UDMA/133
Oct 1 10:40:04 nas kernel: [11561248.006793] ata1: EH complete
Oct 1 10:40:04 nas kernel: [11561248.007331] scsi 0:0:0:0: Direct-Access ATA WDC WD20EARX-00P AB51 PQ: 0 ANSI: 5
Oct 1 10:40:04 nas kernel: [11561248.051735] sd 0:0:0:0: Attached scsi generic sg0 type 0
Oct 1 10:40:04 nas kernel: [11561248.053542] sd 0:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
Oct 1 10:40:04 nas kernel: [11561248.053549] sd 0:0:0:0: [sda] 4096-byte physical blocks
Oct 1 10:40:04 nas kernel: [11561248.053739] sd 0:0:0:0: [sda] Write Protect is off
Oct 1 10:40:04 nas kernel: [11561248.053747] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
Oct 1 10:40:04 nas kernel: [11561248.053785] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Oct 1 10:40:04 nas kernel: [11561248.105116] sda: sda1
Oct 1 10:40:04 nas kernel: [11561248.106046] sd 0:0:0:0: [sda] Attached SCSI disk
Oct 1 10:41:04 nas kernel: [11561307.843799] ata1.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x6 frozen
Oct 1 10:41:04 nas kernel: [11561307.843961] ata1.00: failed command: READ FPDMA QUEUED
Oct 1 10:41:04 nas kernel: [11561307.844074] ata1.00: cmd 60/08:68:00:10:00/00:00:00:00:00/40 tag 13 ncq dma 4096 in
Oct 1 10:41:04 nas kernel: [11561307.844074] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 10:41:04 nas kernel: [11561307.844357] ata1.00: status: { DRDY }
Oct 1 10:41:04 nas kernel: [11561307.844437] ata1: hard resetting link
Oct 1 10:41:14 nas kernel: [11561317.860515] ata1: softreset failed (device not ready)
Oct 1 10:41:14 nas kernel: [11561317.860631] ata1: hard resetting link
Oct 1 10:41:24 nas kernel: [11561327.873279] ata1: softreset failed (device not ready)
Oct 1 10:41:24 nas kernel: [11561327.873396] ata1: hard resetting link
Oct 1 10:41:35 nas kernel: [11561338.430084] ata1: link is slow to respond, please be patient (ready=0)
Oct 1 10:41:55 nas kernel: [11561358.719633] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 1 10:41:55 nas kernel: [11561358.763339] ata1.00: configured for UDMA/133
Oct 1 10:41:55 nas kernel: [11561358.763378] ata1: EH complete
Oct 1 10:56:13 nas kernel: [11562216.712719] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Oct 1 10:56:13 nas kernel: [11562216.712876] ata1.00: failed command: SMART
Oct 1 10:56:13 nas kernel: [11562216.712971] ata1.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 23 pio 512 in
Oct 1 10:56:13 nas kernel: [11562216.712971] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 10:56:13 nas kernel: [11562216.713245] ata1.00: status: { DRDY }
Oct 1 10:56:13 nas kernel: [11562216.713325] ata1: hard resetting link
Oct 1 10:56:23 nas kernel: [11562226.725433] ata1: softreset failed (device not ready)
Oct 1 10:56:23 nas kernel: [11562226.725549] ata1: hard resetting link
Oct 1 10:56:33 nas kernel: [11562236.742173] ata1: softreset failed (device not ready)
Oct 1 10:56:33 nas kernel: [11562236.742293] ata1: hard resetting link
Oct 1 10:56:44 nas kernel: [11562247.295018] ata1: link is slow to respond, please be patient (ready=0)
Oct 1 10:56:57 nas kernel: [11562260.560020] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Oct 1 10:56:57 nas kernel: [11562260.622162] ata1.00: configured for UDMA/133
Oct 1 10:56:57 nas kernel: [11562260.622216] ata1: EH complete
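In a log this long, the two decisive events are easy to miss: the md failure message and the moments where libata gives up resetting the port. A minimal sketch for pulling them out follows. The function names md_disk_failures and reset_giveups are made up for this example, and the patterns are based only on the message wording shown above, which may differ between kernel versions.

```shell
#!/bin/sh
# Print "<array> <member>" for every md disk-failure message on stdin,
# e.g. "md/raid1:md0: Disk failure on sda1, disabling device." -> "md0 sda1".
md_disk_failures() {
    sed -n 's|.*md/raid[0-9]*:\(md[0-9]*\): Disk failure on \([a-z0-9]*\),.*|\1 \2|p'
}

# Count the lines where libata gave up on resetting a port.
reset_giveups() {
    grep -c 'reset failed, giving up'
}

# Example calls on a live system:
#   md_disk_failures < /var/log/kern.log
#   reset_giveups < /var/log/kern.log
```

Applied to the log above, the first function would report md0 and sda1, which matches the array and member named in the monitoring alert.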
Of course, SMART monitoring using check_smart is active on this NAS, too. But this particular drive showed no symptoms (besides its growing age, obviously) that it would soon be dead: there were no pending, reallocated or uncorrectable sectors in the SMART attributes.
However, a manually launched long self-test confirmed it: the drive must be replaced.
root@nas:~# smartctl -t long /dev/sda
... obviously the self-test took a while ...
root@nas:~# smartctl -l selftest /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.9.0-12-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        70%       63791       1303241104
# 2  Extended offline    Completed without error        00%       45685       -
# 3  Extended offline    Completed without error        00%       39967       -
# 4  Extended offline    Aborted by host                70%       39961       -
# 5  Extended offline    Completed without error        00%       20821       -
# 6  Extended offline    Completed without error        00%        4136       -
# 7  Extended offline    Completed without error        00%         943       -
# 8  Short offline       Completed without error        00%         836       -
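The failed entry in such a log can also be picked out automatically. The following is a minimal sketch, assuming that failed tests are reported with a status starting with "Completed: " (as in "Completed: read failure" above), while clean runs read "Completed without error". The function name selftest_failures is made up for this example.

```shell
#!/bin/sh
# Print "<power-on hours> <LBA of first error>" for every failed self-test
# entry found in "smartctl -l selftest" output on stdin.
selftest_failures() {
    awk '/^# *[0-9]+/ && /Completed: / { print $(NF-1), $NF }'
}

# Example call on a live system:
#   smartctl -l selftest /dev/sda | selftest_failures
```

The sketch relies on the hours and the LBA being the last two columns of each entry, which holds for the log shown here.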
Now that the self-test log contains errors, check_smart is able to identify a problem with the drive:
root@nas:~# /usr/lib/nagios/plugins/check_smart -d /dev/sda -i ata -s
WARNING: Drive WDC WD20EARX-00PASB0 S/N WD-WCXXXXXXXXXX: Self-test log contains errors|Raw_Read_Error_Rate=1 Spin_Up_Time=5391 Start_Stop_Count=162 Reallocated_Sector_Ct=0 Seek_Error_Rate=145 Power_On_Hours=63810 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=162 Power-Off_Retract_Count=42 Load_Cycle_Count=127440 Temperature_Celsius=30 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=36
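Everything after the pipe character in the plugin output is performance data in label=value form, which makes individual SMART attributes easy to extract in a script. A minimal sketch, assuming plain numeric values without units or thresholds as in the output above. The function name perfdata_value is made up for this example.

```shell
#!/bin/sh
# Print the value of one performance-data label from plugin output on stdin.
# $1 = label to look up, e.g. Power_On_Hours
perfdata_value() {
    sed 's/^[^|]*|//' |                  # drop the status text before the pipe
        tr ' ' '\n' |                    # one label=value pair per line
        awk -F= -v key="$1" '$1 == key { print $2 }'
}

# Example call on a live system:
#   /usr/lib/nagios/plugins/check_smart -d /dev/sda -i ata -s | perfdata_value Power_On_Hours
```

Values that carry units or warning and critical thresholds (value;warn;crit) would need additional parsing.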
The defective WDC 2 TB drive has been replaced with a newer Toshiba PC P300 3 TB drive, following this step-by-step guide on how to replace a defective drive. Once the second 2 TB WDC drive fails, it will be replaced with another Toshiba 3 TB drive and the RAID array will be extended. See the article "Replace hard or solid state drive with a bigger one and grow software raid" for more information.