In my last post I announced a new release of the check_smart monitoring plugin, which now checks additional SMART attributes (not just Current_Pending_Sector). As soon as I rolled out the new version to the servers, it immediately alerted me to a failing SSD:
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: SAMSUNG SSD PM810 2.5" 7mm 128GB
Serial Number: XXXXXXXXXXXXXX
LU WWN Device Id: 5 0000f0 000000000
Firmware Version: AXM08D1Q
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS, ATA/ATAPI-7 T13/1532D revision 1
SATA Version is: SATA 2.6, 3.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Jun 6 20:32:23 2019 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
[...]
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 099 099 --- Pre-fail Always - 16
9 Power_On_Hours 0x0032 095 095 --- Old_age Always - 25108
12 Power_Cycle_Count 0x0032 099 099 --- Old_age Always - 890
175 Program_Fail_Count_Chip 0x0032 099 099 --- Old_age Always - 11
176 Erase_Fail_Count_Chip 0x0032 100 100 --- Old_age Always - 0
177 Wear_Leveling_Count 0x0013 075 075 --- Pre-fail Always - 877
178 Used_Rsvd_Blk_Cnt_Chip 0x0013 080 080 --- Pre-fail Always - 396
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 082 082 --- Pre-fail Always - 722
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013 082 082 --- Pre-fail Always - 3310
181 Program_Fail_Cnt_Total 0x0032 099 099 --- Old_age Always - 16
182 Erase_Fail_Count_Total 0x0032 100 100 --- Old_age Always - 0
183 Runtime_Bad_Block 0x0013 099 099 --- Pre-fail Always - 16
187 Uncorrectable_Error_Cnt 0x0032 067 067 --- Old_age Always - 33281
195 ECC_Error_Rate 0x001a 001 001 --- Old_age Always - 33281
198 Offline_Uncorrectable 0x0030 100 100 --- Old_age Offline - 0
199 CRC_Error_Count 0x003e 253 253 --- Old_age Always - 0
232 Available_Reservd_Space 0x0013 080 080 --- Pre-fail Always - 1620
241 Total_LBAs_Written 0x0032 037 037 --- Old_age Always - 2708399316
242 Total_LBAs_Read 0x0032 035 035 --- Old_age Always - 2781759092
16 already reallocated sectors (which on this drive are also counted in Program_Fail_Cnt_Total and Runtime_Bad_Block) and more than 33'000 uncorrectable errors! The SSD is, however, quite "old": the drive has been running for more than 25'000 hours and it also belongs to an older solid state generation.
That drive is part of a software RAID-1, managed by mdadm, which is presented as a physical volume (PV) to the Logical Volume Manager (LVM):
# cat /proc/mdstat
[...]
md3 : active raid1 sdc1[1] sdb1[0]
124968256 blocks super 1.2 [2/2] [UU]
bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
# pvs | grep md3
/dev/md3 vgssd lvm2 a-- 119.18g 0
I figured this was a great moment to grow that raid by replacing both 128GB drives with two newer 224GB drives.
To keep the data, I first removed the drive /dev/sdc (the one with the huge amount of errors in the SMART table above), following a step by step guide I wrote a while ago (Some notes on how to replace a HDD in software raid).
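In short, the removal boils down to marking the failing partition as failed in the array and then removing it. Condensed from that guide, the commands looked roughly like this (output omitted):
# mdadm --manage /dev/md3 --fail /dev/sdc1
# mdadm --manage /dev/md3 --remove /dev/sdc1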
After I physically replaced the drive, I did one step differently than in the mentioned guide: instead of copying the partition table from the remaining drive (/dev/sdb), I manually created a new partition filling up the whole drive:
# fdisk /dev/sdc
Welcome to fdisk (util-linux 2.29.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.
Command (m for help): n
Partition type
p primary (0 primary, 0 extended, 4 free)
e extended (container for logical partitions)
Select (default p): p
Partition number (1-4, default 1):
First sector (2048-468877311, default 2048):
Last sector, +sectors or +size{K,M,G,T,P} (2048-468877311, default 468877311):
Created a new partition 1 of type 'Linux' and of size 223.6 GiB.
Command (m for help): t
Selected partition 1
Partition type (type L to list all types): da
Changed type of partition 'Linux' to 'Non-FS data'.
Command (m for help): w
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
Note: I set the partition type to "Non-FS data" (da), because the partition type previously used for software raids, fd (Linux raid autodetect), is now deprecated.
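For reference, the same single full-size partition of type da could also be created non-interactively with sfdisk. This is just a sketch of an alternative, not the command I actually ran:
# echo ',,da' | sfdisk /dev/sdc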
Then I added the new drive /dev/sdc into the still existing raid-1 (md3):
# mdadm /dev/md3 -a /dev/sdc1
mdadm: added /dev/sdc1
Of course this raid device now needs to rebuild:
# cat /proc/mdstat
[...]
md3 : active raid1 sdb1[3] sdc1[2]
124968256 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.6% (801792/124968256) finish=10.3min speed=200448K/sec
bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
I waited until the raid was rebuilt. At that moment the raid itself of course still runs with the old size, because the remaining drive (/dev/sdb) is still a 128GB drive.
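The rebuild progress can be followed comfortably with a simple watch on mdstat, for example:
# watch -n 5 cat /proc/mdstat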
Now it's time to replace the second drive (/dev/sdb). I did the exact same steps as before with /dev/sdc, following the mdadm drive replacement guide, but again with the bigger partition.
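Condensed into commands (output omitted), this second round looked roughly like this: fail and remove the old /dev/sdb1, physically swap the drive, partition it with fdisk as shown above, then add it back to the raid:
# mdadm --manage /dev/md3 --fail /dev/sdb1
# mdadm --manage /dev/md3 --remove /dev/sdb1
(physically replace the drive and create the new, bigger partition as above)
# mdadm /dev/md3 -a /dev/sdb1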
Once the raid was rebuilt (once more), it now runs on two 224GB drives, yet it is still limited to the old 128GB size. Growing/expanding a raid device is actually very easy:
# mdadm --grow /dev/md3 --size=max
mdadm: component size of /dev/md3 has been set to 234372096K
This triggers a resync on the raid device:
# cat /proc/mdstat
[...]
md3 : active raid1 sdc1[2] sdb1[3]
234372096 blocks super 1.2 [2/2] [UU]
[===============>.....] resync = 78.2% (183370304/234372096) finish=6.3min speed=134554K/sec
bitmap: 1/1 pages [4KB], 131072KB chunk
unused devices: <none>
Note the larger size (now 234372096 blocks) shown behind the current sync status percentage.
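The new component size can also be double-checked after the resync, for example with mdadm itself:
# mdadm --detail /dev/md3 | grep 'Array Size'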
Once the resync was completed, the PV could be increased:
# pvresize /dev/md3
Physical volume "/dev/md3" changed
1 physical volume(s) resized / 0 physical volume(s) not resized
Voilà, due to the grown PV, the volume group (VG) now has more space available:
# pvs | grep md3
/dev/md3 vgssd lvm2 a-- 223.51g 104.34g
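Not part of these steps, but as a possible follow-up: the newly freed space in vgssd could now be handed to an existing logical volume and its filesystem, for example with lvextend (the LV name "data" is just a placeholder, adjust to the actual LV):
# lvextend -r -l +100%FREE /dev/vgssd/data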
Cristian wrote on Oct 16th, 2019:
Thanks,
works like a charm