
check_smart.pl saves server lives (defective hard drive detected)
Tuesday - Nov 5th 2013

Wow! Who would have thought that the new version of check_smart.pl (see the previous article "check_smart.pl adapted to support cciss and handle grown defect list" for details) would become a life saver for old servers!

Usually I monitor all server hardware through ILO with the Nagios plugin check_ilo2_health.pl. Unfortunately, hard drive monitoring was only added in newer ILO3 firmware versions. Therefore all servers still running ILO2 (e.g. ProLiant G5 servers) are somewhat in the grey zone when it comes to hardware monitoring.

When check_smart.pl is used correctly, it can be a life saver. The following screenshot speaks for itself, doesn't it?

[Screenshot: check_smart detected disk failure on FreeBSD]

Phew... That was close!
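For reference, here is a minimal sketch of how such a check might be wired up on a FreeBSD host with a Smart Array (cciss) controller. The flag names (-d for the device node, -i for the smartctl interface type) follow the plugin's documented usage, but the command name and arguments below are illustrative assumptions; verify against the plugin's --help output for your version:

```
# Manually check physical drive 0 behind the cciss controller
# (assumption: flag semantics as described above):
./check_smart.pl -d /dev/ciss0 -i cciss,0

# A matching Nagios command definition might look like this,
# with $ARG1$ selecting the physical drive number:
define command {
    command_name  check_smart_cciss
    command_line  $USER1$/check_smart.pl -d /dev/ciss0 -i cciss,$ARG1$
}
```

One service per physical drive (rather than one per logical drive) is what makes this setup catch a dying member disk before the RAID controller flags it.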

Update November 6th 2013
As soon as I removed disk #4 from the chassis, the server/RAID controller finally detected a disk as failed. Disk #1 started to blink red (before that, all of the server's LEDs were green and ILO showed the server health as OK, too). The following entries appeared in dmesg:

ciss0: *** Drive failure imminent, Port=1I Box=1 Bay=1
ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=4
ciss0: *** Physical drive failure, Port=1I Box=1 Bay=4
ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status OK->interim recovery, spare status 0x0
ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=4
ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status interim recovery->ready for recovery, spare status 0x0
ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status ready for recovery->recovering, spare status 0x0

Then it was disk #1's turn to be replaced:

ciss0: *** Hot-plug drive removed, Port=1I Box=1 Bay=1
ciss0: *** Physical drive failure, Port=1I Box=1 Bay=1
ciss0: *** State change, logical drive 0
ciss0: logical drive 0 (da0) changed status OK->interim recovery, spare status 0x0
ciss0: *** Hot-plug drive inserted, Port=1I Box=1 Bay=1
ciss0: *** State change, logical drive 0
ciss0: logical drive 0 (da0) changed status interim recovery->ready for recovery, spare status 0x0

I also saw that the RAID recovery on logical drive 0 (physical drives 1+2) had not started yet, because the recovery of logical drive 1 (physical drives 3+4) was still running. It seems the RAID controller can only rebuild one logical drive at a time. As soon as the first recovery finished, the second started immediately:

ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status recovering->OK, spare status 0x0
ciss0: *** State change, logical drive 0
ciss0: logical drive 0 (da0) changed status ready for recovery->recovering, spare status 0x0
ciss0: *** State change, logical drive 0
ciss0: logical drive 0 (da0) changed status recovering->OK, spare status 0x0
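Status transitions like the ones above are easy to pull out of the kernel log mechanically, e.g. for a quick ad-hoc check or a cron job that alerts on anything leaving the OK state. A small sketch (the sample here-document stands in for real `dmesg` output, which is an assumption; the sed pattern matches the ciss(4) message format shown above):

```shell
#!/bin/sh
# Extract logical-drive status transitions from ciss(4) kernel messages.
# In real use, replace the here-document with: dmesg | ...
dmesg_sample=$(cat <<'EOF'
ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status OK->interim recovery, spare status 0x0
ciss0: *** State change, logical drive 1
ciss0: logical drive 1 (da1) changed status recovering->OK, spare status 0x0
EOF
)

# Print "drive N (daN): old->new" for every transition line;
# the "*** State change" marker lines carry no extra detail and are skipped.
printf '%s\n' "$dmesg_sample" |
  sed -n 's/^ciss0: logical drive \([0-9]*\) (\(da[0-9]*\)) changed status \(.*\), spare status.*/drive \1 (\2): \3/p'
```

This prints one line per transition (e.g. `drive 1 (da1): OK->interim recovery`), which a wrapper could grep for anything other than `->OK`.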

 
