Monitor dual storage (raid) controller on IBM x3650 M4

Written by - 0 comments

Published on - Listed in Hardware Linux Monitoring


I wanted to monitor the current RAID status on an IBM x3650 M4 server, simply by using check_raid. I've been using this plugin for years and it supports most software and hardware raid controllers. I've never had any problems with it (once I installed the required cli tools for each hardware controller) - until today.

Due to a very strange hardware setup, inherited from an ex-colleague, the server turns out to have two different RAID controllers active. 12 physical drives are attached to one controller, 2 physical drives to another.

Once I installed the megacli command (from http://hwraid.le-vert.net/), the plugin correctly identified the physical drives behind /dev/sda:

# /usr/lib/nagios/plugins/check_raid -l
megacli
1 active plugins

# /usr/lib/nagios/plugins/check_raid
WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

To disable the warning on the disabled WriteCache:

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]

But where are the other two physical drives? From my experience with hardware raid controllers I was pretty sure that megacli is able to detect multiple controllers and is able to retrieve the drive information from all controllers.
A manual verification using megacli still only returned 12 drives:

# megacli -CfgDsply -aall |grep Physical
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512
Physical Disk Information:
Physical Disk: 0
Physical Sector Size:  512
Physical Disk: 1
Physical Sector Size:  512
Physical Disk: 2
Physical Sector Size:  512
Physical Disk: 3
Physical Sector Size:  512
Physical Disk: 4
Physical Sector Size:  512
Physical Disk: 5
Physical Sector Size:  512

Thankfully a colleague, who recently was working on that particular server, made a screenshot of the storage controller menu during the boot process:

As it turns out, there are two different storage controllers built into that server. One is a MegaRaid controller (ServeRAID M5210) and one is a MPT controller:

# lspci | grep -i LSI
0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
14:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)

No wonder megacli wasn't able to find the drives!

I tried again with "mpt-status" (http://hwraid.le-vert.net/wiki/LSIFusionMPT), but this didn't show any config:

# apt-get install mpt-status

/usr/sbin/mpt-status -p
Checking for SCSI ID:0
ioctl: No such device

I removed mpt-status again and went on to try the command "sas2ircu" for newer MPT cards. Finally I got some output:

# apt-get install sas2ircu

# sas2ircu LIST
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.


         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2004     1000h    70h   00h:0ah:00h:00h      1014h   040eh
SAS2IRCU: Utility Completed Successfully.

And, hurray, check_raid was now able to read the infos from both controllers:

# /usr/lib/nagios/plugins/check_raid -l
megacli
sas2ircu
2 active plugins

# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 2 Drives: Optimal (OPT)::]

Update November 15th 2018:

This article helped me again today when the check_raid plugin alarmed of a failed drive:

# sas2ircu-status
-- Controller informations --
-- ID | Model
c0 | SAS2004

-- Arrays informations --
-- ID | Type | Size | Status
c0u0 | RAID1 | 1906G | Degraded (DGD)

-- Disks informations
-- ID | Model | Status
c0u0p0 | WD2000FYYZ-23UL (WDWMC1P051XXXX) | Failed (FLD)
c0u0p1 | WD2000FYYZ-23UL (WDWMC1P0D3XXXX) | Optimal (OPT)

There is at least one disk/array in a NOT OPTIMAL state.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.