Replace hard drive in Solaris 10 on IBM x3650 with arcconf and zpool

Solaris and hard drive maintenance tasks... oh joy! I've already written articles about this (Solaris: Add a new hard drive to existing zfs pool (with hpacucli) and Solaris: Replace defect HDD with hpacucli and zpool). Both articles were based on Solaris running on HP ProLiant servers.

Now I had to replace a defective hard drive on an IBM x3650 server running Solaris 10. Different hardware, different story.

First of all: the IBM server and its RSA II did not detect the failed disk. The defective disk was instead reported by check_zpools.sh, the Nagios plugin that monitors the health and usage of ZFS pools.
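For a quick manual check, zpool status -x is handy too; it only lists pools that have a problem and prints "all pools are healthy" when everything is fine:

(solaris91 ) 0 # zpool status -x
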
zpool status showed the following output:

(solaris91 ) 0 # zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         DEGRADED     0     0     0
          mirror      DEGRADED     0     0     0
            c0t0d0s0  ONLINE       0     0     0
            c0t1d0s0  UNAVAIL      0     0     0  cannot open

errors: No known data errors

  pool: zonepool
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        zonepool      ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t2d0s0  ONLINE       0     0     0
            c0t3d0s0  ONLINE       0     0     0

On the HP servers, the physical disks had to be replaced and then activated with hpacucli, HP's command line utility for its RAID controllers. Just as HP has hpacucli, Adaptec provides arcconf for its RAID controllers.
arcconf can be downloaded from the Adaptec website. I downloaded and installed (well, unzipped) arcconf v1_2_20532 from http://www.adaptec.com/en-us/speed/raid/storage_manager/arcconf_v1_2_20532_zip.htm.

(solaris91 ) 0 # unzip arcconf_v1_2_20532.zip
(solaris91 ) 0 # cd solaris_x86
(solaris91 ) 0 # chmod 700 arcconf

That's it. You can launch arcconf as an executable directly from the unzipped folder.
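
As a quick sanity check that arcconf actually sees the controller, the adapter information can be queried (the trailing 1 is the controller number, used in all the commands below):

(solaris91 ) 0 # ./arcconf getconfig 1 ad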

Before doing anything else, the RAID controller needs to be rescanned:

(solaris91 ) 0 # ./arcconf rescan 1
Controllers found: 1
 Rescan started in the background and can take upto 10 mins to complete.
Command completed successfully.
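
The rescan runs in the background; whether it (or any other task on the controller) is still busy can be checked with getstatus:

(solaris91 ) 0 # ./arcconf getstatus 1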

Then the status of the physical disks can be displayed (I've cut unnecessary information from the output):

(solaris91 ) 0 # ./arcconf getconfig 1 pd
Controllers found: 1
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
      Device #0
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,0(0:0)
         Reported Location                  : Enclosure 0, Slot 0
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : ST973451SS
         Total Size                         : 70006 MB
      Device #1
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,2(2:0)
         Reported Location                  : Enclosure 0, Slot 2
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : CBRBA146C3ETS0 N
         Total Size                         : 140013 MB
      Device #2
         Device is a Hard drive
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,3(3:0)
         Reported Location                  : Enclosure 0, Slot 3
         Reported ESD(T:L)                  : 2,0(0:0)
         Vendor                             : IBM-ESXS
         Model                              : CBRBA146C3ETS0 N
         Total Size                         : 140013 MB
      Device #3
         Device is an Enclosure services device
         Reported Channel,Device(T:L)       : 2,0(0:0)
         Enclosure ID                       : 0
         Type                               : SES2
         Vendor                             : IBM-ESXS
         Model                              : VSC7160
         Firmware                           : 1.07
         Status of Enclosure services device
            Speaker status                  : Not available

Well, interesting: only three disks (plus the enclosure) show up in the output. The defective disk seems to be missing entirely (note the 'Reported Channel,Device' rows).

So far I have the following information: the defective disk is in zpool "rpool" and its size is 70GB. The problem: there are two disks of that size, and since the server did not flag the failed disk as failed, no LED points me to the bad one.
Well, arcconf can help here, too: I can identify the working disk by letting its LED blink:

(solaris91 ) 0 # ./arcconf identify 1 device 0 0
Controllers found: 1
Only devices managed by an enclosure processor may be identified
The specified device is blinking.
Press any key to stop the blinking.

It was easy to detect the failed disk once the working one was blinking:

[Image: IBM x3650 arcconf identify disk]

Once I replaced the failed disk, I relaunched arcconf to see the current state of the disks (once again I removed unnecessary information):

(solaris91 ) 0 # ./arcconf getconfig 1 pd 
Controllers found: 1
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
      Device #0
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,0(0:0)
      Device #1
         State                              : Ready
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,1(1:0)
      Device #2
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,2(2:0)
      Device #3
         State                              : Online
         Supported                          : Yes
         Transfer Speed                     : SAS 3.0 Gb/s
         Reported Channel,Device(T:L)       : 0,3(3:0)

So finally all four disks are detected. But the new disk's state is Ready, not Online like the others. To bring the disk online, a logical drive (simple volume) needs to be created on it. Remember that ZFS handles the RAID on this server, not the hardware RAID controller. The arcconf help shows how to do it:

(solaris91 ) 1 # ./arcconf create --help
 Usage: CREATE LOGICALDRIVE [Options] [Channel# ID#] ... [noprompt] [nologs]

So what I need to do is use CREATE with the controller number (1), the LOGICALDRIVE keyword, the size (max), the RAID type (volume) and the Channel,Device ID, which can be seen in the "pd" output above (0,1):

(solaris91 ) 3 # ./arcconf create 1 logicaldrive max volume 0,1
Controllers found: 1

Do you want to add a logical device to the configuration?
Press y, then ENTER to continue or press ENTER to abort: y

Creating logical device: LogicalDrv 1

Command completed successfully.

The created logical device can be verified:

(solaris91 ) 0 # ./arcconf getconfig 1 ld
Controllers found: 1
----------------------------------------------------------------------
Logical device information
----------------------------------------------------------------------
Logical device number 0
   Logical device name                      : disk0
   RAID level                               : Simple_volume
   Status of logical device                 : Optimal
   Size                                     : 69890 MB

Logical device number 1
   Logical device name                      : LogicalDrv 1
   RAID level                               : Simple_volume
   Status of logical device                 : Optimal
   Size                                     : 69889 MB

Logical device number 2
   Logical device name                      : disk2
   RAID level                               : Simple_volume
   Status of logical device                 : Optimal
   Size                                     : 139890 MB

Logical device number 3
   Logical device name                      : disk3
   RAID level                               : Simple_volume
   Status of logical device                 : Optimal
   Size                                     : 139890 MB

Well... the logical device was created (LogicalDrv 1), but its name does not match the others (disk0, disk2, disk3). No problem, arcconf can rename the logical device:

(solaris91 ) 1 # ./arcconf setname 1 logicaldrive 1 disk1
Controllers found: 1
Command completed successfully.
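
The new name can be verified by listing the logical devices again with the same getconfig command as above; "LogicalDrv 1" should now show up as disk1:

(solaris91 ) 0 # ./arcconf getconfig 1 ld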

Let's check the state of the physical disks again:

(solaris91 ) 0 # ./arcconf getconfig 1 pd
Controllers found: 1
----------------------------------------------------------------------
Physical Device information
----------------------------------------------------------------------
      Device #0
         State                              : Online

      Device #1
         State                              : Online

      Device #2
         State                              : Online

      Device #3
         State                              : Online

Now we come to the Solaris/ZFS part: replacing the physical disk within the operating system. First the new disk needs to be formatted for Solaris:

(solaris91 ) 0 # format
Searching for disks...done


AVAILABLE DISK SELECTIONS:
       0. c0t0d0
          /pci@0,0/pci8086,25e3@3/pci1014,9580@0/sd@0,0
       1. c0t1d0
          /pci@0,0/pci8086,25e3@3/pci1014,9580@0/sd@1,0
       2. c0t2d0
          /pci@0,0/pci8086,25e3@3/pci1014,9580@0/sd@2,0
       3. c0t3d0
          /pci@0,0/pci8086,25e3@3/pci1014,9580@0/sd@3,0
Specify disk (enter its number): 1
selecting c0t1d0
[disk formatted]


FORMAT MENU:
        disk       - select a disk
        type       - select (define) a disk type
        partition  - select (define) a partition table
        current    - describe the current disk
        format     - format and analyze the disk
        fdisk      - run the fdisk program
        repair     - repair a defective sector
        label      - write label to the disk
        analyze    - surface analysis
        defect     - defect list management
        backup     - search for backup labels
        verify     - read and display labels
        save       - save new disk/partition definitions
        inquiry    - show vendor, product and revision
        volname    - set 8-character volume name
        !<cmd>     - execute <cmd>, then return
        quit
format> fdisk
No fdisk table exists. The default partition for the disk is:

  a 100% "SOLARIS System" partition

Type "y" to accept the default partition,  otherwise type "n" to edit the
 partition table.
y
format> quit
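
By the way, the same default "100% SOLARIS System" fdisk partition can also be written non-interactively on Solaris x86 (a shortcut I did not use here):

(solaris91 ) 0 # fdisk -B /dev/rdsk/c0t1d0p0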

We could also use format to create the partition table (slices) of the new disk, but in this case it is much easier to copy the VTOC (Volume Table of Contents) from the existing mirror disk (c0t0d0):

(solaris91 ) 0 # prtvtoc /dev/rdsk/c0t0d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
fmthard:  New volume table of contents now in place.
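
To double-check that the label really made it onto the new disk, its VTOC can be printed and compared with the one of the source disk:

(solaris91 ) 0 # prtvtoc /dev/rdsk/c0t1d0s2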

Now that the physical disk has been replaced, formatted and partitioned, we can replace it in the zpool:

(solaris91 ) 0 # zpool replace rpool c0t1d0s0

The zpool status output now shows the resilvering (= RAID resynchronization) of the mirror in rpool:

(solaris91 ) 0 # zpool status
  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 1.62% done, 0h5m to go
config:

        NAME                STATE     READ WRITE CKSUM
        rpool               DEGRADED     0     0     0
          mirror            DEGRADED     0     0     0
            c0t0d0s0        ONLINE       0     0     0
            replacing       DEGRADED     0     0     0
              c0t1d0s0/old  FAULTED      0     0     0  corrupted data
              c0t1d0s0      ONLINE       0     0     0
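
One note for x86 root pools like this one: to be able to boot from the new mirror half, the GRUB boot blocks should also be installed on the replacement disk once the resilver has finished (the stage paths are the standard Solaris 10 x86 locations):

(solaris91 ) 0 # installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t1d0s0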

And that's it. Oh joy!

