How to perform disk replacement (software raid 1) in Linux (mdadm replace failed drive)

With hardware RAID 1, RAID 5, etc. a hot HDD swap is easy because the mirroring is handled at the hardware level, but doing the same on a software RAID 1 is trickier, since ideally an OS shutdown would be needed to avoid any application impact during the HDD swap.
In this article I will show you the steps to perform an online HDD swap when one of your disk drives is broken.
The hpssacli RPM can be downloaded from the HPE webpage, so for the sake of this article I will assume you have already downloaded and installed it on your blade.
NOTE: hpssacli was recently renamed to ssacli due to the HPE rebranding and company split. Since I had an older version of hpssacli installed, the commands below use 'hpssacli', but the same commands work with 'ssacli'.
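If you are not sure which of the two names is available on your system, a small sketch like the one below picks whichever binary exists (the SSACLI variable name is only illustrative):

# Pick whichever CLI name is installed; SSACLI is just an illustrative variable name
if command -v ssacli >/dev/null 2>&1; then
    SSACLI=ssacli
else
    SSACLI=hpssacli
fi
$SSACLI ctrl all show status
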
My setup:

  1. HP Proliant BL460c Gen9
  2. Two Internal Disks each 900 GB
  3. Hardware RAID 0 is configured with two arrays (each having one disk)
  4. Software RAID 1 is configured on top of these arrays
Correct Disk Maps

Normally the HDD to logical drive mapping looks like this:
Array A -> Logical Drive 1 (/dev/sda) -> Bay 1
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 2
Still, it is good to validate the mapping before starting with the disk swap, to make sure the correct disk is replaced.

# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'
   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda          Mount Points: None
         Bay: 1
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb          Mount Points: None
         Bay: 2

Reverse Disk Maps
Array A -> Logical Drive 1 (/dev/sda) -> Bay 2
Array B -> Logical Drive 2 (/dev/sdb) -> Bay 1
Here the output would look like below:

# hpssacli ctrl slot=0 show config detail | grep -E 'Array:|Logical Drive:|Bay:|Disk'
   Array: A
      Logical Drive: 1
         Disk Name: /dev/sda          Mount Points: None
         Bay: 2
   Array: B
      Logical Drive: 2
         Disk Name: /dev/sdb          Mount Points: None
         Bay: 1

 

How to check if my disk is faulty?

There are multiple logs where enough evidence can be collected to get more details on the faulty disk.
The iLO event log should contain a message like the one below.
Right Disk:
Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
Left Disk:
Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
The OS syslog should contain the messages below (assuming the hp-ams tool is installed, as it reports all the hardware related alarms).
Right Disk:

Aug 27 07:27:31 mylinux hp-ams[12332]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)

 
Left Disk:

Aug 27 21:36:29 mylinux hp-ams[12854]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
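
To search for these alarms quickly, a simple grep against the syslog is enough. The example below assumes the log is written to /var/log/messages; on some distributions it is /var/log/syslog instead:

# grep -i 'hp-ams.*Device Failure' /var/log/messages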

One can also check the logical drive status using the command below.
Logical Drive 1 Failed Status

my-linux-box: # hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): Failed
   logicaldrive 2 (838.3 GB, 0): OK

Logical Drive 2 Failed Status

my-linux-box: # hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): Failed
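
In addition to the logical drive status, the same tool can report the physical drive status, which is another quick way to spot the broken disk (the bay numbers follow the mapping validated earlier):

my-linux-box: # hpssacli ctrl slot=0 pd all show status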

 

Logical Drive 1 (/dev/sda) Replacement

Check the RAID status
Next, re-validate the RAID status; the failed /dev/sda partitions are flagged with (F).

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[0](F) sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk
md0 : active raid1 sda5[0](F) sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sda7[0](F) sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sda6[0](F) sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
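
In this example every /dev/sda member is already flagged as failed (F), so it can be removed directly. If a member were still reported as active, it would first have to be marked as failed before removal; a minimal sketch for md0:

my-linux-box:~ # mdadm /dev/md0 --fail /dev/sda5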

Now remove the failed RAID partitions.

my-linux-box:~ # mdadm /dev/md0 --remove /dev/sda5
mdadm: hot removed /dev/sda5 from /dev/md0
my-linux-box:~ # mdadm /dev/md1 --remove /dev/sda6
mdadm: hot removed /dev/sda6 from /dev/md1
my-linux-box:~ # mdadm /dev/md3 --remove /dev/sda7
mdadm: hot removed /dev/sda7 from /dev/md3
my-linux-box:~ # mdadm /dev/md2 --remove /dev/sda8
mdadm: hot removed /dev/sda8 from /dev/md2

Next, check the RAID status to validate that all the failed partitions have been removed.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 3/7 pages [12KB], 65536KB chunk
md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Replace the failed disk with the new one; the syslog should then contain a message similar to the one below.

Aug 18 15:53:12 my-linux-box kernel: [ 8365.422069] hpsa 0000:03:00.0: added scsi 0:2:0:0: Direct-Access     HP       EG0900FBVFQ      RAID-UNKNOWN SSDSmartPathCap- En- Exp=2 qd=30
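
To confirm that the new disk was detected, you can follow the kernel log while swapping the drive; the command below is one option (dmesg -w needs a reasonably recent util-linux, otherwise a plain dmesg | tail works as well):

# dmesg -w | grep -i hpsa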

Re-enable the logical drive using hpssacli
After re-enabling the logical drive, verify its status, which should return "OK".

my-linux-box: # hpssacli ctrl slot=0 ld 1 modify reenable forced
my-linux-box:# hpssacli ctrl slot=0 ld all show status
   logicaldrive 1 (838.3 GB, 0): OK
   logicaldrive 2 (838.3 GB, 0): OK

The /dev/sda partitions are now missing from the RAID arrays, as expected.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sdb8[1]
      870112064 blocks super 1.0 [2/1] [_U]
      bitmap: 5/7 pages [20KB], 65536KB chunk
md0 : active raid1 sdb5[1]
      529600 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sdb7[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sdb6[1]
      4200640 blocks super 1.0 [2/1] [_U]
      bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>

Now copy the partition table from sdb to sda.

my-linux-box:~ # sfdisk -d /dev/sdb | grep -v ten | sfdisk /dev/sda --force --no-reread
Checking that no-one is using this disk right now ...
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
OK
Disk /dev/sda: 109437 cylinders, 255 heads, 63 sectors/track
Warning: extended partition does not start at a cylinder boundary.
DOS and Linux will interpret the contents differently.
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0
   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1   *      0+ 109437- 109438- 879054336    f  W95 Ext'd (LBA)
/dev/sda2          0       -       0          0    0  Empty
/dev/sda3          0       -       0          0    0  Empty
/dev/sda4          0       -       0          0    0  Empty
/dev/sda5          0+     66-     66-    529664   fd  Linux raid autodetect
/dev/sda6         66+    588-    523-   4200704   fd  Linux raid autodetect
/dev/sda7        589+   1111-    523-   4200704   fd  Linux raid autodetect
/dev/sda8       1112+ 109435- 108324- 870112256   fd  Linux raid autodetect
New situation:
Units = sectors of 512 bytes, counting from 0
   Device Boot    Start       End   #sectors  Id  System
/dev/sda1   *       512 1758109183 1758108672   f  W95 Ext'd (LBA)
/dev/sda2             0         -          0   0  Empty
/dev/sda3             0         -          0   0  Empty
/dev/sda4             0         -          0   0  Empty
/dev/sda5          1024   1060351    1059328  fd  Linux raid autodetect
/dev/sda6       1060864   9462271    8401408  fd  Linux raid autodetect
/dev/sda7       9462784  17864191    8401408  fd  Linux raid autodetect
/dev/sda8      17864704 1758089215 1740224512  fd  Linux raid autodetect
Warning: partition 1 does not end at a cylinder boundary
Successfully wrote the new partition table
Re-reading the partition table ...
If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
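
Before touching the RAID arrays it is worth double-checking that the partition layout really was copied; listing both disks and comparing them by eye is enough. If the kernel has not picked up the new table, partprobe /dev/sda (from the parted package) can force a re-read.

my-linux-box:~ # sfdisk -l /dev/sda
my-linux-box:~ # sfdisk -l /dev/sdb
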
Erase possible RAID config data (from a reused disk)

 

After doing this, it is important to remove any remaining old software RAID metadata from the newly attached disk before it is re-added to the RAID arrays.

my-linux-box:~ # mdadm --zero-superblock /dev/sda5
my-linux-box:~ # mdadm --zero-superblock /dev/sda6
my-linux-box:~ # mdadm --zero-superblock /dev/sda7
my-linux-box:~ # mdadm --zero-superblock /dev/sda8
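
As a check, mdadm --examine should now report that no md superblock is detected on each of these partitions, for example:

my-linux-box:~ # mdadm --examine /dev/sda5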

Afterwards the partitions can be added back to the software RAID arrays.

my-linux-box:~ # mdadm /dev/md0 --add /dev/sda5
mdadm: added /dev/sda5
my-linux-box:~ # mdadm /dev/md1 --add /dev/sda6
mdadm: added /dev/sda6
my-linux-box:~ # mdadm /dev/md3 --add /dev/sda7
mdadm: added /dev/sda7
my-linux-box:~ # mdadm /dev/md2 --add /dev/sda8
mdadm: added /dev/sda8

NOTE: Add the RAID partitions one at a time; only add the next partition once the previously added one shows [UU] in /proc/mdstat.
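
A convenient way to follow the rebuild is to watch /proc/mdstat and wait for the array to show [UU] before adding the next partition, for example:

my-linux-box:~ # watch -n 5 cat /proc/mdstat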
 

How to install GRUB on the disk?

Once md0 has synchronised, GRUB should be installed again on both disks by calling the GRUB installer.
Finally, use the grub-install command, which should install GRUB on both disks (hd0 and hd1) without any error messages.

# grub-install
    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)
 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename. ]
grub> setup --stage2=/boot/grub/stage2 --force-lba (hd0) (hd0,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+17 p (hd0,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> setup --stage2=/boot/grub/stage2 --force-lba (hd1) (hd1,4)
 Checking if "/boot/grub/stage1" exists... yes
 Checking if "/boot/grub/stage2" exists... yes
 Checking if "/boot/grub/e2fs_stage1_5" exists... yes
 Running "embed /boot/grub/e2fs_stage1_5 (hd1)"...  17 sectors are embedded.
succeeded
 Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd1) (hd1)1+17 p (hd1,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
Done.
grub> quit
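
The transcript above is from GRUB Legacy (version 0.97), which this setup uses. On newer distributions that ship GRUB 2, the equivalent step would roughly be the following (a sketch only, assuming a BIOS/MBR layout like the one in this article; the binary is called grub2-install on RHEL/SLES and grub-install on Debian-based systems):

# grub2-install /dev/sda
# grub2-install /dev/sdb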

Finally, validate the RAID status.

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid1 sda8[2] sdb8[1]
      870112064 blocks super 1.0 [2/2] [UU]
      bitmap: 6/7 pages [24KB], 65536KB chunk
md0 : active raid1 sda5[2] sdb5[1]
      529600 blocks super 1.0 [2/2] [UU]
      bitmap: 1/1 pages [4KB], 65536KB chunk
md3 : active raid1 sda7[2] sdb7[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
md1 : active raid1 sda6[2] sdb6[1]
      4200640 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>

Similarly, the disk replacement can be performed for the second logical drive.
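
For reference, here is a condensed sketch of the same procedure for logical drive 2 (/dev/sdb). Every step mirrors the ones shown above and the same cautions apply: swap the physical disk only after removing the failed partitions, and wait for [UU] between additions.

my-linux-box:~ # mdadm /dev/md0 --remove /dev/sdb5
my-linux-box:~ # mdadm /dev/md1 --remove /dev/sdb6
my-linux-box:~ # mdadm /dev/md3 --remove /dev/sdb7
my-linux-box:~ # mdadm /dev/md2 --remove /dev/sdb8
my-linux-box:~ # hpssacli ctrl slot=0 ld 2 modify reenable forced
my-linux-box:~ # sfdisk -d /dev/sda | sfdisk /dev/sdb --force --no-reread
my-linux-box:~ # mdadm --zero-superblock /dev/sdb5    # repeat for sdb6, sdb7, sdb8
my-linux-box:~ # mdadm /dev/md0 --add /dev/sdb5       # then md1/sdb6, md3/sdb7, md2/sdb8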