    How to perform disk replacement (software raid 1) in Linux (mdadm replace failed drive)

    With hardware RAID 1, RAID 5, etc. one can easily do a hot HDD swap, since mirroring is supported at the hardware level. Doing the same on a software RAID 1 is trickier, because ideally an OS shutdown would be needed to avoid any application impact during the HDD swap.

    In this article I will show you the steps to perform an online HDD swap when one of your disk drives is broken.

    The hpssacli rpm can be downloaded from the HPE webpage, so for the sake of this article I will assume you have already downloaded and installed it on your blade.

    NOTE: hpssacli was recently renamed to ssacli as part of the HP/HPE rebranding and company split. Since I had an older version of hpssacli installed, the commands below use 'hpssacli', but the same commands can be used with 'ssacli'.
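
    If you are unsure which of the two binaries is present on your blade, a small check like the one below can pick whichever is installed. The SSA helper variable is only an illustration and not part of the original procedure.
    # Pick whichever HPE Smart Storage CLI is installed (ssacli on newer setups, hpssacli on older ones)
    if command -v ssacli >/dev/null 2>&1; then
        SSA=ssacli
    elif command -v hpssacli >/dev/null 2>&1; then
        SSA=hpssacli
    else
        echo "Neither ssacli nor hpssacli found; install the HPE Smart Storage CLI rpm first" >&2
    fi
    # Later commands can then be written as, for example: $SSA ctrl slot=0 show config detail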

    My setup:
    1. HP Proliant BL460c Gen9
    2. Two Internal Disks each 900 GB
    3. Hardware RAID 0 is configured with two arrays (each having one disk)
    4. Software RAID 1 is configured on top of these arrays
    Correct Disk Maps
    Normally the HDD to logical drive mapping is as shown below
    Array A -> Logical Drive 1 (/dev/sda) -> Bay 1
    Array B -> Logical Drive 2 (/dev/sdb) -> Bay 2

    But it is still good to validate the mapping before starting with the disk swap, to make sure the correct disk is replaced.
    # hpssacli ctrl slot=0  show config detail | grep 'Array:\|Logical Drive:\|Bay:\|Disk'
       Array: A
          Logical Drive: 1
             Disk Name: /dev/sda
             Mount Points: None
             Bay: 1

       Array: B
          Logical Drive: 2
             Disk Name: /dev/sdb
             Mount Points: None
             Bay: 2

    Reverse Disk Maps
    Array A -> Logical Drive 1 (/dev/sda) -> Bay 2
    Array B -> Logical Drive 2 (/dev/sdb) -> Bay 1

    Here the output would look like below
    # hpssacli ctrl slot=0  show config detail | grep 'Array:\|Logical Drive:\|Bay:\|Disk'
       Array: A
          Logical Drive: 1
             Disk Name: /dev/sda
             Mount Points: None
             Bay: 2

       Array: B
          Logical Drive: 2
             Disk Name: /dev/sdb
             Mount Points: None
             Bay: 1
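
    If you prefer to pull out just the device-to-bay pairs instead of scanning the full output, a small awk sketch like the one below can help. It assumes the 'Disk Name:' and 'Bay:' lines appear in the order shown above (one disk per array), so treat it as a convenience rather than official tool output.
    # Print a compact "/dev/sdX -> Bay N" mapping from the controller config detail
    hpssacli ctrl slot=0 show config detail \
        | awk '/Disk Name:/ {dev=$3} /Bay:/ {print dev " -> Bay " $2}'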

    How to check if my disk is faulty?

    There are multiple locations (logs) where enough evidence can be collected to identify the faulty disk.

    In the iLO logs, a message like the below would be available
    Right Disk:
    Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
    Left Disk:
    Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
    The OS syslog should contain the below messages (assuming the hp-ams tool is installed, as it reports all the hardware related alarms)

    Right Disk:
    Aug 27 07:27:31 mylinux hp-ams[12332]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 1, Box 1, Port 1I, Slot 0)
    Left Disk:
    Aug 27 21:36:29 mylinux hp-ams[12854]: CRITICAL: Internal Storage Enclosure Device Failure (Bay 2, Box 1, Port 1I, Slot 0)
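
    A quick way to scan for these alarms from the shell is sketched below; the log file path /var/log/messages is an assumption and may differ on your distribution.
    # Search the OS log for the hp-ams disk failure alarms
    grep -i 'hp-ams.*Device Failure' /var/log/messages
    # Kernel-side view of the hpsa controller events
    dmesg | grep -i hpsa
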
    One can also check the logical drive status using the below command

    Logical Drive 1 Failed Status
    my-linux-box: # hpssacli ctrl slot=0 ld all show status

       logicaldrive 1 (838.3 GB, 0): Failed
       logicaldrive 2 (838.3 GB, 0): OK

    Logical Drive 2 Failed Status
    my-linux-box: # hpssacli ctrl slot=0 ld all show status

       logicaldrive 1 (838.3 GB, 0): OK
       logicaldrive 2 (838.3 GB, 0): Failed


    Logical Drive 1 (/dev/sda) Replacement

    Check the raid status
    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid1 sda8[0](F) sdb8[1]
          870112064 blocks super 1.0 [2/1] [_U]
          bitmap: 1/7 pages [4KB], 65536KB chunk

    md0 : active raid1 sda5[0] sdb5[1]
          529600 blocks super 1.0 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md3 : active raid1 sda7[0] sdb7[1]
          4200640 blocks super 1.0 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md1 : active raid1 sda6[0] sdb6[1]
          4200640 blocks super 1.0 [2/2] [UU]
          bitmap: 1/1 pages [4KB], 65536KB chunk

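    If you need more detail on a specific array than /proc/mdstat shows, for example to confirm exactly which member is marked faulty, mdadm --detail can be used as an optional check:
    # mdadm --detail /dev/md2
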
    Manually mark the remaining partitions of the failed disk as faulty
    my-linux-box:~ # mdadm /dev/md0 --fail /dev/sda5
    mdadm: set /dev/sda5 faulty in /dev/md0

    my-linux-box:~ # mdadm /dev/md1 --fail /dev/sda6
    mdadm: set /dev/sda6 faulty in /dev/md1

    my-linux-box:~ # mdadm /dev/md3 --fail /dev/sda7
    mdadm: set /dev/sda7 faulty in /dev/md3

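    If you prefer, the same marking can be done in a single loop. This is only a sketch assuming the partition-to-array mapping shown above (sda5/md0, sda6/md1, sda7/md3); sda8 is already marked (F) by the kernel.
    for pair in md0:sda5 md1:sda6 md3:sda7; do
        md=${pair%%:*}; part=${pair##*:}
        mdadm /dev/$md --fail /dev/$part
    done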

    Next, re-validate the raid status
    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid1 sda8[0](F) sdb8[1]
          870112064 blocks super 1.0 [2/1] [_U]
          bitmap: 3/7 pages [12KB], 65536KB chunk

    md0 : active raid1 sda5[0](F) sdb5[1]
          529600 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    md3 : active raid1 sda7[0](F) sdb7[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md1 : active raid1 sda6[0](F) sdb6[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    unused devices: <none>

    Now remove the failed raid partitions
    my-linux-box:~ # mdadm /dev/md0 --remove /dev/sda5
    mdadm: hot removed /dev/sda5 from /dev/md0

    my-linux-box:~ # mdadm /dev/md1 --remove /dev/sda6
    mdadm: hot removed /dev/sda6 from /dev/md1

    my-linux-box:~ # mdadm /dev/md3  --remove /dev/sda7
    mdadm: hot removed /dev/sda7 from /dev/md3

    my-linux-box:~ # mdadm /dev/md2 --remove /dev/sda8
    mdadm: hot removed /dev/sda8 from /dev/md2

    Next, check the raid status to validate that all the failed partitions have been removed
    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid1 sdb8[1]
          870112064 blocks super 1.0 [2/1] [_U]
          bitmap: 3/7 pages [12KB], 65536KB chunk

    md0 : active raid1 sdb5[1]
          529600 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    md3 : active raid1 sdb7[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md1 : active raid1 sdb6[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    unused devices: <none>

    Replace the failed disk with the new one; the syslog should contain a message similar to the below
    Aug 18 15:53:12 my-linux-box kernel: [ 8365.422069] hpsa 0000:03:00.0: added scsi 0:2:0:0: Direct-Access     HP       EG0900FBVFQ      RAID-UNKNOWN SSDSmartPathCap- En- Exp=2 qd=30
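
    As an optional sanity check (these commands are an assumption for this setup, not part of the original logs), confirm that the controller sees the replacement drive and that the kernel has re-created the block device before re-enabling the logical drive.
    # Physical drive status as seen by the Smart Array controller
    hpssacli ctrl slot=0 pd all show status
    # Confirm the replacement block device is visible again
    lsblk /dev/sda
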
    Re-enable the logical drive using hpssacli.

    After re-enabling the logical drive, verify its status, which should return "OK".
    my-linux-box: # hpssacli ctrl slot=0 ld 1 modify reenable forced

    my-linux-box:# hpssacli ctrl slot=0 ld all show status

       logicaldrive 1 (838.3 GB, 0): OK
       logicaldrive 2 (838.3 GB, 0): OK

    The sdaX partitions are now missing from the RAID arrays, as expected.
    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid1 sdb8[1]
          870112064 blocks super 1.0 [2/1] [_U]
          bitmap: 5/7 pages [20KB], 65536KB chunk

    md0 : active raid1 sdb5[1]
          529600 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    md3 : active raid1 sdb7[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md1 : active raid1 sdb6[1]
          4200640 blocks super 1.0 [2/1] [_U]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    unused devices: <none>

    Now copy the partition table from sdb to sda.
    my-linux-box:~ # sfdisk -d /dev/sdb | grep -v ten | sfdisk /dev/sda --force --no-reread
    Checking that no-one is using this disk right now ...
    Warning: extended partition does not start at a cylinder boundary.
    DOS and Linux will interpret the contents differently.
    OK

    Disk /dev/sda: 109437 cylinders, 255 heads, 63 sectors/track
    Warning: extended partition does not start at a cylinder boundary.
    DOS and Linux will interpret the contents differently.
    Old situation:
    Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

       Device Boot Start     End   #cyls    #blocks   Id  System
    /dev/sda1   *      0+ 109437- 109438- 879054336    f  W95 Ext'd (LBA)
    /dev/sda2          0       -       0          0    0  Empty
    /dev/sda3          0       -       0          0    0  Empty
    /dev/sda4          0       -       0          0    0  Empty
    /dev/sda5          0+     66-     66-    529664   fd  Linux raid autodetect
    /dev/sda6         66+    588-    523-   4200704   fd  Linux raid autodetect
    /dev/sda7        589+   1111-    523-   4200704   fd  Linux raid autodetect
    /dev/sda8       1112+ 109435- 108324- 870112256   fd  Linux raid autodetect
    New situation:
    Units = sectors of 512 bytes, counting from 0

       Device Boot    Start       End   #sectors  Id  System
    /dev/sda1   *       512 1758109183 1758108672   f  W95 Ext'd (LBA)
    /dev/sda2             0         -          0   0  Empty
    /dev/sda3             0         -          0   0  Empty
    /dev/sda4             0         -          0   0  Empty
    /dev/sda5          1024   1060351    1059328  fd  Linux raid autodetect
    /dev/sda6       1060864   9462271    8401408  fd  Linux raid autodetect
    /dev/sda7       9462784  17864191    8401408  fd  Linux raid autodetect
    /dev/sda8      17864704 1758089215 1740224512  fd  Linux raid autodetect
    Warning: partition 1 does not end at a cylinder boundary
    Successfully wrote the new partition table

    Re-reading the partition table ...

    If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
    to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
    (See fdisk(8).)
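
    As an optional verification step (not part of the original procedure), the two partition tables can be compared after the copy; apart from the device names (and possibly a disk identifier) they should match.
    # Dump both tables with the device name normalised and diff them; little or no output means they match
    diff <(sfdisk -d /dev/sda | sed 's/sda/sdX/g') <(sfdisk -d /dev/sdb | sed 's/sdb/sdX/g')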

    Erase possible RAID config data (from a reused disk)

    After doing this, it is important that any remaining old software RAID metadata is removed from the newly attached disk before it is re-added to the RAID arrays.
    my-linux-box:~ # mdadm --zero-superblock /dev/sda5

    my-linux-box:~ # mdadm --zero-superblock /dev/sda6

    my-linux-box:~ # mdadm --zero-superblock /dev/sda7

    my-linux-box:~ # mdadm --zero-superblock /dev/sda8

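    Optionally, double check that the superblocks are really gone; mdadm --examine should now report that no md superblock is present on each wiped partition.
    # Each partition should report "No md superblock detected" (or similar)
    for part in sda5 sda6 sda7 sda8; do
        mdadm --examine /dev/$part
    done
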
    Afterwards, the partitions can be added back to the software RAID arrays.
    my-linux-box:~ # mdadm /dev/md0 --add /dev/sda5
    mdadm: added /dev/sda5

    my-linux-box:~ # mdadm /dev/md1 --add /dev/sda6
    mdadm: added /dev/sda6

    my-linux-box:~ # mdadm /dev/md3 --add /dev/sda7
    mdadm: added /dev/sda7

    my-linux-box:~ # mdadm /dev/md2 --add /dev/sda8

    NOTE: Add the raid partitions one at a time, and add the next one only once the previously added array shows as [UU] in /proc/mdstat.
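
    A simple way to wait for each resync to complete before moving on (a convenience only, not part of the original steps):
    # Watch the resync progress; run the next --add only when the array shows [UU]
    watch -n 5 cat /proc/mdstat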

    How to install GRUB on the disk?

    Once md0 has synchronised, GRUB should be installed again on both disks by calling the GRUB installer.
    Finally, run grub-install, which should install GRUB on both disks (hd0 and hd1) without any error message.
    # grub-install


        GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

     [ Minimal BASH-like line editing is supported.  For the first word, TAB
       lists possible command completions.  Anywhere else TAB lists the possible
       completions of a device/filename. ]
    grub> setup --stage2=/boot/grub/stage2 --force-lba (hd0) (hd0,4)
     Checking if "/boot/grub/stage1" exists... yes
     Checking if "/boot/grub/stage2" exists... yes
     Checking if "/boot/grub/e2fs_stage1_5" exists... yes
     Running "embed /boot/grub/e2fs_stage1_5 (hd0)"...  17 sectors are embedded.
    succeeded
     Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd0) (hd0)1+17 p (hd0,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
    Done.
    grub> setup --stage2=/boot/grub/stage2 --force-lba (hd1) (hd1,4)
     Checking if "/boot/grub/stage1" exists... yes
     Checking if "/boot/grub/stage2" exists... yes
     Checking if "/boot/grub/e2fs_stage1_5" exists... yes
     Running "embed /boot/grub/e2fs_stage1_5 (hd1)"...  17 sectors are embedded.
    succeeded
     Running "install --force-lba --stage2=/boot/grub/stage2 /boot/grub/stage1 (hd1) (hd1)1+17 p (hd1,4)/boot/grub/stage2 /boot/grub/menu.lst"... succeeded
    Done.
    grub> quit

    Finally validate the raid status
    # cat /proc/mdstat
    Personalities : [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
    md2 : active raid1 sda8[2] sdb8[1]
          870112064 blocks super 1.0 [2/2] [UU]
          bitmap: 6/7 pages [24KB], 65536KB chunk

    md0 : active raid1 sda5[2] sdb5[1]
          529600 blocks super 1.0 [2/2] [UU]
          bitmap: 1/1 pages [4KB], 65536KB chunk

    md3 : active raid1 sda7[2] sdb7[1]
          4200640 blocks super 1.0 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    md1 : active raid1 sda6[2] sdb6[1]
          4200640 blocks super 1.0 [2/2] [UU]
          bitmap: 0/1 pages [0KB], 65536KB chunk

    unused devices: <none>


    Similarly, the disk replacement can be performed for the second logical drive; a condensed sketch of the equivalent commands is shown below.
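
    The following is only a rough sketch of the sdb variant, assuming the same partition layout (sdb5-sdb8) and that logical drive 2 is the failed one; verify each step against your own /proc/mdstat and controller output before running it.
    # Fail and remove the sdb members (skip --fail for any partition the kernel has already marked (F))
    for pair in md0:sdb5 md1:sdb6 md3:sdb7 md2:sdb8; do
        md=${pair%%:*}; part=${pair##*:}
        mdadm /dev/$md --fail /dev/$part
        mdadm /dev/$md --remove /dev/$part
    done
    # After physically swapping the disk, re-enable logical drive 2 on the controller
    hpssacli ctrl slot=0 ld 2 modify reenable forced
    # Copy the partition table from the healthy disk, wipe any old metadata, then re-add one partition at a time
    sfdisk -d /dev/sda | sfdisk /dev/sdb --force --no-reread
    for part in sdb5 sdb6 sdb7 sdb8; do mdadm --zero-superblock /dev/$part; done
    mdadm /dev/md0 --add /dev/sdb5    # wait for [UU] in /proc/mdstat, then continue with sdb6, sdb7 and sdb8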

    I hope the article was useful.

    Related Articles:
    Collect Virtual Connect Support Dump of HP c-Class Blade Enclosures
    How to collect/generate "Show All" report for HP c-Class Blade Enclosures
    How to downgrade HP Emulex CNA NIC card firmware version

    Follow the below links for more tutorials

    How to find the path of any command in Linux
    How to configure a Clustered Samba share using ctdb in Red Hat Cluster
    How to delete an iscsi-target from openfiler and Linux
    How to perform a local ssh port forwarding in Linux
    How to use yum locally without internet connection using cache?
    What is umask and how to change the default value permanently?
    Understanding Partition Scheme MBR vs GPT
    How does a successful or failed login process work in Linux
    How to find all the process accessing a file in Linux
    How to exclude multiple directories from du command in Linux
    How to configure autofs in Linux and what are its advantages?
    How to resize software raid partition in Linux
    How to configure Software RAID 1 mirroring in Linux
    How to prevent a command from getting stored in history in Linux
