Replacing a Failed RAID Drive

Scenario / Question:

A drive has failed in my RAID 1 configuration, and I need to replace it with a new drive.

Solution / Answer:

Use mdadm to fail the failed drive's partition(s) and remove them from the RAID array(s).

Physically remove the failed drive from the system and install the new drive.

Create the same partition table on the new drive that existed on the old drive.

Add the new drive's partition(s) back into the RAID array(s).

In this example I have two drives, /dev/sda and /dev/sdb. Each drive carries several partitions, and each pair of matching partitions is configured as a RAID 1 array, denoted md#. We will assume that /dev/sdb has failed and needs to be replaced.

Note that with Linux software RAID you can build arrays from individual partitions rather than whole disks, which is the layout used here.
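If you are ever unsure which array a given partition belongs to, mdadm can print the RAID superblock stored on that partition, and it can also list every assembled array. The device name below is just this example's; substitute your own:

# mdadm --examine /dev/sda5
# mdadm --detail --scan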

Fail and Remove the failed partitions and disk:

Identify which RAID Arrays have failed:

To identify whether a RAID array has failed, look at the string containing [UU] in the /proc/mdstat output. Each “U” represents a healthy member of the array: [UU] means the array is healthy, while a missing “U”, such as [U_] or [_U], means the array is degraded or faulty.

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sda3[0] sdb3[1]
      2048192 blocks [2/2] [UU]

md3 : active raid1 sda5[0]
      2048192 blocks [2/1] [U_]

md4 : active raid1 sda6[0] sdb6[1]
      2048192 blocks [2/2] [UU]

md5 : active raid1 sda7[0] sdb7[1]
      960269184 blocks [2/2] [UU]

md1 : active raid1 sda2[0] sdb2[1]
      10241344 blocks [2/2] [UU]

From the above output we can see that RAID array md3 is missing a “U” and is degraded.
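You can also query the suspect array directly. mdadm --detail reports the array state (clean, degraded, recovering) and marks members as faulty or removed, which confirms what /proc/mdstat shows:

# mdadm --detail /dev/md3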

Removing the failed partition(s) and disk:

Before we can physically remove the hard drive from the system, we must first “fail” the disk's partition(s) in every RAID array they belong to. Even though only partition /dev/sdb5 (RAID array md3) has failed, we must manually fail all the other /dev/sdb# partitions that belong to RAID arrays before we can remove the hard drive from the system.

To fail the partition we issue the following command:

# mdadm --manage /dev/md0 --fail /dev/sdb1

Repeat this command for each partition, changing /dev/md# and /dev/sdb# to match the output of “cat /proc/mdstat”:

# mdadm --manage /dev/md1 --fail /dev/sdb2
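If you have many partitions, the remaining fail commands can be scripted. The loop below is only a sketch for this example's layout; adjust the md#/sdb# pairs to your own /proc/mdstat. Re-failing a partition that is already marked faulty is harmless, and md3/sdb5 is left out because sdb5 has already dropped out of md3 here:

# Fail every remaining /dev/sdb member in this example's layout (run as root).
for pair in md0:sdb1 md1:sdb2 md2:sdb3 md4:sdb6 md5:sdb7; do
    mdadm --manage /dev/${pair%:*} --fail /dev/${pair#*:}
done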

Removing:

Now that all the partitions are failed, we can remove them from the RAID arrays.

# mdadm --manage /dev/md0 --remove /dev/sdb1

Repeat this command for each partition, changing /dev/md# and /dev/sdb# to match the output of “cat /proc/mdstat”:

# mdadm --manage /dev/md1 --remove /dev/sdb2
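Alternatively, mdadm accepts both operations in a single manage-mode invocation, so each partition can be failed and removed with one command. For example (same effect as running the two steps separately):

# mdadm --manage /dev/md1 --fail /dev/sdb2 --remove /dev/sdb2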

Power off the system and physically replace the hard drive:

# shutdown -h now
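Tip: before shutting the system down, note the failed drive's serial number so you pull the correct physical disk. Either of the following reports it, assuming smartmontools or hdparm is installed:

# smartctl -i /dev/sdb | grep -i serial
# hdparm -I /dev/sdb | grep -i serial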

Adding the new disk to the RAID Array:

Now that the new hard drive has been physically installed, we can add it to the RAID arrays.

In order to use the new drive we must create the exact same partition table structure that was on the old drive.

We can use the existing drive and mirror its partition table structure to the new drive. There is an easy command to do this:

# sfdisk -d /dev/sda | sfdisk /dev/sdb
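The sfdisk method works for the MBR (msdos) partition tables used in this example. If your drives use GPT, older sfdisk versions cannot copy the table; sgdisk from the gdisk package can, assuming it is installed. Note that with -R the first device given is the destination:

# sgdisk -R /dev/sdb /dev/sda
# sgdisk -G /dev/sdb

The second command gives the copied table new random GUIDs so the two disks do not end up with identical identifiers.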

* Note that when drives are removed and replaced, the device names may change. Before copying the partition table, make sure the drive you installed really is /dev/sdb and has no partitions on it, by issuing the command “fdisk -l /dev/sdb”.
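To double-check which physical disk is now /dev/sdb, compare the device symlinks and serial numbers; lsblk's SERIAL column needs a reasonably recent util-linux:

# ls -l /dev/disk/by-id/ | grep sdb
# lsblk -o NAME,SIZE,SERIAL /dev/sdb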

Add the partitions back into the RAID Arrays:

Now that the partitions are configured on the newly installed hard drive, we can add them to the RAID arrays.

# mdadm --manage /dev/md0 --add /dev/sdb1
mdadm: added /dev/sdb1

Repeat this command for each partition, changing /dev/md# and /dev/sdb# to match your layout:

# mdadm --manage /dev/md1 --add /dev/sdb2
mdadm: added /dev/sdb2
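Instead of typing each add command by hand, all of the additions can be done in one loop (an alternative to the one-at-a-time commands above, not something to run in addition to them). This is only a sketch for this example's layout; adjust the pairs to match your own /proc/mdstat:

# Add every /dev/sdb member back into its array (run as root).
for pair in md0:sdb1 md1:sdb2 md2:sdb3 md3:sdb5 md4:sdb6 md5:sdb7; do
    mdadm --manage /dev/${pair%:*} --add /dev/${pair#*:}
done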

Now we can check that the partitions are being synchronized by issuing:

# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      104320 blocks [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md3 : active raid1 sdb5[2] sda5[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md4 : active raid1 sdb6[2] sda6[0]
      2048192 blocks [2/1] [U_]
        resync=DELAYED

md5 : active raid1 sdb7[2] sda7[0]
      960269184 blocks [2/1] [U_]
      [>....................]  recovery =  1.8% (17917184/960269184) finish=193.6min speed=81086K/sec

md1 : active raid1 sda2[0] sdb2[1]
      10241344 blocks [2/2] [UU]

Once all the partitions have synchronized, your RAID arrays will be back to normal again.
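To follow the rebuild without re-running the command by hand, watch /proc/mdstat or ask mdadm for a specific array's rebuild status. If the resync is crawling, the kernel's RAID speed limits (in KB/s) can be raised; the value below is only an example, not a recommendation:

# watch -n 5 cat /proc/mdstat
# mdadm --detail /dev/md5
# echo 100000 > /proc/sys/dev/raid/speed_limit_min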

Install GRUB on the new hard drive's MBR:

We need to install GRUB on the MBR of the newly installed hard drive, so that if the other drive fails the new drive will still be able to boot the OS.

Enter the GRUB command line:

# grub

Locate the GRUB stage1 files:

grub> find /grub/stage1

On a RAID 1 with two drives present you should expect to get:

(hd0,0)
(hd1,0)

Install grub on the MBR:

grub> device (hd0) /dev/sdb (or /dev/hdb for IDE drives)
grub> root (hd0,0)
grub> setup (hd0)
grub> quit

We mapped the second drive, /dev/sdb, to device (hd0) because installing GRUB this way puts a bootable MBR on the second drive; when the first drive is missing, the second drive will boot as (hd0).

This ensures that if the first drive in the RAID array fails, or has already failed, you can still boot the operating system from the second drive.
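The interactive session above applies to legacy GRUB (GRUB 0.9x), which systems of this vintage typically shipped. On distributions that use GRUB 2, the same result is normally achieved with a single command; the exact command name varies by distribution:

# grub-install /dev/sdb (or grub2-install /dev/sdb on RHEL/CentOS and Fedora)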


20 Comments so far

  1. elvis123 on August 21st, 2009

    Great howto, but you have 2 mistakes: in the section where you add the partitions back into the RAID you said “--add /dev/sda1” and “--add /dev/sda2”; shouldn’t this be --add /dev/sdb1 and --add /dev/sdb2?

  2. max on February 3rd, 2010

    It’s better to use:

    grub-install /dev/sdb

    instead of the method at the end of this article.

  3. marc on March 22nd, 2010

    max: grub-install works some of the time on some machines (haven’t quite figured out why it doesn’t on about half the ones I try), but the manual grub method outlined here *always* works.

  4. egburr on August 3rd, 2010

    Thanks! I had set up a system with RAID 1 many (4 or 5) years ago, and a drive failed yesterday. Finding this saved me a lot of time refreshing my memory on how to swap in the new drive.

  5. till on January 4th, 2011

    thanks so much, exactly what I needed today!

  6. Gabriel Ramuglia on May 4th, 2011

    Brilliant instructions. I’ve rebuilt a couple of arrays before, but these were by far the easiest and least painful instructions for it. The sfdisk command saved me a lot of time.

  7. Benny on July 16th, 2011

    Hi, thanks for the hints.

    Did it in another order, first added the new drive as spare, then set the bad drive faulty

    also if you have hotswap drives you don’t need to reboot the machine ;)

  9. John on October 20th, 2011

    Hi,

    I actually found this post while searching for fixes for my “missing” drive problem on my QNAP Ts809u-RP NAS device. The device is an 8-bay NAS that runs a modified Linux kernel installed on a disk-on-module (DOM), and software RAID using mdadm. I have 8 1TB Seagate drives on it in software RAID 5, and after a recent firmware update to the DOM the drive in bay 6 of my NAS went missing. The drive is healthy as reported by SMART.
    I have tried the following to resolve the issue:
    1. Removed and reinserted the drive: the drive showed up and RAID started rebuilding, then the drive disappeared.
    2. Deleted the array and reset the device with all drives removed. After this the drive showed up again, but it was marked as “BAD” just before going missing again.
    3. Placed another drive in the bay, and it does not show up.

    Cold boots, resets, and starting the device with no drives present all fail as well. I cannot revert to previous firmware, although that shouldn’t help either, as I’ve heard this has happened on all HW versions of the NAS. From what I’ve encountered in my googling of the topic, this problem is not limited to the QNAP NAS, the controller, or even the managing OS, but affects software RAID as a whole. I’ve read about this missing-drive issue on Intel controllers in Windows as well. Any ideas would be greatly appreciated.

  10. sennai on November 2nd, 2011

    Awesome.
    Thank you very much for this great post. It really helped me today.

  11. Bhaskar Chowdhury on December 20th, 2011

    Magnificent… a clean and clear way of describing a complex thing. Kudos to you.

  13. Ben F on January 28th, 2012

    Thanks for the great write up – I’ve got a failed RAID 1 array on a Centos5 box and this is the clearest practical write up I’ve found.

  14. wth on July 23rd, 2012

    Beautiful! Since disks don’t fail too often, I tend to forget how to do it, and this article is the best one. Everything included, the correct way. Printed it to PDF and saved it to my phone so I always have it with me :)
    Thanks!

  15. Anthony on October 16th, 2012

    Why would one have to power off the system to replace a disk?

  16. cheese on October 23rd, 2012

    Because it’s a software RAID and not a hardware RAID, unless your board supports hot swapping.

  17. Eric Law on February 18th, 2013

    Hi, thanks for the hints.
    But after the recovery it shows the following:

    [root@topcomp ~]# cat /proc/mdstat
    Personalities : [raid1]
    md2 : active raid1 sdb1[1] sda1[0]
    1052160 blocks [2/2] [UU]

    md1 : active raid1 sdb2[1] sda2[0]
    2096384 blocks [2/2] [UU]

    md0 : active raid1 sdb3[2] sda3[0]
    153099328 blocks [2/1] [U_]

    unused devices:
    [root@topcomp ~]#

    Why does md0 show active raid1 sdb3[2], and not sdb3[1]?

  18. ThCTlo on June 19th, 2013

    For the GRUB install use:

    grub-install /dev/sda /dev/sdb

    and you’re done.

  19. augustin on November 2nd, 2013

    Thanks a lot for this article. Very helpful! I have rebuilt my RAID 1.

    Is there any way to check whether the MBR is installed OK on both disks without a reboot?

    I have replaced sda… all works fine.

    I have done this tutorial… and for the MBR I ran:
    grub
    find /grub/stage1
    device (hd0) /dev/sda (instead…)
    root (hd0,0)
    setup (hd0)
    It gave me a success message…
    quit

    Now I’m wondering whether the server will boot if I reboot it…
