Which Hard Drive Failed in my Linux software RAID array?
One of the items I’m always asked about, when it comes to replacing a failed drive in Linux software RAID, is “Which physical drive failed?”
Linux names its hard drives using udev, and doesn’t guarantee that a drive that shows up as sda today will show up as sda tomorrow. It may appear as sdb or sdc. So, when a drive fails and /proc/mdstat tells you /dev/sdc1 failed, how do you know which physical drive it really is? In RAID-5, if you pull the wrong physical drive, chances are you’ve lost all your data, because the array cannot run with two members missing. In RAID-6, you get an extra chance.
The answer all boils down to the drive’s serial number. You need to find the mapping between the device name and its serial number via ‘lshw’.
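As a starting point, the failed member can be picked out of /proc/mdstat with a small helper. This is only a sketch: the function name failed_members is made up for illustration, and it relies on the kernel marking failed members with an (F) suffix in /proc/mdstat.

```shell
# Hypothetical helper: print the names of array members flagged as failed.
# Assumes failed members appear as e.g. "sdc1[2](F)" in the mdstat text.
# The file argument defaults to /proc/mdstat; a saved copy works too.
failed_members() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                name = $i
                sub(/\[.*/, "", name)   # drop the "[2](F)" part
                print name
            }
    }' "${1:-/proc/mdstat}"
}
```

Calling failed_members with no argument reads the live /proc/mdstat. It prints nothing when no member is marked as failed.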
# lshw -class disk
  *-disk:4
       description: ATA Disk
       product: WDC WD2001FASS-0
       vendor: Western Digital
       physical id: 0.0.0
       bus info: scsi@5:0.0.0
       logical name: /dev/sde
       version: 01.0
       serial: WD-WMAUR0279543
       size: 1863GiB (2TB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 signature=0007bd23
The two lines from the above output that concern us are ‘logical name’ and ‘serial’. The serial number will match what is printed on the hard drive’s label, and the logical name will match the failed drive in /proc/mdstat. Note that /proc/mdstat usually lists a partition such as sde1, while lshw reports the whole disk, /dev/sde.
Just to make things easier for me, rather than pull all the drives out of a box looking for the proper one, I keep a list of the drives, their positions and serial numbers. I keep this list with the system, and reference it if I need to replace a drive. I also make sure to update it with the new drive’s information.
Another thing I do is simply use a label maker to place the serial number in a more visible position on the drive.
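To pull just those two fields out for every disk at once, a filter along these lines may help. It is a sketch, not a tested tool: the name disk_serials is invented here, and it assumes each disk entry starts with a ‘*-’ line and prints ‘logical name’ before ‘serial’, as in the output above.

```shell
# Hypothetical filter: read `lshw -class disk` output on stdin and print
# one "logical-name serial" pair per disk.
# Assumes "logical name:" comes before "serial:" within each entry.
disk_serials() {
    awk -F': ' '
        /^ *\*-/           { name = "" }   # a new device entry begins
        /^ *logical name:/ { name = $2 }
        /^ *serial:/       { if (name != "") print name, $2; name = "" }
    '
}
```

It would be used as ‘lshw -class disk | disk_serials’, run as root. For the drive shown above it prints ‘/dev/sde WD-WMAUR0279543’. Entries without a serial line are skipped.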
All you have to be able to do is see the serial number clearly. It’s way better than pulling them all until you find the right one.
