I’ve always been a fan of software RAID as a cheap way to get data redundancy. And I simply can’t grasp why some people still swear by hardware RAID at home and stubbornly want to know nothing about software RAID. I really like argument 2 of this article: http://augmentedtrader.wordpress.com/2012/05/13/10-things-raid/ because it’s so damn true.
In the past I’ve put my software RAID through quite a few harsh situations:
- Motherboard + CPU upgrade
- Two disk failures
- One disconnected disk
- Removed USB drives from their external casings and connected them straight to the motherboard
- Expanded the RAID5 from 3 to 4, then 5, and finally 6 disks.
Every time, my little RAID kept going and going. Five years, ever since Ubuntu 8.04 LTS! But today disaster struck. And it struck badly.
I’ve had my fair share of harsh usage in the past (see above), but this time was different. I received a warning that one of my disks had failed and there was no hot spare available to rebuild with.
As stupid as I was, I SSH’ed in from my phone and shut down the RAID to limit any further damage.
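The shutdown itself was nothing fancy; a sketch of what that looks like, assuming the array is /dev/md0 (the device name here is just an example):

sudo mdadm --stop /dev/md0
sudo shutdown -h now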
Big mistake! There were various events happening at the same time:
a) One disk had died, no spare, and RAID5: no redundancy left.
b) The BIOS battery seemed to be dead; all these years of uptime had taken their toll on it. So when I booted the system to troubleshoot the RAID, I saw three superblocks with a date somewhere in March 1970 and one dated the 28th of November 2013. Ouch!
c) A small hiccup seemed to have occurred on the server: the superblock of one of the drives was lost. Just blank, empty. It might have been me interrupting the power when I got home. (The NAS runs without a screen, so I don’t really know what is going on, and I often just reboot the server.)
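This is why it pays to check what the superblocks actually say before doing anything. A minimal sketch of inspecting a member’s metadata, assuming the members are /dev/sdb1 through /dev/sdg1 (adjust for your own devices):

sudo mdadm --examine /dev/sdb1 | grep -E 'Update Time|Events|State'

If the update times and event counters disagree across members, mdadm won’t assemble the array cleanly, which is exactly the mess I was looking at.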
So yeah, the superblocks kinda screwed me over. I tried to reassemble the array like I normally would and get the disks back running, but all data was lost. There wasn’t enough reliable superblock information left to make it work, and a resync destroyed my data. Such a shame.
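For reference, a typical rescue attempt looks something like this sketch (device names are examples). Be warned: --force tells mdadm to trust the freshest-looking superblocks, and if those are wrong, the resync that follows can overwrite good data. That is exactly what happened here.

sudo mdadm --stop /dev/md0
sudo mdadm --assemble --force /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1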
What did I learn?
Keep your RAID alive at all costs and add a hot spare as fast as possible! Because once your system goes down, there is no way to know what exactly will happen. Also, if possible, create a backup of your superblocks.
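Both are quick to do. A minimal sketch, again with example device names (md0, sdg1 and so on are assumptions):

# add a hot spare to the running array
sudo mdadm --add /dev/md0 /dev/sdg1
# record the array layout in the config file
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
# dump every member's superblock to a text file
sudo mdadm --examine /dev/sd[b-g]1 > ~/superblock-backup.txt

Keep that backup file somewhere off the array itself; a copy of the superblock data sitting on the dying RAID won’t help you.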
So the future?
My next project will be a RAID5 of 3 disks for non-critical data and a RAID1 of 2 disks for critical storage. I’ll reuse two of the old 1TB disks for the mirror. (My RAID5 was built with 6 x 1TB.)
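Creating those with mdadm is a one-liner each; a sketch with made-up device names:

# 3-disk RAID5 for the non-critical data
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
# 2-disk RAID1 mirror for the critical storage
sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1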
In the meantime, here are a few of my hints for tracking down the failed disk.
It is really easy if you have the right tools. I am using an old USB controller board salvaged from an external disk enclosure. You can plug the disk into a nearby power source and debug it on the fly over USB with a laptop.
Once connected, run:
sudo badblocks -sv /dev/sdi
Note: this can take a very long time (mine needed roughly eight hours for a 1TB disk).
enira@enira-MS-7740:~$ sudo badblocks -sv /dev/sdi
Checking blocks 0 to 976762495
Checking for bad blocks (read-only test): 95.00% done, 7:42:14 elapsed. (0/0/0 errors)
Note: Yes, I know, it’s a read-only test. But there is no need for a read/write test. Why? Because it’s already quite clear: mdadm has stated that a disk is broken. I don’t need to hunt for read/write errors, I just need to identify one shitty failed disk.
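One last hint: once badblocks (or mdadm) has pointed at the culprit, match the device to the physical hardware by serial number before pulling anything. A minimal sketch, assuming smartmontools is installed and the suspect disk is still /dev/sdi:

sudo smartctl -i /dev/sdi
# some USB bridges need the SAT passthrough flag:
sudo smartctl -d sat -i /dev/sdi

The output includes the drive model and serial number, which you can compare against the sticker on the physical drive.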