Monday, April 20, 2009

Software RAID on an existing linux system

So, as mentioned in a previous article I bought a terabyte hard disk, mainly to impress the ladies. It doesn't seem to have been working too well in that respect. I considered sending it back, claiming it needs more testosterone. But I decided instead to set up software RAID 1.

My main guide for this task was this article: http://www.howtoforge.com/software-raid1-grub-boot-debian-etch
Rather than repeat everything I will just comment on what I did differently. Note that despite the name of that article, it applied fine to Ubuntu 7, and I suspect it will apply equally well to any linux distro of the past two or three years.

First, my disks were different sizes: 320G vs. 1T. So I opened fdisk -l /dev/sda in one terminal, then ran fdisk /dev/sdb in a second terminal and set up the first six partitions (plus one extended partition) to be the same as on /dev/sda. I then set up /dev/sdb8 as one big partition covering the rest of the disk, which I mounted as /backup.

All the software RAID guides number the arrays /dev/md0, /dev/md1, etc. I find that confusing, so I decided to give each array the same number as the partition it mirrors; it turns out this is fine. So I used commands like this:
mdadm --create /dev/md3 --level=1 --raid-disks=2 missing /dev/sdb3

compared to this from the HowToForge article:
mdadm --create /dev/md2 --level=1 --raid-disks=2 missing /dev/sdb3
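The naming rule is simple enough to sketch as a small shell helper. This is only an illustration: md_for_part and create_cmd are names I made up, and the script prints the mdadm command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: map a partition such as /dev/sdb3 to the md device
# carrying the same number (/dev/md3).
md_for_part() {
    # Strip everything up to and including the last non-digit character.
    part_num=${1##*[!0-9]}
    printf '/dev/md%s\n' "$part_num"
}

# Hypothetical helper: print (do not run) the command that creates a
# degraded RAID 1 array from one partition, with the second disk "missing".
create_cmd() {
    printf 'mdadm --create %s --level=1 --raid-disks=2 missing %s\n' \
        "$(md_for_part "$1")" "$1"
}

create_cmd /dev/sdb3
```

Printing the command first gives you a chance to check the device names before running anything as root.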


The above article did all partitions in one go. I wanted to take baby steps. So I started by mirroring one partition, choosing a non-critical one, and ignoring all that stuff about grub configuration. It worked well. I then did /var and /, leaving /boot unmirrored and therefore again ignoring all that grub configuration stuff. /var went smoothly but / did not. It kept saying /dev/sda3 was in use: "mdadm: Cannot open /dev/sda3: Device or resource busy". But df didn't list it, and trawling /var/log/* turned up nothing. "Eh?" thought I.

Eventually I realized that grub was set to boot from /dev/sda3, not from /. I.e. grub works at the device level not the partition level.

Key Advice: mirroring one partition at a time is fine, but then do "/", "/boot" and all that grub stuff in one go.

Now, something the HowToForge article didn't mention at all was doing file copies in single-user mode. I know just enough unix administration to scare myself and the idea of taking a copy of /var and / while all the daemons were still running made me nervous. So I tried to boot to single-user mode.

Discovery #1: Ubuntu is weird. They don't use run levels like the rest of the linux world. Sure, their numbering scheme may make more sense, but it also causes much surprise.

Discovery #2: "telinit 1" doesn't work. The GUI shuts down and then nothing. I can see the system is still running but nothing on screen. Nothing on Ctrl-Alt-F1 through to Ctrl-Alt-F12. I think ctrl-alt-del caused a reboot.

So, I booted, and chose the "recovery kernel" from the grub menu. Recovery being another name for single-user mode. It asks for the root password. Ah. Ubuntu doesn't use one. I have to continue into a normal GUI boot.

Key Advice: Set a root password. This has nothing to do with software RAID: you never know when you are going to want to boot into a recovery kernel. I think Ubuntu deserves a forehead slap for this design decision. To set a root password, do "sudo passwd": it will first ask for your user password (that is how sudo works), then it will say "Enter new UNIX password:". I chose the same password as my normal user; easy to remember. (If that makes you nervous, remember that this is equal security to being able to run "sudo passwd"; more importantly, it saves me having to write down my root password somewhere.)

Other comments: the "cp -dpRx" command is the most time-consuming step for those partitions when you have a lot of data. The system will be sluggish while doing this (and if you are in single-user mode you cannot be browsing or doing anything else anyway). But also when you do the "mdadm --add /dev/md3 /dev/sda3" to actually create a genuine RAID partition it will be copying a lot of data, and your system will be sluggish for the time this takes (about 10 minutes per 30G). Bear this in mind.
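To get a feel for how long a resync might take, here is a back-of-envelope sketch. It assumes the rate I saw (about 10 minutes per 30G), which will differ on other hardware; estimate_sync_minutes is a made-up name.

```shell
#!/bin/sh
# Rough estimate only: assumes roughly 10 minutes per 30G, the rate
# observed on this particular machine. Takes the partition size in G.
estimate_sync_minutes() {
    size_gb=$1
    echo $(( size_gb * 10 / 30 ))
}

# A 300G partition at that rate:
estimate_sync_minutes 300

# While a real resync is running, progress can be watched with:
#   cat /proc/mdstat
```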

Swap. I've created /dev/md2 as a swap RAID, but haven't used it. So /dev/md2 just contains /dev/sdb2. And "cat /proc/swaps" tells me only /dev/sda2. Apparently if my sda disk dies my swap will disappear and my system might crash. On the other hand, mirroring swap has a performance downside (apparently). I don't understand the trade-offs well, but this is a workstation, not a server, and a crash should /dev/sda ever die is acceptable. I've also got more memory than I need so I suspect I could switch swap off completely. Anyway, so far Inertia born from Ignorance means I'm doing nothing about this. Expert opinions are welcome.
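If you want to check quickly whether your swap is on a RAID device, something like the following should do. It is a minimal sketch: swap_devices and swap_is_mirrored are names I invented, and it assumes the usual /proc/swaps layout of one header line followed by one line per swap area with the device in the first column.

```shell
#!/bin/sh
# Hypothetical helper: list active swap devices. Reads /proc/swaps by
# default; a different file can be passed in for testing.
swap_devices() {
    # Skip the header line and print the first column (the device name).
    awk 'NR > 1 { print $1 }' "${1:-/proc/swaps}"
}

# Hypothetical helper: succeeds if at least one swap area is on /dev/md*.
swap_is_mirrored() {
    swap_devices "$1" | grep -q '^/dev/md'
}

# Usage:
#   swap_devices
#   swap_is_mirrored && echo "swap is on RAID" || echo "swap is not on RAID"
```

On my system this would list only /dev/sda2, so the check would report that swap is not on RAID.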

Everything works. Just two things I've noticed. Sometimes the system doesn't boot. It comes up in an "ash" shell, which is one of those recursive acronyms standing for "Ash is Satanic Hell". So far the only command I've mastered is "reboot", which works well - I've not had two boot failures in a row. After 10-20 reboots I think this boot failure is happening about 20-25% of the time. I've no idea how to troubleshoot it.
UPDATE: this boot problem doesn't seem to be happening since upgrading to Ubuntu 8. Perhaps the upgrade fixed some mistake in my grub.conf? Maybe related is that I no longer have an (hd1,0) section in grub.conf; they are all (hd0,0). I think I'll leave it like that: if "hd0" dies I can edit the grub boot string that once, and then edit grub.conf.

The other thing is my screen sometimes goes blank. The system is fine; it is just as if the video card has died. My solution has always been: Ctrl-Alt-F1, pause for half a second, then Ctrl-Alt-F7. I guess this process resets the video card. It used to happen maybe once a day or so. Now it seems to be happening once an hour on average, and is becoming almost irritating enough that I'll have to look into it. The fact that running RAID could affect the display feels like an important clue.

Finally, my home partition is encrypted, using crypt. I have not moved it to software RAID yet, for that reason. I will really soon (as all my data that could really benefit from the security of software RAID is currently the only data not protected by it)! I will run crypt on top of software RAID, rather than the other way around; I'm just not sure the best way to do the file copy.
UPDATE: now done, it was straightforward, see Moving encrypted partition to software RAID.

4 comments:

keith.s.wilkinson said...

http://linux-raid.osdl.org/index.php/Linux_Raid

keith.s.wilkinson said...

The combination of an AMD Phenom II X3 720 with an AMD 780G+SB700 chipset mobo like the Gigabyte GA-MA78GM-US2H or GA-MA78GPM-UD2H makes a fast but budget-priced combo.
It appears that the hardware-assisted software RAID of the SB700 is supported in the LINUX kernel, see http://cateee.net/lkddb/web-lkddb/SATA_AHCI.html

keith.s.wilkinson said...

PS: Currently reading the book "LINUX Admin. 5th Edn. (2009)" by Soyinka -- exceptionally well-written description of all things LINUX. It's RH-centric, but mentions Debian/Ubuntu differences. Describes how to move a directory on a live system using LVM, but doesn't mention RAID.

allen said...

You can add one more recovery tool called Linux Recovery Software, which is quite useful and effective in recovering lost, corrupted, deleted or formatted Linux partitions/volumes and in recovering deleted files & folders from ext2, ext3 and ReiserFS file systems.