Over the weekend, the lab I work in experienced a power outage. After power was restored, one of our servers failed to boot. It became my responsibility to figure out whether the server could be repaired, and failure wasn't an option: the server ran a bunch of services and hosted lots of data for many users, with no backups. A typical sysop problem (lol), but our lab has no personnel for managing its systems, so there I was.
In the process of finding and fixing the issue, I learned a lot about how the Grub bootloader and Linux behave during system boot, so I decided to document the experience for future reference. This write-up is lengthy, so if you only care about avoiding this type of problem, skip ahead to the Remediation section.
Finding the Problem
The server in question was a Dell PowerEdge T620 running Ubuntu. It had two Intel Xeon processors, about 128GB of RAM, and three 2TB hard drives behind a RAID controller.
The problem occurred during system start-up. The BIOS would start Grub, and Grub would then produce the error:

error: attempt to read or write outside of disk hd0

After that, the Linux kernel would start, but shortly after it would spit out a stack trace and crash.
My first response was to run a disk check on the hard drives to make sure there weren't any problems with their sectors, but this turned up nothing. The disks were operating normally. This meant that the problem was most likely related to Grub.
Diagnosing the Problem
Since the error message that was appearing was being generated by Grub, the next thing I did was try to manually start the Linux kernel. The easiest way to do this is to press 'c' when the Grub menu appears. This will start a Grub command shell through which operating systems can be manually booted.
Before explaining the commands I used while in the Grub command shell, I'll summarize some basics regarding partitions here.
First, hard drives and their partitions are accessible in Linux via the "/dev" directory. On most modern Linux systems, hard drives follow a naming convention starting with "sd" followed by a letter designating the drive: "sda", "sdb", "sdc", and so on. Partition names consist of the drive name followed by a digit. The drive "sda", for example, might have the partitions "sda1", "sda2", and so forth in the "/dev" directory.
Grub also has a notion of disks and partitions, but its naming syntax is a little different. It takes the form "hd#,#", where the first # is the disk number (counted from 0) and the second # is the partition on that disk (counted from 1 in Grub 2): "hd0,1", "hd0,2", and so on.
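To make the mapping between the two naming schemes concrete, here's a small illustrative helper function (my own sketch, not part of Grub or Linux) that converts a Linux partition name into Grub 2 syntax, assuming plain "sdXN"-style names:

```shell
# Illustrative helper: map a Linux partition name to Grub 2 syntax.
# Assumes /dev/sdXN naming; Grub 2 counts disks from 0, partitions from 1.
dev_to_grub() {
  local dev="${1#/dev/sd}"      # strip the prefix, e.g. "a1"
  local letter="${dev%%[0-9]*}" # drive letter, e.g. "a"
  local part="${dev#"$letter"}" # partition number, e.g. "1"
  local disk=$(( $(printf '%d' "'$letter") - 97 ))  # 'a' is ASCII 97, so a->0, b->1, ...
  echo "(hd${disk},${part})"
}

dev_to_grub /dev/sda1  # prints (hd0,1)
dev_to_grub /dev/sdb3  # prints (hd1,3)
```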
While I'm covering general Linux background knowledge, I'll also mention a portion of the Linux boot sequence, because it's important for understanding the Grub commands. On most Linux systems that boot using the BIOS (as opposed to newer EFI/UEFI booting), the critical pieces are: the BIOS, Grub, initrd, and vmlinux (the kernel image, usually shipped compressed as "vmlinuz" with a version suffix).
BIOS stands for Basic Input/Output System and is the first thing the system executes upon receiving power. Once started, the BIOS performs some basic system checks and then loads and executes the bootloader. There are many bootloaders out there, but Linux systems tend to use one called Grub. Grub's job is to load into memory the pieces necessary to start the operating system. In the case of Linux, the two important pieces Grub needs to load are initrd and vmlinux. I won't discuss these files in great detail, but briefly: initrd is the initial RAM disk, a minimal root filesystem containing the drivers and programs the kernel needs early in boot, and vmlinux is the kernel image itself. Grub loads both into memory and hands control to the kernel, which uses initrd as a temporary root filesystem until it can mount the real one, at which point initrd is discarded from memory.
Now that I've covered all the necessary background information, on to the Grub commands.
The first step is to find where the boot directory is on the system. First, list all the partitions on the system:

grub> ls

This prints a list of the disks and partitions Grub can see. The next step is to figure out which partition contains Grub, initrd, and vmlinux (since we're trying to boot a Linux operating system). This can also be done with the ls command:
grub> ls (hd0,1)
grub> ls (hd0,1)/boot
Once we've found the location of the boot files, set that partition as Grub's root directory:
grub> set root=(hd0,1)
I used "hd0,1" in the above commands, but you might find the boot files in a different partition. Either way, the boot files for a Linux system should always be either in the root directory of the partition (if that partition is a standalone boot partition) or in "/boot" (if that partition also holds other files).
After we've found the correct root partition, we next need to give Grub the names of the vmlinux and initrd files:
grub> linux /boot/vmlinux root=/dev/sda1
grub> initrd /boot/initrd
Note that the linux command needs to be passed the parameter "root". This parameter names the partition holding the Linux root filesystem, which may or may not be the same partition holding the boot files. (On a real system the file names usually carry version suffixes, such as "vmlinuz-<version>" and "initrd.img-<version>"; the Grub shell's tab completion helps here.)
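For example, if the boot files live on their own dedicated partition (hd0,1) while the root filesystem is on /dev/sda2, the sequence would look something like this (partition names are assumed for illustration; note the paths are relative to the boot partition's root, so there's no "/boot" prefix):

```
grub> set root=(hd0,1)
grub> linux /vmlinux root=/dev/sda2
grub> initrd /initrd
```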
Once all the files have been specified, all that's left is to boot the operating system:

grub> boot
At this point, Linux should start up and eventually drop you at a login prompt. Of course, our lab's server was failing to boot, so that didn't happen. Instead, the error mentioned at the beginning appeared when I executed the initrd command, which told me that initrd was the problematic file.
Remediation
So what went wrong? Basically, the problem was with the partitions on the server's hard drives. When Grub tried to load the initrd file into memory, it hit the end of the region of the disk it could read before reaching the end of the file. In our case, the RAID controller presents the three drives as a single logical 4TB drive, and the initrd file can end up anywhere in those 4TB. As it turned out, initrd resided at a location Grub couldn't address, so it couldn't be loaded. The power outage was therefore not the direct cause of our problem: at some point, most likely during an Ubuntu update, the initrd file was rewritten and landed in a region of the logical drive that Grub can't reach.
The solution to this problem is to keep the boot files in a small partition that resides at the start of the hard drive. For readers who skipped straight to this section, that's all you need to know: when you install a Linux system, you really should make a dedicated 256MB partition for the boot files, even though most Linux installers don't require you to.
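As a sketch, this is the kind of layout I mean at install time (device names and sizes are illustrative, not prescriptive):

```
/dev/sda1   256MB       ext4   /boot   (boot flag set, at the start of the disk)
/dev/sda2   remainder   ext4   /
```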
In my case, however, I couldn't just reinstall the operating system, so in the following paragraphs I'll describe how I migrated the boot files in my already existing Linux installation into a new dedicated boot partition.
Creating a Boot Partition in an Existing Linux Installation
I started by flashing a copy of Ubuntu to a flash drive and booting it. There are plenty of tutorials on the internet about how to boot Ubuntu from a flash drive, so I'll forgo those instructions here.
The first thing I did was use gparted to create a new partition to serve as the dedicated boot partition. Since the hard drive was already formatted, this required shrinking the main partition and shifting it 256MB toward the end of the disk. This frees up space at the start of the drive, which can then be formatted into the new boot partition. If you've never used gparted before, I recommend the GUI version since it's pretty intuitive. The new partition can be formatted as "ext4" and needs the "boot" flag enabled. The shift takes a while, so you'll probably want to run it overnight.
Once the new partition has been created, we next need to copy the existing "/boot" directory's contents over to the new partition. This can be done by mounting the two partitions while still in the Ubuntu Live USB. In the following commands, sda2 is my main partition and sda4 is the new boot partition:
sudo -s
mkdir /mnt/sda2
mkdir /mnt/sda4
mount /dev/sda2 /mnt/sda2
mount /dev/sda4 /mnt/sda4
cp -R /mnt/sda2/boot/* /mnt/sda4/
rm -rf /mnt/sda2/boot/
mkdir /mnt/sda2/boot
umount /mnt/sda2
umount /mnt/sda4
Finally, Linux and Grub need to be reconfigured so they know the "/boot" directory is now on a separate partition. This can be done manually by modifying Grub's "grub.cfg" file and Linux's "/etc/fstab" file, but for simplicity you can use Boot Repair. If you go the Boot Repair route, make sure to switch the GUI into advanced mode and go through all the options. Specifically, the "boot partition" option should be set to your new boot partition and the "main operating system" option to your main partition. Also make sure the "set boot flag" option points at the new boot partition; you can save a lot of time by disabling the "check filesystem for errors" option. If you instead go the manual route, you'll need to edit "grub.cfg", changing every "hd" reference to point at the correct partitions (on Ubuntu it's safer to regenerate the file by running update-grub from the installed system), and "/etc/fstab" will need an additional entry to mount the new partition at "/boot" during start-up.
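For reference, the manual fstab entry would look something like the line below. The UUID is a placeholder, not a real value; you can find the actual UUID of your new partition with "blkid /dev/sda4" (substituting your partition name):

```
# /etc/fstab -- mount the dedicated boot partition (UUID is illustrative)
UUID=<uuid-of-sda4>  /boot  ext4  defaults  0  2
```

The final "2" tells fsck to check this filesystem after the root filesystem during boot.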
If you do everything correctly, this should fix your Linux system and prevent similar issues from arising in the future.
Even though modern Linux installers do not require you to create a separate partition for the boot files, I recommend doing it anyway, especially if you have large hard drives. Otherwise, you might run into the problem I did.