A couple of weeks ago, I ran into a very strange and, at first sight, complicated problem. A physical server running Debian Squeeze with software RAID didn't start up anymore after a reboot. Troubleshooting was all the more difficult because I didn't have console access to this server, so I was essentially troubleshooting blind.
At first I thought (and hoped) that fsck was still running, since the server hadn't come back up. I gave it adequate time before I contacted someone in the data center to physically take a look at the console. Then the answer from the data center guy came back: the server doesn't boot and hangs on the grub boot screen. Oh golly...
I booted the server into rescue mode with SSH activated so I could at least take a look at the current grub configuration. I had already run into grub2 issues in the past (see Kernel upgrade problem on Debian Squeeze), so I was already acquainted with the "device.map" file. I mounted the boot file system and took a look at it:
root@rescue /mnt/boot/grub # cat device.map
These entries tell grub2 which disks to look for when booting. I remembered that a couple of weeks earlier I had replaced a defective HDD, and that one of these entries was probably still from the old HDD. So I needed to replace the entries with new ones. I decided to completely remove the grub2 bootloader and reinstall it, to make sure grub is also written to the first sectors of the disks:
root@rescue ~ # mkdir /mnt/rescue
root@rescue ~ # mkdir /mnt/rescue/boot
root@rescue ~ # mount /dev/vg0/root /mnt/rescue/
root@rescue ~ # mount /dev/md1 /mnt/rescue/boot
root@rescue ~ # mount /dev/vg0/var /mnt/rescue/var
root@rescue ~ # mount --bind /dev /mnt/rescue/dev/
root@rescue ~ # mount --bind /proc /mnt/rescue/proc/
root@rescue ~ # mount --bind /sys /mnt/rescue/sys/
root@rescue ~ # chroot /mnt/rescue /bin/bash
If you wonder why I mounted the var file system: this is needed if one wants to use apt-get. And that's what I did:
root@rescue / # apt-get remove grub; apt-get purge grub; apt-get install grub
It was necessary to use "purge"; otherwise some of the grub config files would have been left behind. After I answered the install questions (I installed grub on /dev/sda, /dev/sdb and /dev/md1 which, as you can see, was my boot file system), I checked the device.map file again:
root@rescue / # cat /boot/grub/device.map
After these changes, the system booted again.
But how could this happen? I investigated on another, very similar server, which also had a recent disk replacement. Its device.map also contained an entry for the old HDD, so I ran update-grub to update the grub configuration.
But there were no changes made to the device.map file; the old HDD entry still existed. If I were to reboot that server, it probably wouldn't start up anymore either!
I continued testing and noticed that if I removed device.map and _then_ launched update-grub, the file was recreated by update-grub and the entries were _now_ correct.
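To spot such a leftover entry before a reboot, a small check along these lines might help. This is only a sketch: it assumes the usual device.map layout of one "(hdN)" name followed by a device path per line, and the function name stale_device_map_entries is made up for this example. It can only flag paths that no longer exist at all; an entry pointing at a reused name like /dev/sdb would not be noticed.

```shell
#!/bin/sh
# List device.map entries whose device path no longer exists.
# stale_device_map_entries is a hypothetical helper, not part of grub.
# Usage: stale_device_map_entries [path-to-device.map]
stale_device_map_entries() {
    map_file="${1:-/boot/grub/device.map}"
    # Assumed line format: (hd0)<whitespace>/dev/...
    while read -r grub_name device_path || [ -n "$grub_name" ]; do
        case "$grub_name" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        if [ ! -e "$device_path" ]; then
            printf 'stale: %s -> %s\n' "$grub_name" "$device_path"
        fi
    done < "$map_file"
}
```

Calling stale_device_map_entries without an argument checks /boot/grub/device.map; no output means every listed path still exists.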
So if you replace an HDD, make sure you delete the /boot/grub/device.map file before launching the update-grub command!
Shortly after this discovery, I filed a Debian bug report, which can be seen here: grub-update does not update device.map when hdd was replaced. Hopefully this bug will be fixed soon, or has already been fixed, as Debian Squeeze uses grub2 package version 1.98 and Wheezy uses 1.99.
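The workaround can be wrapped in a small function, roughly like this. It is a sketch, not a tested procedure: instead of deleting the file it moves it to a .bak copy (my own deviation, so the old entries can still be inspected), and both the function name regenerate_device_map and the UPDATE_GRUB override variable are invented for this example so the function can be dry-run.

```shell
#!/bin/sh
# Move device.map out of the way, then let update-grub recreate it.
# regenerate_device_map and UPDATE_GRUB are hypothetical names for this sketch.
# Usage: regenerate_device_map [path-to-device.map]
regenerate_device_map() {
    map_file="${1:-/boot/grub/device.map}"
    update_cmd="${UPDATE_GRUB:-update-grub}"   # assumption: update-grub is in PATH
    if [ -e "$map_file" ]; then
        # Keep a backup rather than deleting the old file outright
        mv "$map_file" "$map_file.bak" || return 1
    fi
    "$update_cmd"
}
```

Run as root, regenerate_device_map would then take the place of the manual rm followed by update-grub.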
Update March 5th 2013: I had a similar issue today when I updated a Debian Squeeze system with the latest patches, including a kernel upgrade. The update itself went through without any error, but at reboot, grub didn't start up correctly. Besides the steps mentioned in this post, I additionally had to manually reinstall grub on the disks:
grub-install /dev/sda; grub-install /dev/sdb
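One thing about chaining the two commands with ";" is that a failure on the first disk is easy to miss. A loop like the following reports each disk separately. Again just a sketch: reinstall_grub_on and the GRUB_INSTALL override variable are names I made up here, and the disk list is whatever you pass in.

```shell
#!/bin/sh
# Run grub-install on every disk given as an argument and report failures.
# reinstall_grub_on and GRUB_INSTALL are hypothetical names for this sketch.
reinstall_grub_on() {
    install_cmd="${GRUB_INSTALL:-grub-install}"
    failed=0
    for disk in "$@"; do
        if "$install_cmd" "$disk"; then
            echo "ok: $disk"
        else
            echo "FAILED: $disk" >&2
            failed=1
        fi
    done
    # Non-zero exit status if any disk failed
    return "$failed"
}
```

For the server in this post, the call would be reinstall_grub_on /dev/sda /dev/sdb.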
I've had several boot issues after Debian updates (not even distro upgrade!) now... I'm kind of getting scared :-/
Update February 2nd 2014:
Wow - one year later and I've had a similar experience on Debian Wheezy. See Debian not booting: ALERT /dev/disk/by-uuid does not exist.