AWS EC2 instances not booting after Ubuntu distribution upgrade (grub install to wrong NVMe device)



Ubuntu distribution upgrades, for example from 18.04 Bionic to 20.04 Focal, are most of the time pretty painless. Sure, sometimes a couple of major software upgrades require configuration adjustments, but in general the upgraded system should boot.

EC2 not booting after dist-upgrade

On AWS, however, the story is a little different. In the past few weeks, a bunch of EC2 instances needed to be upgraded from Ubuntu 18.04 to 20.04 - yet three out of four machines did not come back up after the final reboot! The instance screenshot showed that the machine had landed in the grub rescue shell:

[Screenshot: EC2 instance not booting after Ubuntu distribution upgrade]

Because all of the affected EC2 instances were Kubernetes cluster nodes, it was faster to deploy a new EC2 instance and join it to the Kubernetes cluster again than to try to fix the upgraded instance. And yes, connecting to an EC2 console is still very annoying and requires a lot of effort compared to other cloud providers (e.g. upCloud).

GRUB failed to install

During the latest dist-upgrade of yet another EC2 instance, something different happened towards the end of the upgrade: GRUB asked on which disk it should be installed. It offered two disks: /dev/nvme0n1 and /dev/nvme1n1.

This EC2 instance, like all the previously upgraded ones, has the same configuration: a primary disk (EBS) and a secondary disk (EBS) dedicated to containers and mounted on /var/lib/docker.

Of course I went with the default choice and installed GRUB on the first disk, /dev/nvme0n1. But - big surprise - that didn't work:

[Screenshot: grub install failed on the EC2 instance]

With the dist-upgrade done and the machine still up, the NVMe disks can be verified:

root@ubuntu:~# ll /dev/nvme*
crw------- 1 root root 246, 0 Nov 11 16:06 /dev/nvme0
brw-rw---- 1 root disk 259, 1 Nov 11 16:06 /dev/nvme0n1
crw------- 1 root root 246, 1 Nov 11 16:06 /dev/nvme1
brw-rw---- 1 root disk 259, 0 Nov 11 16:06 /dev/nvme1n1
brw-rw---- 1 root disk 259, 2 Nov 11 16:06 /dev/nvme1n1p1

So here we have the two NVMe drives. But which one is which?
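
A quick way to get an overview is lsblk, which shows sizes, partitions and mount points of all block devices in one go (a minimal sketch; lsblk is part of util-linux and available on a default Ubuntu install, the exact column selection is just a suggestion):

root@ubuntu:~# lsblk -o NAME,SIZE,TYPE,MOUNTPOINT

The device carrying the partition mounted on / is the boot disk. For the partition details of each disk, fdisk does the job: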

root@ubuntu:~# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: Amazon Elastic Block Store              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Oh? There is no partition table on this drive at all, hence no boot partition. And it's a 50 GB drive - that means this is the secondary EBS disk!

Let's check out what this machine considers to be the second NVMe drive:

root@ubuntu:~# fdisk -l /dev/nvme1n1
Disk /dev/nvme1n1: 30 GiB, 32212254720 bytes, 62914560 sectors
Disk model: Amazon Elastic Block Store              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x11c9238f

Device         Boot Start      End  Sectors Size Id Type
/dev/nvme1n1p1 *     2048 62914526 62912479  30G 83 Linux

Here we go; this is our primary EBS disk of 30 GB with the boot partition.

So the actual boot device is /dev/nvme1n1, not /dev/nvme0n1! It turns out the EC2 instance swapped the ordering of the drives - resulting in GRUB being installed on the wrong device during the automated dist-upgrade.
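
Because the kernel does not guarantee a stable NVMe enumeration order, it can help to double-check which device actually holds the root file system and to refer to the disks by their stable symlinks under /dev/disk/by-id/ (a minimal sketch; findmnt and the by-id symlinks are present on a default Ubuntu install, the exact symlink names depend on the EBS volume IDs):

root@ubuntu:~# findmnt -n -o SOURCE /
root@ubuntu:~# ls -l /dev/disk/by-id/ | grep nvme

On this machine findmnt should print /dev/nvme1n1p1, matching the fdisk output above, while the by-id listing maps each EBS volume ID to its current /dev/nvmeXn1 name.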

Manually install GRUB on the correct device

To fix this, the GRUB boot loader can be installed manually on the correct (boot) device - /dev/nvme1n1 in this situation:

root@ubuntu:~# grub-install /dev/nvme1n1
Installing for i386-pc platform.
Installation finished. No error reported.

The EC2 instance can now be rebooted and the machine comes back up again.
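
To avoid the same surprise on the next grub-pc package upgrade, the chosen install device can also be stored persistently in debconf (a sketch, assuming the BIOS/i386-pc variant of the grub-pc package; the by-id path below is only a placeholder and must be replaced with the symlink of the actual boot EBS volume):

root@ubuntu:~# echo "grub-pc grub-pc/install_devices multiselect /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0123456789abcdef0" | debconf-set-selections
root@ubuntu:~# dpkg-reconfigure --frontend noninteractive grub-pc

Using the by-id symlink instead of /dev/nvme1n1 keeps the setting independent of the NVMe enumeration order at the next boot.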

