Docker upgrade on Kubernetes node (Ubuntu) failed: The aufs storage-driver is no longer supported

Written by Claudio Kuenzler

Published on November 11th 2021 - Listed in Rancher Kubernetes Docker Cloud Containers

While upgrading an Ubuntu 18.04 machine (in preparation for the 20.04 dist-upgrade), the upgrade process failed at the docker.io package with the following error:

Selecting previously unselected package docker.io.
(Reading database ... 173993 files and directories currently installed.)
Preparing to unpack .../docker.io_20.10.7-0ubuntu5~18.04.3_amd64.deb ...
The aufs storage-driver is no longer supported.
Please ensure that none of your containers are
using the aufs storage driver, remove the directory
/var/lib/docker/aufs and try again.
dpkg: error processing archive /var/cache/apt/archives/docker.io_20.10.7-0ubuntu5~18.04.3_amd64.deb (--unpack):
new docker.io package pre-installation script subprocess returned error exit status 1
Errors were encountered while processing:
/var/cache/apt/archives/docker.io_20.10.7-0ubuntu5~18.04.3_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

This particular machine has been part of a Kubernetes cluster since 2018. Back then, "aufs" was the default storage driver for the Docker engine. overlay and overlay2 followed after aufs, but a storage driver migration never happened on this machine.
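A quick check before the upgrade would have revealed the problem early. The following is only a minimal sketch and not part of the original procedure: the function name has_aufs_dir is hypothetical, and it merely looks for the aufs directory below the Docker data root, which is the directory the error message refers to.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given Docker data root still contains
# an "aufs" directory. The default path is the one named in the error message.
has_aufs_dir() {
    docker_root="${1:-/var/lib/docker}"
    [ -d "$docker_root/aufs" ]
}

# Example use before starting the upgrade:
if has_aufs_dir; then
    echo "aufs directory found: sort out the storage driver before upgrading docker.io"
else
    echo "no aufs directory found"
fi
```

The check only tests for the directory. It does not tell whether containers are actually using aufs.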

Fixing the stuck upgrade

The apt dist-upgrade process is stuck at this point. As a first attempt, the docker.io package was removed and then installed again manually. This resulted in the same error as during the upgrade.

Next attempt: do what the error message suggests, but create a backup first. An additional disk was attached to this machine, formatted and mounted at /mnt:

root@node1:~# systemctl stop docker
root@node1:~# mount /dev/sdc1 /mnt
root@node1:~# cp -au /var/lib/docker /mnt/docker.bk

Now let's remove the aufs directory:

root@node1:~# rm -rf /var/lib/docker/aufs/
rm: cannot remove '/var/lib/docker/aufs/mnt/7c5f34b4819b30c20893e46ea72bfb1e776ccc157847bc862e604edecbc7f717': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/a00586d3ef52026693f2602c29bd310f2447bab14162127946e78eb05dcf6e13': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/391e1c7bfd9a3dd83f4d5efecd16359d26f1c9ec14a710aac2d53be90215f667': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/a00ebceab4eb3d06678ed11b86807bd8adac0fd0d8536b399264785ead99981d': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/cc8a4f3e00de7293e11c3da9bdb16552144439f7e33178fcdacb70a83cb38b4c': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/f04fd3244cb8b01b2c27e63840ed72e5207cab6b41430f63a8d51411ac7d4782': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/d2b9fd29d5e32fd6935ad64928840f55787aee5c8dfc66f8793125129f8bb27e': Is a directory
rm: cannot remove '/var/lib/docker/aufs/mnt/881eac79990a998ea0bf25c9c2257201a45364f61cb6a557d87ca6a1ae7db0df': Is a directory

Certain directories could not be removed, even though the Docker service was stopped. Checking the aufs directory with lsof did not help either: lsof was unable to stat these directories, as it actually needs Docker to be running. The machine required a reboot to fully "release" these aufs-mounted directories:

root@node1:~# reboot
root@node1:~# rm -rf /var/lib/docker/aufs/

This time it worked.
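One possible explanation is that the directories were still mount points. Under that assumption, the leftover mounts can be listed by reading /proc/mounts, as in the sketch below. The function name mounts_below is hypothetical, and whether unmounting these paths would have avoided the reboot on this machine was not tested.

```shell
#!/bin/sh
# Hypothetical helper: print every mount point located below a directory.
# It reads the /proc/mounts format (device, mount point, type, ...).
# The optional second argument points it at a different file, e.g. for testing.
mounts_below() {
    dir="${1%/}"
    mounts_file="${2:-/proc/mounts}"
    awk -v prefix="$dir/" 'index($2, prefix) == 1 { print $2 }' "$mounts_file"
}

# Example: show what is still mounted below the aufs directory.
if [ -r /proc/mounts ]; then
    mounts_below /var/lib/docker/aufs
fi
```

Each path printed this way could then be unmounted with umount before retrying the rm command.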

Before installing the docker.io package again, let's make sure that overlay2 will be used as the storage driver. This can be set in the /etc/docker/daemon.json file:

root@node1:~# cat /etc/docker/daemon.json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "5"
  },
  "storage-driver": "overlay2"
}
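overlay2 relies on the overlay filesystem of the kernel. Here is a small sketch to check this before starting Docker, assuming a Linux host where /proc/filesystems lists the filesystems the running kernel currently knows. The function name kernel_knows_fs is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named filesystem type is listed in
# /proc/filesystems. The optional second argument points it at a different file.
kernel_knows_fs() {
    fs_name="$1"
    fs_file="${2:-/proc/filesystems}"
    # The filesystem name is the last field of each line ("nodev" may precede it).
    awk -v fs="$fs_name" '$NF == fs { found = 1 } END { exit !found }' "$fs_file"
}

if [ -r /proc/filesystems ] && kernel_knows_fs overlay; then
    echo "overlay is available"
else
    echo "overlay is not listed (the kernel module may not be loaded yet)"
fi
```

A missing entry does not necessarily mean overlay is unsupported. The module may simply not be loaded yet, in which case "modprobe overlay" would be the next thing to try.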

Install docker.io again:

root@node1:~# apt-get install docker.io

This worked without any error. Let's check the currently used storage driver of the Docker engine:

root@node1:~# docker info
Client:
 Context: default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 0
 Server Version: 20.10.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
[...]
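When this check needs to be scripted, the storage driver can be extracted from the docker info text. The sketch below is an illustration only: the function name storage_driver_from_info is hypothetical, and it assumes the "Storage Driver:" line looks like the one in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read "docker info" text on stdin and print the value
# of the first "Storage Driver:" line, ignoring leading indentation.
storage_driver_from_info() {
    sed -n 's/^[[:space:]]*Storage Driver:[[:space:]]*//p' | head -n 1
}

# Usage on a Docker host:
#   docker info 2>/dev/null | storage_driver_from_info
```

A shorter alternative is docker info --format '{{.Driver}}', which prints just the driver name.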

This looks fine - but what about the containers which are supposed to be running on this machine? The Docker service is now running but without any containers. The node therefore is unaware that it is part of a Kubernetes cluster. In the Kubernetes overview this machine (cluster node) is seen as unavailable.

There is no aufs to overlay migration strategy!

The official Docker overlayfs documentation just mentions how to (manually) define overlay or overlay2 as storage driver in the Docker engine. However, it says nothing about how to migrate containers from aufs to overlay/overlay2. Even with a backup of the containers available (in /mnt/docker.bk), there is no (documented) way to migrate the aufs containers to overlayfs.

Did this catch me on the wrong foot? - Definitely.

Am I the only one? - Not at all. Ubuntu bug #1939106 is exactly about this package upgrade problem.

Does it make sense to switch to overlayfs? - Yes, and not only from a performance point of view. If the affected machine had not been in use in a Kubernetes cluster since 2018, the newer overlayfs driver would already have been the default (as seen when doing OS upgrades on other, newer clusters without any problems).

What do others do in this situation? - Some suggest saving the images and loading them back in after switching to overlayfs, but this only works for images, not for the containers (and their data). Others have their containers (manually) set up using a docker-compose configuration, which makes sure the required images are pulled again. But here we are in a Kubernetes cluster situation. The node is supposed to connect back to the cluster and retrieve instructions from the kube-api. It won't do that, because no kubelet container is running.

The Kubernetes way: Remove node and add back to cluster again

After evaluating different scenarios, including an image and container backup as described in this excellent article by William Meleyal, I finally decided to go with the typical cluster approach: remove the node from the cluster, reset it and add it back to the cluster.

As this machine is part of the "local" cluster of a Rancher managed Kubernetes cluster (this means that Rancher itself runs on this cluster), RKE needs to be used to remove the cluster node. The cluster yaml was adjusted, disabling the machine in question by commenting it out, and then rke up was run:

ck@mgmt:~/rancher$ ./rke_linux-amd64-1.3.1 up --config 3-node-rancher-n111.yml
[...]
INFO[0036] [remove/etcd] Removing member [etcd-192.168.253.15] from etcd cluster
INFO[0037] [remove/etcd] Checking etcd cluster health on [etcd-192.168.253.16] after removing [etcd-192.168.253.15]
INFO[0038] [etcd] etcd host [192.168.253.16] reported healthy=true
INFO[0038] [remove/etcd] etcd cluster health is healthy on [etcd-192.168.253.16] after removing [etcd-192.168.253.15]
INFO[0038] [remove/etcd] Successfully removed member [etcd-192.168.253.15] from etcd cluster
INFO[0038] [hosts] host [192.168.253.15] has another role, skipping delete from kubernetes cluster
INFO[0038] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0054] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0059] [dialer] Setup tunnel for host [192.168.253.15]
WARN[0063] [reconcile] Couldn't clean up etcd node [192.168.253.15]: Not able to reach the host: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access the service on /var/run/docker.sock. The service might be still starting up. Error: ssh: rejected: connect failed (open failed)
INFO[0063] [reconcile] Check etcd hosts to be added
INFO[0063] [hosts] host [192.168.253.15] has another role, skipping delete from kubernetes cluster
INFO[0063] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0068] [dialer] Setup tunnel for host [192.168.253.15]
INFO[0073] [dialer] Setup tunnel for host [192.168.253.15]
WARN[0077] [reconcile] Couldn't clean up worker node [192.168.253.15]: Not able to reach the host: Can't retrieve Docker Info: error during connect: Get "http://%2Fvar%2Frun%2Fdocker.sock/v1.24/info": Unable to access the service on /var/run/docker.sock. The service might be still starting up. Error: ssh: rejected: connect failed (open failed)
INFO[0077] [hosts] Cordoning host [192.168.253.15]
INFO[0077] [hosts] Deleting host [192.168.253.15] from the cluster
INFO[0077] [hosts] Successfully deleted host [192.168.253.15] from the cluster
[...]
INFO[0327] Finished building Kubernetes cluster successfully

After rke finished, only two nodes remained in the cluster.

Now our cluster node can be properly reset and cleaned up to be added back into the cluster.

Once this is done, the cluster yaml is adjusted again, this time enabling our problematic machine again. This is followed by another rke run:

ck@mgmt:~/rancher$ ./rke_linux-amd64-1.3.1 up --config 3-node-rancher-n111.yml
INFO[0000] Running RKE version: v1.3.1
INFO[0000] Initiating Kubernetes cluster
[...]
INFO[0036] [reconcile] Reconciling cluster state
INFO[0036] [reconcile] Check etcd hosts to be deleted
INFO[0036] [reconcile] Check etcd hosts to be added
INFO[0038] [add/etcd] Adding member [etcd-192.168.253.15] to etcd cluster
INFO[0039] [add/etcd] Successfully Added member [etcd-192.168.253.15] to etcd cluster
[...]
INFO[0090] Pre-pulling kubernetes images
INFO[0090] Pulling image [rancher/hyperkube:v1.20.11-rancher1] on host [192.168.253.15], try #1
[...]
INFO[0214] Starting container [kube-apiserver] on host [192.168.253.15], try #1
INFO[0215] [controlplane] Successfully started [kube-apiserver] container on host [192.168.253.15]
INFO[0215] [healthcheck] Start Healthcheck on service [kube-apiserver] on host [192.168.253.15]
INFO[0225] [healthcheck] service [kube-apiserver] on host [192.168.253.15] is healthy
[...]
INFO[0370] Finished building Kubernetes cluster successfully

A few seconds after rke finished, our node1 appeared in the "local" cluster again.

TL;DR: From aufs to overlayfs in a Kubernetes cluster

When you need to migrate from aufs to overlayfs (which is required if you still use aufs and want to install Docker 20.10.x) and the machine is part of a Kubernetes cluster, use the following steps:

Drain the cluster node
Remove the node from the cluster
Run a full Rancher reset/clean up on this node
Make sure docker info shows overlay2 as storage driver; if it doesn't, adjust /etc/docker/daemon.json and restart Docker
Optional: Now upgrade the docker.io package or the full OS (apt dist-upgrade)
Add the node back to the cluster
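
The steps above can be sketched as a small script. This is an illustration under assumptions, not a tested procedure: the helper names run, drain_node, rke_up and upgrade_docker are hypothetical, the node is assumed to be registered in Kubernetes as node1, and the Rancher reset/clean up step is left out because it depends on the setup. Editing the cluster yaml (commenting the node out and in again) remains a manual step between the two rke runs.

```shell
#!/bin/sh
# Commands are only printed unless DRY_RUN=0 is set, so the plan can be
# reviewed before anything touches the cluster.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Drain: evict workloads from the node. Pods managed by a DaemonSet are not
# evicted, so kubectl has to be told to ignore them.
drain_node() {
    run kubectl drain "$1" --ignore-daemonsets
}

# Remove / add back: with RKE this means editing the cluster yaml by hand
# and then running "rke up" against it.
rke_up() {
    rke_bin="$1"
    cluster_yml="$2"
    run "$rke_bin" up --config "$cluster_yml"
}

# Optional: install or upgrade the docker.io package.
upgrade_docker() {
    run apt-get install docker.io
}

# Example plan, using the file names from this article:
drain_node node1
rke_up ./rke_linux-amd64-1.3.1 3-node-rancher-n111.yml
upgrade_docker
rke_up ./rke_linux-amd64-1.3.1 3-node-rancher-n111.yml
```

With the default dry-run mode the script only prints the commands, prefixed with "+". Setting DRY_RUN=0 executes them instead.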
