These days I'm testing Rancher as a potential candidate for a new Docker infrastructure. It looks appealing so far: Rancher has a clean, intuitive user interface and, more importantly, an API to trigger container creation automatically (for example from Travis).
During a failover test I rebooted one of the Rancher hosts, and when it came back up, connectivity to Rancher was lost. Why? Because I had forgotten to add the separate file system for /var/lib/docker, which I had prepared as a logical volume, to /etc/fstab. All previous Docker data was therefore gone, including the rancher-agent container.
Unfortunately I didn't spot the mistake right away and simply decided to remove the host in Rancher and re-add it manually. Of course, when I later fixed the file system mount problem and rebooted, Rancher would not connect anymore, because in the meantime a new rancher-agent with a new ID had been installed.
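For reference, the missing piece was a single /etc/fstab entry for the dedicated Docker file system. A sketch of such an entry (the volume group/logical volume name vg0/docker and the ext4 file system type are assumptions for illustration; adjust to your setup):

```
# /etc/fstab: mount the dedicated LV for Docker data at boot
# (vg0/docker and ext4 are hypothetical; adapt to your environment)
/dev/mapper/vg0-docker  /var/lib/docker  ext4  defaults  0  2
```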
To force a reset or cleanup of the Rancher host, one can do the following:
1. Deactivate the affected host in Rancher, then remove the host
2. Stop Docker service
service docker stop
3. Remove Docker and Rancher data:
rm -rf /var/lib/docker/*
rm -rf /var/lib/rancher/*
4. Start Docker service
service docker start
5. Add the host in Rancher
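The steps above can be combined into a small reset script. This is a minimal sketch; the run helper and its DO_IT switch are my additions so the script only prints the commands by default, since running it for real wipes all local Docker data:

```shell
#!/bin/sh
# Reset a Rancher 1.x host (run only after removing the host in Rancher).
# By default this only prints the commands; set DO_IT=1 to execute them,
# which destroys ALL local Docker and Rancher data on the host.
set -eu

run() {
    if [ "${DO_IT:-0}" = "1" ]; then
        "$@"                      # really execute (requires root)
    else
        echo "would run: $*"      # dry run: only show the command
    fi
}

run service docker stop
# Wipe Docker and Rancher state so the host registers as a fresh agent
run sh -c "rm -rf /var/lib/docker/* /var/lib/rancher/*"
run service docker start
```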
The above commands apply to a Rancher 1.x environment. In Rancher 2.x more directories must be cleaned up:
1. Deactivate (drain) the affected host in Rancher, then remove it, either in the Rancher UI or, for the "local" cluster, in RKE's YAML config.
2. Stop Docker service
service docker stop
3. Remove Docker, Rancher, RKE and Kubernetes related data:
mount | grep kubelet | awk '{print $3}' | while read -r mountpoint; do umount "$mountpoint"; done
rm -rf /var/lib/docker/*
rm -rf /var/lib/rancher/*
rm -rf /var/lib/etcd
rm -rf /var/lib/kubelet/*
rm -rf /etc/kubernetes
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/cni
rm -rf /var/run/calico
rm -rf /run/secrets/kubernetes.io
test -d /opt/rancher && rm -rf /opt/rancher # For Single Rancher installs
test -d /opt/containerd && rm -rf /opt/containerd
test -d /opt/rke && rm -rf /opt/rke
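The removal steps can also be expressed as a loop over the directory list. A minimal sketch; the cleanup_rancher_state function name and its root-prefix parameter are my additions so the logic can be exercised safely (on a real, drained host you would pass "/", after unmounting the kubelet mounts as shown above):

```shell
#!/bin/sh
# Sketch: remove Rancher 2.x / RKE / Kubernetes state below a root prefix.
# WARNING: with root "/" this destroys all local cluster and Docker data.
cleanup_rancher_state() {
    root="${1:?usage: cleanup_rancher_state <root>}"
    # Same directory list as the manual rm commands above; the /opt
    # entries may not exist on every host, hence the existence check.
    for dir in var/lib/docker var/lib/rancher var/lib/etcd var/lib/kubelet \
               etc/kubernetes etc/cni opt/cni var/lib/cni var/run/calico \
               run/secrets/kubernetes.io opt/rancher opt/containerd opt/rke; do
        if [ -d "$root/$dir" ]; then
            rm -rf "$root/$dir"
        fi
    done
}
```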
4. Restart Docker service
service docker restart
Yes, although the Docker service was previously stopped, a simple "start" does not re-create the directories within /var/lib/docker (since Docker 20.10.x; see the article "Docker unable to pull images after clean up" for more information):
root@node:~# service docker start
root@node:~# ll /var/lib/docker/
total 0
A service restart however re-creates the missing directories:
root@node:~# service docker restart
root@node:~# ll /var/lib/docker/
total 44
drwx--x--x 4 root root 4096 Nov 11 14:06 buildkit
drwx--x--- 2 root root 4096 Nov 11 14:06 containers
drwx------ 3 root root 4096 Nov 11 14:06 image
drwxr-x--- 3 root root 4096 Nov 11 14:06 network
drwx--x--- 3 root root 4096 Nov 11 14:06 overlay2
drwx------ 4 root root 4096 Nov 11 14:06 plugins
drwx------ 2 root root 4096 Nov 11 14:06 runtimes
drwx------ 2 root root 4096 Nov 11 14:06 swarm
drwx------ 2 root root 4096 Nov 11 14:06 tmp
drwx------ 2 root root 4096 Nov 11 14:06 trust
drwx-----x 2 root root 4096 Nov 11 14:06 volumes
5. Add the host to a cluster using the sudo docker... command (shown in the Rancher UI) or via the RKE YAML config
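To guard against the start-vs-restart gotcha described above, one can check whether the Docker data root was actually repopulated before re-adding the host. A minimal sketch; dir_is_empty is a hypothetical helper name:

```shell
#!/bin/sh
# Returns success (0) when a directory exists but contains no entries,
# as /var/lib/docker does after "service docker start" on Docker >= 20.10
# following a cleanup.
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Intended use on the host (requires root):
#   service docker start
#   if dir_is_empty /var/lib/docker; then
#       service docker restart
#   fi
```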