Lessons learned: Do not put stateful Minio into a Docker container



On a Big Data playground we've built (Spark, R, Minio) on top of a Rancher-managed Docker environment, we also put the Minio object storage into Docker containers. We were skeptical at first (don't create stateful containers!), but Rancher offers Minio directly from its Cattle catalog.

We decided to go for it: it's a playground environment, and in the worst case we'd start from scratch again.

At first everything seemed to work fine. The containers started up and, using their internal Docker IP addresses, Minio was launched in each container (four containers, spread across four physical hosts). On each physical host, three volumes were created and mounted into the Minio container. The volume mounts of one of the containers:

# docker inspect -f '{{ .Mounts }}' 5ca3f3177e27
[{volume minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4 /var/lib/docker/volumes/minio_minio-scheduler-setting_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_2b6e4/_data /opt/scheduler local rw true } {volume minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19 /var/lib/docker/volumes/minio_minio-data-0_3_bc048e5e-2cbd-4f5d-8ab7-ef835e7424af_76e19/_data /data/disk0 local rw true } {volume rancher-cni /var/lib/docker/volumes/rancher-cni/_data /.r local ro false } {volume 42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826 /var/lib/docker/volumes/42d6f216267ced36e186b3e082ae6b3c7ad53085432326220c583ab022842826/_data /data local  true }]

Minio itself was launched inside the container in the following way:

bash-4.3# ps auxf
PID   USER     TIME   COMMAND
    1 root       0:00 s6-svscan -t0 /var/run/s6/services
   38 root       0:00 s6-supervise s6-fdholderd
  323 root       0:00 s6-supervise minio
  326 root       0:00 sh ./run
  330 minio      0:00 sh -c /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0
  332 minio      0:27 /opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http:/
  430 root       0:00 bash
  436 root       0:00 ps auxf

The command line of this process (PID 330) reveals more:

bash-4.3# cat /proc/330/cmdline
sh-c/opt/minio/bin/minio server http://10.42.84.239/data/disk0 http://10.42.224.2/data/disk0 http://10.42.55.111/data/disk0 http://10.42.116.171/data/disk0

And, no surprise, here we can find the four IP addresses of the four Minio containers.
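
For reference, this is the general form of a distributed Minio start command. Every node has to be started with the complete, identical list of endpoints (the node names below are just placeholders), which is exactly why the hard-coded container IPs become a problem later on:

/opt/minio/bin/minio server http://node1/data/disk0 http://node2/data/disk0 \
                            http://node3/data/disk0 http://node4/data/disk0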

We launched some basic tests to make sure the setup would hold:

  • Upscaling: We tried to scale from 4 to 5 minio-server containers. The 5th server didn't come up. This is most likely because it is not aware of the other minio-server containers (that awareness is only established during the initial setup through the catalog) and it would need the correct start command containing all the other containers' IPs. Even if the 5th server had come up, there is no way for the existing containers to suddenly become aware of a new Minio server; they, too, would need a changed Minio start command. TL;DR: Minio servers deployed through the catalog cannot be horizontally scaled; they're fixed once deployed.
  • Same IP: As the command line shows, the IP addresses are hard-coded in the Minio start command. We deleted a container, Rancher re-created it, and it kept the same IP address.
  • Volumes: We wanted to see whether data is lost when a container is stopped or deleted. Even after deleting a minio-server container, the volumes still exist on the Rancher host, and when the container is re-created, the same volumes are re-attached to the new container. Therefore no data loss (see the sketch after this list).
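
A hedged sketch of the commands behind the volume test (the container ID is the one from the docker inspect output above; the exact session wasn't logged):

# docker volume ls | grep minio                      # note the existing Minio volumes
# docker rm -f 5ca3f3177e27                          # delete one minio-server container
# docker volume ls | grep minio                      # the volumes still exist on the host
# docker ps | grep minio                             # Rancher has re-created the container
# docker inspect -f '{{ .Mounts }}' <new container>  # the same volumes are re-attached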

So far so good, but one thing caught my eye: the minio-server service, automatically built by deploying from the catalog, was set to "Scale" and not to "Global". The latter, Global, is usually my first choice for running services across hosts, and Minio is by definition such a service.
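
For comparison, a "Global" service in a Rancher (Cattle) environment is declared with a scheduler label in the compose file, roughly like this sketch (not the catalog's actual template, and it wouldn't solve the fixed start command by itself; image, command and volume names are assumptions):

# docker-compose.yml (sketch): one minio-server container per host via a global service
minio-server:
  image: minio/minio
  command: server /data/disk0
  labels:
    io.rancher.scheduler.global: 'true'
  volumes:
    - minio-data-0:/data/disk0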

When we had to take down physical host #1 for maintenance, the problems started. Due to the "Scale" setting of the Minio server service, another Minio container was launched on one of the three remaining physical hosts. So on Docker host #2 we now had two Minio containers running. Even though the newly created container on host #2 used the same IP address, it had to create a new data volume on that host, so the two Minio containers on host #2 now used twice as much disk space as before.

When host #1 was up again, we tried to place the Minio container back on host #1. Unfortunately, in Rancher containers cannot be moved around like a VM in VMware. So we deleted the second Minio container on host #2 and hoped Rancher would automatically create a new Minio container (scale: 4, remember) on host #1 again. It didn't: the new Minio container was again launched on host #2. I had to scale down and then up again, and that worked: the new Minio container was finally created on host #1 again. But to my shock it was suddenly assigned a new IP address instead of the previous one (probably because of my scaling down). This broke the communication with the other Minio containers, because they were not aware of the IP change of one of their Minio partner nodes.

Another problem: on node #2 the duplicated volumes were still on the system, using disk space:

# docker volume ls | grep minio
local               minio_minio-data-0_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_a745e
local               minio_minio-data-0_2_c416a26e-b724-49af-bd37-751813a0a586_10db2
local               minio_minio-scheduler-setting_1_fb18fef0-a34b-4f78-9628-ca66abe4b057_919c9
local               minio_minio-scheduler-setting_2_c416a26e-b724-49af-bd37-751813a0a586_fd842
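
Those orphaned volumes have to be cleaned up manually. A sketch, assuming you first double-check which volumes the currently running container actually uses:

# docker inspect -f '{{ .Mounts }}' <running minio container>         # find the volumes in use
# docker volume rm <unused minio_minio-data-0_... volume>             # remove the duplicate data volume
# docker volume rm <unused minio_minio-scheduler-setting_... volume>  # and the duplicate scheduler volume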

At first we had some "hope" that the service would recover by itself once each container was restarted, but the following error message in the Minio user interface pretty much said nay (Multiple disk failures, unable to write data.):

Minio as Docker container: Multiple disk failures

Luckily we were still in playground mode, before any actual/important data had been created and stored in the Minio object store. I decided to create static LXC containers for Minio and use the docker0 bridge to connect the (stateless) application Docker containers to Minio.
I was able to attach the LXC container "minio01" to the docker0 bridge, and the container could communicate with the Docker containers on the same physical host. But, and this is the show stopper, it was not able to communicate with the Docker containers running on other hosts (in the same Rancher environment). The reason is that Rancher uses some magic (technically speaking: an IPSEC tunnel between the Docker hosts) to "extend" the Docker-internal network (10.42.0.0/16) onto each Docker host. According to the Rancher documentation, it is possible to "join" this IPSEC-tunneled network by manually running a Docker container with a specific label:

For any containers launched through the Docker CLI, an extra label --label io.rancher.container.network=true can be used to launch the container into the managed network. Without this label, containers launched from the Docker CLI will be using the bridge network.
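
For a plain Docker container that would look roughly like this (a sketch; image and command are arbitrary):

# docker run -d --label io.rancher.container.network=true alpine sleep 3600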

But of course this only applies to containers started and managed by Docker, not to LXC Linux Containers (or "system containers", as they're called nowadays).

The Docker containers have one thing in common though: they use the Docker host as router/gateway, which means they can reach the physical network through iptables NAT rules. So I created another virtual bridge called virbr0 alongside docker0, attached the host's primary physical interface to it, and assigned the host's IP address to the virbr0 interface:

# The primary network interface
iface enp15s0 inet manual

# Virtual bridge
auto virbr0
iface virbr0 inet static
        address 10.150.1.10
        netmask 255.255.255.128
        network 10.150.1.0
        broadcast 10.150.1.127
        gateway 10.150.1.1
        # bridge control
        bridge_ports enp15s0
        bridge_fd 0
        pre-up brctl addbr virbr0
        pre-down brctl delbr virbr0
        # dns-* options are implemented by the resolvconf package, if installed
        dns-nameservers 8.8.8.8 1.1.1.1
        dns-search playground.local
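
After bringing the bridge up (ifup virbr0 or a reboot), it can be verified roughly like this:

# brctl show virbr0        # enp15s0 should be listed as a bridge port
# ip addr show virbr0      # the host address 10.150.1.10/25 now lives on the bridge
# ip route show default    # default route via 10.150.1.1 on virbr0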

The LXC container now uses this virtual bridge virbr0 as its network link and the host's own IP address as its gateway:

# cat /var/lib/lxc/minio01/config
# Template used to create this container: /usr/share/lxc/templates/lxc-ubuntu
# Parameters passed to the template: --release xenial
# Template script checksum (SHA-1): 4d7c613c3c0a0efef4b23917f44888df507e662b
# For additional config options, please look at lxc.container.conf(5)

# Uncomment the following line to support nesting containers:
#lxc.include = /usr/share/lxc/config/nesting.conf
# (Be aware this has security implications)


# Common configuration
lxc.include = /usr/share/lxc/config/ubuntu.common.conf

# Container specific configuration
lxc.rootfs = /var/lib/docker/lxc/minio01
lxc.rootfs.backend = dir
lxc.utsname = minio01
lxc.arch = amd64
lxc.start.auto = 1

# Network configuration
lxc.network.type = veth
lxc.network.link = virbr0
lxc.network.flags = up
lxc.network.hwaddr = 00:16:3e:10:ee:90
lxc.network.ipv4 = 10.150.1.50/25
lxc.network.ipv4.gateway = 10.150.1.10
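
The container itself was created with the standard Ubuntu template (see the comments at the top of the config); roughly like this, leaving out the adjustment of the rootfs path:

# lxc-create -n minio01 -t ubuntu -- --release xenial
# lxc-start -n minio01 -d
# lxc-attach -n minio01 -- ip addr show eth0    # should show 10.150.1.50/25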

Now the Docker containers and the LXC container have one thing in common: they use the same physical host as gateway. This means they share the same routing table and can therefore communicate with each other.
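
Inside the LXC containers, Minio is then started the same way as before, just with the static LXC addresses instead of the Docker container IPs; the addresses of minio02 to minio04 below are assumptions:

/opt/minio/bin/minio server http://10.150.1.50/data/disk0 http://10.150.1.51/data/disk0 \
                            http://10.150.1.52/data/disk0 http://10.150.1.53/data/disk0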

But how do I tell the Docker application containers where Minio is located now? In Rancher this is possible with an "External Service", in which you can set external IP addresses outside of the Docker container range:

Rancher External Service
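
The same external service can also be described with Rancher Compose; a hedged sketch (syntax as far as I remember it from the Rancher Compose docs), assuming the four Minio LXC containers use the addresses 10.150.1.50-53:

# docker-compose.yml (sketch)
minio-external:
  image: rancher/external-service

# rancher-compose.yml (sketch)
minio-external:
  external_ips:
    - 10.150.1.50
    - 10.150.1.51
    - 10.150.1.52
    - 10.150.1.53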

And this external service can now be used in the internal Rancher load balancer:

Rancher LB using external Service

Although Minio no longer runs inside the Docker environment managed by Rancher, it still uses the same physical network connections (gateways) of the same hosts and therefore doesn't lose any performance. On the other hand, we feel much more comfortable and "safe" with the persistent data living in an LXC container.

