Unable to deploy Rancher managed Kubernetes in LXC/LXD containers due to kube-proxy and nf_conntrack_max values


Published on July 23rd 2021 - last updated on September 17th 2021 - Listed in Docker Kubernetes Rancher Internet Cloud Coding LXC Containers


Kubernetes requires a lot of privileges and sometimes wants to set its own kernel parameters. This breaks when Kubernetes is deployed (by Rancher) inside an LXC or LXD container: due to missing kernel (net) namespace privileges, the kube-proxy deployment fails and halts the whole Kubernetes deployment.

From the start please

Deployment of Kubernetes in a virtual machine (VMware, VirtualBox, KVM, etc.) usually works because the hypervisor emulates virtual hardware for the virtual machine. Inside the VM a full OS is installed, including its very own Linux kernel and kernel modules. The OS inside the VM has full control over all kernel parameters (via sysctl, for example).

However, inside a container, whether LXC or Docker, the surroundings are different. There is no virtual hardware emulated around the OS - even the OS is basically a "chrooted" directory, using (and sharing) the kernel of the host. For security reasons, containers are only allowed to change certain settings inside their own namespace. And this is where the problem with Kubernetes hits: it wants to change kernel settings which belong to the container's host.

Kubernetes in LXD (in theory)

There are a couple of tutorials available which mention that Kubernetes works inside LXD containers. A very good tutorial can be found on GitHub, written by Cornelius Weig.

The requirements to lay the foundation for running Kubernetes inside LXD are:

  • The LXD container must load additional Kernel modules
  • The LXD container needs to drop all security related restrictions (apparmor.profile, cap.drop and devices.allow)
  • The LXD container needs to mount procfs and sysfs with write permissions
  • The LXD container needs to allow nested containers (security.nesting)
  • The LXD container needs to run as privileged container (security.privileged)

This results in an LXD profile (named k8s) such as this:

root@host ~ # lxc profile show k8s
config:
  limits.cpu: "4"
  limits.memory: 4GB
  limits.memory.swap: "false"
  linux.kernel_modules: ip_tables,ip6_tables,nf_nat,overlay,br_netfilter
  raw.lxc: "lxc.apparmor.profile=unconfined\nlxc.cap.drop= \nlxc.cgroup.devices.allow=a\nlxc.mount.auto=proc:rw
    sys:rw"
  security.nesting: "true"
  security.privileged: "true"
description: Kubernetes Lab Profile
devices:
  eth0:
    name: eth0
    nictype: bridged
    parent: virbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: k8s
used_by: []
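
If you are building this from scratch, the profile above can be created and attached with commands along these lines (a sketch; the image and container name are just examples, and the YAML shown above is pasted into the editor opened by lxc profile edit):

```shell
# create an empty profile, then paste the YAML shown above into the editor
lxc profile create k8s
lxc profile edit k8s

# launch a container that uses the default profile plus the k8s profile
lxc launch ubuntu:20.04 lxdcontainer -p default -p k8s
```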

But even this is not enough.

Once the LXD containers are created and started, the container's root filesystem needs to be remounted as a shared mount (see Kubernetes cluster provisioning fails with error response / is not a shared mount). This can be handled with the /etc/rc.local file (create the file if it doesn't exist):

root@lxdcontainer:~# cat /etc/rc.local
#!/bin/bash

mount --make-shared /

exit

Make /etc/rc.local executable; the systemd service rc-local should then take care of this file (see /etc/rc.local does not exist anymore).

root@lxdcontainer:~# chmod 755 /etc/rc.local
root@lxdcontainer:~# systemctl restart rc-local

And as if this weren't enough, recent Kubernetes versions also require a /dev/kmsg device node to be available in the OS, so this needs to be created as well. In this case, the device node can be created as a symlink by systemd's tmpfiles.d:

root@lxdcontainer:~# echo 'L /dev/kmsg - - - - /dev/null' > /etc/tmpfiles.d/kmsg.conf

Reboot the LXD container afterwards.

Deploying a Rancher managed Kubernetes cluster

When a new Kubernetes cluster is created in Rancher, it waits for Kubernetes to be deployed on the nodes. So let's take the prepared lxdcontainer and fire up the Kubernetes deployment with the docker run command shown in Rancher's user interface:

root@lxdcontainer:~# sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run  rancher/rancher-agent:v2.5.9 --server https://rancher.example.com --token lpk4b444vb7xcwtn4qpdpqv964n4rtlvbfm2xcjkkq75gjg9jzn4kq --ca-checksum 449144c1445882641ff0d66593c5a1fc2eb0fa90994e95eb46efe0b341317afa --etcd --controlplane --worker

But after a while reality kicks in: the cluster deployment gets stuck on kube-proxy. The Rancher server logs (and the user interface) show the following error:

2021/07/23 07:40:55 [ERROR] Failed to upgrade worker components on NotReady hosts, error: [Failed to verify healthcheck: Failed to check http://localhost:10256/healthz for service [kube-proxy] on host [192.168.15.161]: Get "http://localhost:10256/healthz": dial tcp 127.0.0.1:10256: connect: connection refused, log: F0723 07:40:32.920751   19619 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied]

A closer look into the kube-proxy container logs reveals that it tried to set a different value for the nf_conntrack_max kernel setting:

root@lxdcontainer:~# docker logs --follow kube-proxy
[...]
W0723 07:30:08.920852   12314 server.go:226] WARNING: all flags other than --config, --write-config-to, and --cleanup are deprecated. Please begin using a config file ASAP.
I0723 07:30:08.920930   12314 feature_gate.go:243] feature gates: &{map[]}
I0723 07:30:08.921038   12314 feature_gate.go:243] feature gates: &{map[]}
W0723 07:30:08.921571   12314 proxier.go:651] Failed to read file /lib/modules/5.10.0-0.bpo.7-amd64/modules.builtin with error open /lib/modules/5.10.0-0.bpo.7-amd64/modules.builtin: no such file or directory. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0723 07:30:08.923115   12314 proxier.go:661] Failed to load kernel module ip_vs with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0723 07:30:08.924299   12314 proxier.go:661] Failed to load kernel module ip_vs_rr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0723 07:30:08.925830   12314 proxier.go:661] Failed to load kernel module ip_vs_wrr with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0723 07:30:08.927054   12314 proxier.go:661] Failed to load kernel module ip_vs_sh with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
W0723 07:30:08.928042   12314 proxier.go:661] Failed to load kernel module nf_conntrack with modprobe. You can ignore this message when kube-proxy is running inside container without mounting /lib/modules
I0723 07:30:08.943595   12314 node.go:172] Successfully retrieved node IP: 172.17.0.1
I0723 07:30:08.943627   12314 server_others.go:142] kube-proxy node IP is an IPv4 address (172.17.0.1), assume IPv4 operation
W0723 07:30:08.944553   12314 server_others.go:584] Unknown proxy mode "", assuming iptables proxy
I0723 07:30:08.944629   12314 server_others.go:182] DetectLocalMode: 'ClusterCIDR'
I0723 07:30:08.944652   12314 server_others.go:185] Using iptables Proxier.
I0723 07:30:08.944760   12314 proxier.go:287] iptables(IPv4) masquerade mark: 0x00004000
I0723 07:30:08.944857   12314 proxier.go:334] iptables(IPv4) sync params: minSyncPeriod=1s, syncPeriod=30s, burstSyncs=2
I0723 07:30:08.944977   12314 proxier.go:346] iptables(IPv4) supports --random-fully
I0723 07:30:08.945162   12314 server.go:650] Version: v1.20.8
I0723 07:30:08.945589   12314 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
F0723 07:30:08.945634   12314 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

The LXD container (lxdcontainer) has no privileges to change this setting as it comes from the host:

root@lxdcontainer:~# sysctl -w net.netfilter.nf_conntrack_max=131072
sysctl: setting key "net.netfilter.nf_conntrack_max"

root@lxdcontainer:~# sysctl -a |grep nf_conntrack_max
net.netfilter.nf_conntrack_max = 524288

The value remains at the host-given value.

The reason for this seems to be changed behaviour in the Linux kernel (since 4.10), which disallows non-init network namespaces from changing these settings. The LXD container's network namespace is not the host's initial network namespace, so writes to these conntrack settings are denied.
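
Whether a process runs in the initial network namespace can be checked by comparing its namespace identity with that of PID 1 on the host; inside an LXD container the two inode numbers differ:

```shell
# print the network namespace identity of the current process; compare
# this against "readlink /proc/1/ns/net" executed on the LXD host
readlink /proc/self/ns/net
```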

What is very annoying, however, is the following fact: the existing value is 524288 - which is higher than the 131072 that kube-proxy wants to set. Why would kube-proxy bother to set nf_conntrack_max to a specific value when a higher value is already in place?

Bug reporting and Kubernetes PR

At first a bug inside Rancher itself was assumed. A similar problem also happens when Rancher (the application) itself is deployed as a single Docker installation and the value is changed after the initial installation. Hence issue #33360 was created in the Rancher repositories. Deeper analysis revealed that the problem occurs inside the conntrack.go file, which is part of the kube-proxy application:

func (realConntracker) setIntSysCtl(name string, value int) error {
    entry := "net/netfilter/" + name

    sys := sysctl.New()
    if val, _ := sys.GetSysctl(entry); val != value {
        klog.Infof("Set sysctl '%v' to %v", entry, value)
        if err := sys.SetSysctl(entry, value); err != nil {

            return err
        }
    }
    return nil
}

Here we can see an if condition which compares the current system value (val) with the (internally) expected value (value). If they don't match, sys.SetSysctl is called and tries to set a new value for the setting.

As mentioned before, this would make sense when Kubernetes requires a minimum value but the current system value is lower than that minimum. However when the system value is higher than the expected value, Kubernetes should just accept this and move on. This is the basic description of the Kubernetes Pull Request #103174, which should cover that "workaround".

kube-proxy pull request for nf_conntrack_max workaround
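
The gist of the proposed change can be sketched in shell terms (a hypothetical helper, not the actual PR code): treat the desired value as a minimum and only write when the current value falls below it.

```shell
# decide whether a sysctl write is needed: the desired value is treated
# as a minimum, so an already-higher current value is simply accepted
needs_write() {
  current="$1"; desired="$2"
  if [ "$current" -lt "$desired" ]; then
    echo "write"
  else
    echo "skip"
  fi
}

needs_write 524288 131072   # host value already higher -> skip
needs_write 65536 131072    # host value too low -> write
```

With the original val != value comparison, the first case would still have triggered a (failing) write inside the LXD container.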

Once this PR is merged into the kube-proxy code, such a situation will simply be ignored and the kube-proxy deployment will no longer get stuck - meaning the Rancher Kubernetes deployment can continue.

Let's keep the fingers crossed that this PR soon makes it into Kubernetes and then also into a newer Rancher release.

By the way: a workaround in exactly the same direction is already in place, although only for the nf_conntrack hashsize setting. Looking at the very same conntrack.go file shows the following comment:

    // Linux does not support writing to /sys/module/nf_conntrack/parameters/hashsize
    // when the writer process is not in the initial network namespace
    // (https://github.com/torvalds/linux/blob/v4.10/net/netfilter/nf_conntrack_core.c#L1795-L1796).
    // Usually that's fine. But in some configurations such as with github.com/kinvolk/kubeadm-nspawn,
    // kube-proxy is in another netns.
    // Therefore, check if writing in hashsize is necessary and skip the writing if not.
    hashsize, err := readIntStringFile("/sys/module/nf_conntrack/parameters/hashsize")
    if err != nil {
        return err
    }
    if hashsize >= (max / 4) {
        return nil
    }

PR merged -> Kubernetes 1.23 will have it

Great news! The mentioned Kubernetes pull request has now been merged with a milestone of 1.23. That means that starting with Kubernetes 1.23, kube-proxy should be able to start in containers running on recent kernel versions. As long as the current kernel sysctl values are equal to or higher than what Kubernetes expects, this should work. At least on Ubuntu 20.04, the defaults are already higher than what kube-proxy attempts to set.

