How to solve Kubernetes upgrade in Rancher 2 failing with remote error: tls: bad certificate


Published on January 28th 2021 - Listed in Docker Kubernetes Rancher Internet Cloud


When an (old) Rancher 2 managed Kubernetes cluster needed to be upgraded from Kubernetes 1.14 to 1.15, the upgrade failed, even though all the steps from the tutorial How to upgrade Kubernetes version in a Rancher 2 managed cluster were followed. What happened?

rke ends with remote error on bad certificate

To enforce a specific (newer) Kubernetes version, the original 3-node rke configuration YAML can be adjusted to specify the Kubernetes version to use:

ck@linux:~/rancher$ cat RANCHER2_STAGE/3-node-rancher-stage.yml
nodes:
  - address: 10.10.153.15
    user: ansible
    role: [controlplane,etcd,worker]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 10.10.153.16
    user: ansible
    role: [controlplane,etcd,worker]
    ssh_key_path: ~/.ssh/id_rsa
  - address: 10.10.153.17
    user: ansible
    role: [controlplane,etcd,worker]
    ssh_key_path: ~/.ssh/id_rsa

kubernetes_version: "v1.15.12-rancher2-2"

services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h

Note: Please read the special notes in How to upgrade Kubernetes version in a Rancher 2 managed cluster on which Kubernetes version to use - it depends on the RKE version.
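Which Kubernetes versions a given rke binary supports can also be listed with the binary itself, as sketched below using the rke 1.1.2 binary from this article:

```shell
# List all Kubernetes versions this rke binary is able to deploy
# (the supported versions are compiled into each rke release):
./rke_linux-amd64-1.1.2 config --list-version --all
```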

With this minor change in the yaml config, rke up can be launched against the cluster and the cluster should upgrade the Kubernetes version (all other settings remain in place). In the following output rke 1.1.2 was used against this staging cluster:

ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 up --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2                   
INFO[0000] Initiating Kubernetes cluster                 
INFO[0000] [state] Possible legacy cluster detected, trying to upgrade  
INFO[0000] [reconcile] Rebuilding and updating local kube config
[...]
INFO[0056] Pre-pulling kubernetes images 
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.15], try #1  
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.17], try #1  
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.16], try #1

INFO[0079] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.17]  
INFO[0083] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.16]  
INFO[0088] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]  
INFO[0088] Kubernetes images pulled successfully         
INFO[0088] [etcd] Building up etcd plane..
[...]
INFO[0167] [etcd] Successfully started etcd plane.. Checking etcd cluster health  
WARN[0197] [etcd] host [10.10.153.15] failed to check etcd health: failed to get /health for host [10.10.153.15]: Get https://10.10.153.15:2379/health: remote error: tls: bad certificate
WARN[0223] [etcd] host [10.10.153.16] failed to check etcd health: failed to get /health for host [10.10.153.16]: Get https://10.10.153.16:2379/health: remote error: tls: bad certificate
WARN[0240] [etcd] host [10.10.153.17] failed to check etcd health: failed to get /health for host [10.10.153.17]: Get https://10.10.153.17:2379/health: remote error: tls: bad certificate
FATA[0240] [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.10.153.15,10.10.153.16,10.10.153.17] failed to report healthy. Check etcd container logs on each host for more information

The newer Kubernetes 1.15.12 images were successfully pulled, but the upgrade then failed on the etcd health check due to a bad certificate.
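Whether expired certificates are really the culprit can be verified on the nodes themselves before changing anything. RKE stores its generated certificates under /etc/kubernetes/ssl/ by default; the kube-etcd-*.pem file name pattern below is an assumption based on that convention:

```shell
# Show the expiry date of each etcd certificate on a node.
# RKE places its generated certificates in /etc/kubernetes/ssl/ by default;
# the file name pattern is an assumption based on RKE's naming convention.
for pem in /etc/kubernetes/ssl/kube-etcd-*.pem; do
  printf '%s: ' "$pem"
  openssl x509 -enddate -noout -in "$pem"
done
```

If the notAfter date lies in the past, the rotation described below is required.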

Rotate the Kubernetes certificates

Unfortunately the error message does not reveal the exact reason why the Kubernetes certificates are considered "bad". But the most plausible explanation is that the certificates expired. A similar problem was discovered and documented before in Rancher 2 Kubernetes certificates expired! How to rotate your expired certificates. rke can be used to rotate and renew the Kubernetes certificates:

ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 cert rotate --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2                   
INFO[0000] Initiating Kubernetes cluster                 
INFO[0000] Rotating Kubernetes cluster certificates      
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates  
INFO[0000] [certificates] Generating Kubernetes API server certificates  
INFO[0000] [certificates] Generating Kube Controller certificates  
INFO[0000] [certificates] Generating Kube Scheduler certificates  
INFO[0000] [certificates] Generating Kube Proxy certificates  
INFO[0001] [certificates] Generating Node certificate    
INFO[0001] [certificates] Generating admin certificates and kubeconfig  
INFO[0001] [certificates] Generating Kubernetes API server proxy client certificates  
INFO[0001] [certificates] Generating kube-etcd-10-10-153-15 certificate and key  
INFO[0001] [certificates] Generating kube-etcd-10-10-153-16 certificate and key  
INFO[0001] [certificates] Generating kube-etcd-10-10-153-17 certificate and key  
INFO[0002] Successfully Deployed state file at [RANCHER2_STAGE/3-node-rancher-stage.rkestate]
INFO[0002] Rebuilding Kubernetes cluster with rotated certificates  
INFO[0002] [dialer] Setup tunnel for host [10.10.153.15]  
INFO[0002] [dialer] Setup tunnel for host [10.10.153.16]  
INFO[0002] [dialer] Setup tunnel for host [10.10.153.17]  
INFO[0010] [certificates] Deploying kubernetes certificates to Cluster nodes
[...]
INFO[0058] [controlplane] Successfully restarted Controller Plane..  
INFO[0058] [worker] Restarting Worker Plane..            
INFO[0058] Restarting container [kubelet] on host [10.10.153.17], try #1  
INFO[0058] Restarting container [kubelet] on host [10.10.153.15], try #1  
INFO[0058] Restarting container [kubelet] on host [10.10.153.16], try #1  
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.17]  
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.17], try #1  
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.16]  
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.16], try #1  
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.15]  
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.15], try #1  
INFO[0066] [restart/kube-proxy] Successfully restarted container on host [10.10.153.17]  
INFO[0066] [restart/kube-proxy] Successfully restarted container on host [10.10.153.16]  
INFO[0073] [restart/kube-proxy] Successfully restarted container on host [10.10.153.15]  
INFO[0073] [worker] Successfully restarted Worker Plane..

The rke cert rotate command ran through successfully and the cluster remained up. After waiting a couple of minutes to make sure all services and pods had correctly restarted on the Kubernetes cluster, the Kubernetes upgrade can be launched again.
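A quick way to confirm the cluster has settled is to query it with kubectl, using the kubeconfig that rke writes next to the cluster YAML (the file name below follows rke's kube_config_<cluster yaml> naming convention and is an assumption for this setup):

```shell
# rke writes a kubeconfig next to the cluster yaml after a successful run.
export KUBECONFIG=RANCHER2_STAGE/kube_config_3-node-rancher-stage.yml

# All nodes should report Ready and no system pods should be crash-looping.
kubectl get nodes
kubectl get pods --all-namespaces
```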

Run the upgrade again

The big question is of course: Was the certificate rotation enough to fix the problem? Let's find out:

ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 up --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2                   
INFO[0000] Initiating Kubernetes cluster                 
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates  
INFO[0000] [certificates] Generating admin certificates and kubeconfig  
INFO[0000] Successfully Deployed state file at [RANCHER2_STAGE/3-node-rancher-stage.rkestate]  
INFO[0000] Building Kubernetes cluster                   
INFO[0000] [dialer] Setup tunnel for host [10.10.153.17]  
INFO[0000] [dialer] Setup tunnel for host [10.10.153.15]  
INFO[0000] [dialer] Setup tunnel for host [10.10.153.16] 
[...]
INFO[0027] Pre-pulling kubernetes images                 
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.16]  
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]  
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.17]  
INFO[0027] Kubernetes images pulled successfully         
INFO[0027] [etcd] Building up etcd plane..
[...]
INFO[0126] Checking if container [kubelet] is running on host [10.10.153.15], try #1  
INFO[0126] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]  
INFO[0126] Checking if container [old-kubelet] is running on host [10.10.153.15], try #1  
INFO[0126] Stopping container [kubelet] on host [10.10.153.15] with stopTimeoutDuration [5s], try #1  
INFO[0127] Waiting for [kubelet] container to exit on host [10.10.153.15]  
INFO[0127] Renaming container [kubelet] to [old-kubelet] on host [10.10.153.15], try #1  
INFO[0127] Starting container [kubelet] on host [10.10.153.15], try #1  
INFO[0127] [worker] Successfully updated [kubelet] container on host [10.10.153.15]  
INFO[0127] Removing container [old-kubelet] on host [10.10.153.15], try #1  
INFO[0128] [healthcheck] Start Healthcheck on service [kubelet] on host [10.10.153.15]  
INFO[0155] [healthcheck] service [kubelet] on host [10.10.153.15] is healthy  
[...]
INFO[0387] [sync] Syncing nodes Labels and Taints        
INFO[0387] [sync] Successfully synced nodes Labels and Taints  
INFO[0387] [network] Setting up network plugin: canal    
INFO[0387] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes  
INFO[0387] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes  
INFO[0387] [addons] Executing deploy job rke-network-plugin  
INFO[0402] [addons] Setting up coredns                   
INFO[0402] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes  
INFO[0402] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes  
INFO[0402] [addons] Executing deploy job rke-coredns-addon  
INFO[0418] [addons] CoreDNS deployed successfully        
INFO[0418] [dns] DNS provider coredns deployed successfully  
INFO[0418] [addons] Setting up Metrics Server            
INFO[0418] [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes  
INFO[0418] [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes  
INFO[0418] [addons] Executing deploy job rke-metrics-addon  
INFO[0438] [addons] Metrics Server deployed successfully  
INFO[0438] [ingress] Setting up nginx ingress controller  
INFO[0438] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes  
INFO[0438] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes  
INFO[0438] [addons] Executing deploy job rke-ingress-controller  
INFO[0448] [ingress] ingress controller nginx deployed successfully  
INFO[0448] [addons] Setting up user addons               
INFO[0448] [addons] no user addons defined               
INFO[0448] Finished building Kubernetes cluster successfully

This time, rke continued its job and upgraded the Kubernetes cluster. Instead of failing with a certificate error, rke deployed the new Kubernetes version and renamed the previous K8s related containers (e.g. old-kubelet). While the upgrade was running, the progress could also be followed in the user interface, where the node currently being upgraded shows up as cordoned.

Oh no, I can't access the user interface anymore!

Depending on the load balancing or reverse proxy setup, a redirect problem might happen after this Kubernetes version upgrade when trying to access the Rancher 2 user interface.

With this Kubernetes upgrade also came a newer Nginx ingress controller version, which stopped listening on plain HTTP and now only serves HTTPS. If the reverse proxy previously talked plain HTTP to the Rancher 2 nodes, it should now use HTTPS instead. Here is the relevant snippet for HAProxy:

#####################
# Rancher
#####################
backend rancher-out
  balance hdr(X-Forwarded-For)
  timeout server 1h
  option httpchk GET /healthz HTTP/1.1\r\nHost:\ rancher2-stage.example.com\r\nConnection:\ close
  server onl-ran01-s 10.10.153.15:443 check ssl verify none inter 1000 fall 1 rise 2
  server onl-ran02-s 10.10.153.16:443 check ssl verify none inter 1000 fall 1 rise 2
  server onl-ran03-s 10.10.153.17:443 check ssl verify none inter 1000 fall 1 rise 2

Because the Kubernetes certificates generated by Rancher 2 (the default setting in rke) are self-signed, the option "ssl verify none" should be added to HAProxy's backend server configuration.
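Whether a node's ingress really answers on HTTPS again can be checked with curl before reloading HAProxy. The host name and IP below are taken from the HAProxy configuration above; -k is needed because the certificate is self-signed:

```shell
# Query the /healthz endpoint that the HAProxy httpchk uses, over HTTPS.
# -k skips certificate verification (self-signed cert), and the Host header
# makes the ingress route the request to the Rancher backend.
curl -sk -o /dev/null -w '%{http_code}\n' \
  -H 'Host: rancher2-stage.example.com' \
  https://10.10.153.15/healthz
```

A healthy node should answer with a 200 status code, matching what the HAProxy health check expects.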

Looking for a managed dedicated Kubernetes environment?

If you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location Switzerland, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.

