Migrating HA Kubernetes Cluster from CentOS 7 to Rocky Linux 8

Personally, I consider Rocky Linux 8 the most anticipated software release of the year!

The Upgrade Plan

We are going to upgrade our Kubernetes homelab nodes from CentOS 7 to Rocky Linux 8.

We have a cluster of six nodes, three control planes and three worker nodes, all of which are KVM guests running CentOS 7.

We will upgrade the control plane nodes first, one at a time, using PXE boot and Ansible playbooks, and then upgrade the worker nodes, also one at a time, using the same approach.

This is a lengthy process, but it does not require rebuilding the cluster from scratch. Note that we will not be upgrading the Kubernetes version or software components such as Docker.

Software versions before the upgrade:

  1. CentOS 7
  2. Docker 20.10
  3. Kubernetes 1.21.1
  4. Calico 3.19
  5. Istio 1.9

Software versions after the upgrade:

  1. Rocky Linux 8
  2. Docker 20.10
  3. Kubernetes 1.21.1
  4. Calico 3.19
  5. Istio 1.9

SELinux is set to enforcing mode.
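
A quick way to confirm this on each node, before and after the rebuild, is the standard getenforce utility, which should return Enforcing:

$ getenforce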

Configuration Files

For PXE boot setup, see here.

For Rocky Linux 8 kickstart file, see GitHub repository here.

For Ansible playbooks, see GitHub repository here.

Caveats

RHEL 8 ships with firewalld, which uses nftables by default. Depending on the CNI, you may or may not have a hard time making Kubernetes pod-to-pod communication work with nftables.

Calico's IptablesBackend setting specifies which iptables backend will be used; the default is legacy. We will therefore remove firewalld from all Rocky Linux 8 nodes, which in turn removes the nftables package as a dependency. This configuration is not supported and has not been tested outside of the homelab environment.
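
For reference, the backend is configurable in Calico's Felix configuration, so an alternative (not tested in this homelab) would be to keep firewalld and switch Felix to the nftables backend. A rough sketch with calicoctl, assuming calicoctl is configured against the cluster and the default FelixConfiguration resource is in use:

$ calicoctl patch felixconfiguration default \
  --patch '{"spec":{"iptablesBackend":"NFT"}}'

In this setup, removing firewalld and staying on the legacy backend was the simpler option that worked.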

Disclaimer

THIS IS NOT SUPPORTED.

Cluster Information

$ kubectl get nodes -o wide
NAME    STATUS   ROLES                  AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
srv31   Ready    control-plane,master   123d   v1.21.1   10.11.1.31    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
srv32   Ready    control-plane,master   123d   v1.21.1   10.11.1.32    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
srv33   Ready    control-plane,master   123d   v1.21.1   10.11.1.33    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
srv34   Ready    <none>                 123d   v1.21.1   10.11.1.34    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
srv35   Ready    <none>                 95d    v1.21.1   10.11.1.35    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7
srv36   Ready    <none>                 95d    v1.21.1   10.11.1.36    <none>        CentOS Linux 7 (Core)   3.10.0-1160.el7.x86_64   docker://20.10.7

Upgrade the First Control Plane Node

We will start with srv31.

Drain and Delete Control Plane from Kubernetes Cluster

Drain and delete the control plane from the cluster:

$ kubectl drain srv31
$ kubectl delete node srv31
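
If the drain is blocked by DaemonSet-managed pods (calico-node and kube-proxy run as DaemonSets), add the same flags used later for the worker nodes:

$ kubectl drain srv31 --delete-emptydir-data --ignore-daemonsets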

Make sure the node is no longer in the Kubernetes cluster:

$ kubectl get nodes
NAME    STATUS   ROLES                  AGE    VERSION
srv32   Ready    control-plane,master   123d   v1.21.1
srv33   Ready    control-plane,master   123d   v1.21.1
srv34   Ready    <none>                 123d   v1.21.1
srv35   Ready    <none>                 95d    v1.21.1
srv36   Ready    <none>                 95d    v1.21.1

The cluster will remain operational as long as the other two control planes are online, since etcd still has quorum with two of its three members.

Delete Control Plane from Etcd Cluster

Etcd will still have a record of all three control plane nodes, so we have to remove the old member from the Etcd cluster too. Note that only the srv32 and srv33 etcd pods are listed below, because the srv31 node object has already been deleted.

$ kubectl get pods -n kube-system -l component=etcd -o wide
NAME         READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
etcd-srv32   1/1     Running   2          9d    10.11.1.32   srv32   <none>           <none>
etcd-srv33   1/1     Running   2          9d    10.11.1.33   srv33   <none>           <none>

Query the cluster for the Etcd members:

$ kubectl exec etcd-srv32 \
  -n kube-system -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  member list
24b959b3d5fb9579, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false
4a9dc4303465abc8, started, srv31, https://10.11.1.31:2380, https://10.11.1.31:2379, false
d60055f923c49949, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false

Delete the member for control plane srv31:

$ kubectl exec etcd-srv32 \
  -n kube-system -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  member remove 4a9dc4303465abc8
Member 4a9dc4303465abc8 removed from cluster 53e3f96426ba03f3

Delete Control Plane KVM Guest

SSH into the hypervisor where the control plane server is running, and stop the VM:

$ ssh [email protected] "virsh destroy srv31-master"

Delete the current KVM snapshot (it’s the one from the previous Kubernetes upgrade):

$ ssh [email protected] "virsh snapshot-delete srv31-master --current"

Delete the control plane server image, including its storage:

$ ssh [email protected] "virsh undefine srv31-master --remove-all-storage"
Domain srv31-master has been undefined
Volume 'vda'(/var/lib/libvirt/images/srv31.qcow2) removed.

Create a Rocky Linux KVM Guest

Provision a new control plane KVM guest using PXE boot:

$ virt-install \
  --connect qemu+ssh://[email protected]/system \
  --name srv31-master \
  --network bridge=br0,model=virtio,mac=C0:FF:EE:D0:5E:31 \
  --disk path=/var/lib/libvirt/images/srv31.qcow2,size=16 \
  --pxe \
  --ram 4096 \
  --vcpus 2 \
  --os-type linux \
  --os-variant centos7.0 \
  --sound none \
  --rng /dev/urandom \
  --virt-type kvm \
  --wait 0

Once the server is up, set up passwordless root authentication and run the Ansible playbook to configure the Kubernetes homelab environment:

$ cd kubernetes-homelab/ansible
$ ssh-copy-id -f -i ./roles/hl.users/files/id_rsa_root.pub [email protected]
$ ansible-playbook playbooks/main-k8s-hosts.yml
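
If the playbook targets the whole inventory, the run can be limited to the node that was just rebuilt; the exact host pattern depends on how the inventory names it, with srv31 used here as an assumption:

$ ansible-playbook playbooks/main-k8s-hosts.yml --limit srv31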

Prepare Kubernetes Cluster for Control Plane Node to Join

SSH into a working control plane node, srv32, and re-upload certificates:

$ ssh [email protected] "kubeadm init phase upload-certs --upload-certs"
[upload-certs] Storing the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[upload-certs] Using certificate key:
7118834f8c6ae140c574e44a495fe51705c869a91508e8152871ca26a500a440

Print the join command on the same control plane node:

$ ssh [email protected] "kubeadm token create --print-join-command"
kubeadm join kubelb.hl.test:6443 --token 1spj1c.zzzeydqo3yhvvaoy --discovery-token-ca-cert-hash sha256:f2e8bdc45d591d475c84a7cf69d56ba056ba034febe1561e7f77641d869ab0c5

SSH into the newly created control plane srv31 and join the Kubernetes cluster:

$ ssh [email protected] \
  "kubeadm join kubelb.hl.test:6443 --token 1spj1c.zzzeydqo3yhvvaoy \
  --discovery-token-ca-cert-hash sha256:f2e8bdc45d591d475c84a7cf69d56ba056ba034febe1561e7f77641d869ab0c5 \
  --control-plane \
  --certificate-key 7118834f8c6ae140c574e44a495fe51705c869a91508e8152871ca26a500a440"

Restart kubelet:

$ ssh [email protected] "systemctl restart kubelet"

Label the node:

$ kubectl label node srv31 node-role.kubernetes.io/control-plane=
$ kubectl label node srv31 node-role.kubernetes.io/master=
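
To verify that the labels were applied (an optional check):

$ kubectl get node srv31 --show-labels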

Check cluster status:

$ kubectl get nodes -o wide
NAME    STATUS   ROLES                  AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                CONTAINER-RUNTIME
srv31   Ready    control-plane,master   14m    v1.21.1   10.11.1.31    <none>        Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv32   Ready    control-plane,master   123d   v1.21.1   10.11.1.32    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv33   Ready    control-plane,master   123d   v1.21.1   10.11.1.33    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv34   Ready    <none>                 123d   v1.21.1   10.11.1.34    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv35   Ready    <none>                 95d    v1.21.1   10.11.1.35    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv36   Ready    <none>                 95d    v1.21.1   10.11.1.36    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7

We have our very first control plane running on Rocky Linux 8!

Repeat the process for the other two control planes, srv32 and srv33.

Do not proceed further until you have upgraded all control planes:

$ kubectl get nodes -o wide
NAME    STATUS   ROLES                  AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                CONTAINER-RUNTIME
srv31   Ready    control-plane,master   167m   v1.21.1   10.11.1.31    <none>       Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv32   Ready    control-plane,master   32m    v1.21.1   10.11.1.32    <none>       Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv33   Ready    control-plane,master   7m42s  v1.21.1   10.11.1.33    <none>       Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv34   Ready    <none>                 124d   v1.21.1   10.11.1.34    <none>       CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv35   Ready    <none>                 95d    v1.21.1   10.11.1.35    <none>       CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv36   Ready    <none>                 95d    v1.21.1   10.11.1.36    <none>       CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
$ kubectl -n kube-system get pods -o wide
NAME                                       READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
calico-kube-controllers-7f4f5bf95d-42n7z   1/1     Running   0          24m     192.168.137.36   srv35   <none>           <none>
calico-node-4x79f                          1/1     Running   0          164m    10.11.1.31       srv31   <none>           <none>
calico-node-54c25                          1/1     Running   0          29m     10.11.1.32       srv32   <none>           <none>
calico-node-7fmzb                          1/1     Running   1          9d      10.11.1.36       srv36   <none>           <none>
calico-node-hvh28                          1/1     Running   0          4m39s   10.11.1.33       srv33   <none>           <none>
calico-node-p5vkt                          1/1     Running   1          9d      10.11.1.35       srv35   <none>           <none>
calico-node-stfm6                          1/1     Running   1          9d      10.11.1.34       srv34   <none>           <none>
coredns-85d9df8444-897jj                   1/1     Running   0          110m    192.168.137.34   srv35   <none>           <none>
coredns-85d9df8444-stn4d                   1/1     Running   0          24m     192.168.137.37   srv35   <none>           <none>
etcd-srv31                                 1/1     Running   0          157m    10.11.1.31       srv31   <none>           <none>
etcd-srv32                                 1/1     Running   0          26m     10.11.1.32       srv32   <none>           <none>
etcd-srv33                                 1/1     Running   0          4m36s   10.11.1.33       srv33   <none>           <none>
kube-apiserver-srv31                       1/1     Running   6          164m    10.11.1.31       srv31   <none>           <none>
kube-apiserver-srv32                       1/1     Running   4          29m     10.11.1.32       srv32   <none>           <none>
kube-apiserver-srv33                       1/1     Running   0          4m38s   10.11.1.33       srv33   <none>           <none>
kube-controller-manager-srv31              1/1     Running   0          164m    10.11.1.31       srv31   <none>           <none>
kube-controller-manager-srv32              1/1     Running   0          29m     10.11.1.32       srv32   <none>           <none>
kube-controller-manager-srv33              1/1     Running   0          4m38s   10.11.1.33       srv33   <none>           <none>
kube-proxy-5d25q                           1/1     Running   0          4m39s   10.11.1.33       srv33   <none>           <none>
kube-proxy-bpbrc                           1/1     Running   0          29m     10.11.1.32       srv32   <none>           <none>
kube-proxy-ltssd                           1/1     Running   1          9d      10.11.1.36       srv36   <none>           <none>
kube-proxy-rqmk6                           1/1     Running   0          164m    10.11.1.31       srv31   <none>           <none>
kube-proxy-z9wg2                           1/1     Running   2          9d      10.11.1.35       srv35   <none>           <none>
kube-proxy-zkj8c                           1/1     Running   1          9d      10.11.1.34       srv34   <none>           <none>
kube-scheduler-srv31                       1/1     Running   0          164m    10.11.1.31       srv31   <none>           <none>
kube-scheduler-srv32                       1/1     Running   0          29m     10.11.1.32       srv32   <none>           <none>
kube-scheduler-srv33                       1/1     Running   0          4m38s   10.11.1.33       srv33   <none>           <none>
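
Before moving on to the workers, it is also worth confirming that the rebuilt etcd cluster is healthy. A quick check from any of the etcd pods, reusing the certificate paths from the earlier etcdctl commands:

$ kubectl exec etcd-srv31 \
  -n kube-system -- etcdctl \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/peer.crt \
  --key /etc/kubernetes/pki/etcd/peer.key \
  endpoint health --cluster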

Upgrade Worker Nodes

We will start with srv34.

Drain and Delete Worker Node from Kubernetes Cluster

$ kubectl drain srv34 --delete-emptydir-data --ignore-daemonsets
$ kubectl delete node srv34

Make sure the node is no longer in the Kubernetes cluster:

$ kubectl get nodes
NAME    STATUS   ROLES                  AGE   VERSION
srv31   Ready    control-plane,master   13h   v1.21.1
srv32   Ready    control-plane,master   11h   v1.21.1
srv33   Ready    control-plane,master   11h   v1.21.1
srv35   Ready    <none>                 95d   v1.21.1
srv36   Ready    <none>                 95d   v1.21.1

Stop the server:

$ ssh [email protected] "virsh destroy srv34-node"
Domain srv34-node destroyed

Delete the current snapshot:

$ ssh [email protected] "virsh snapshot-delete srv34-node --current"

Delete the server, including its storage:

$ ssh [email protected] "virsh undefine srv34-node --remove-all-storage"
Domain srv34-node has been undefined
Volume 'vda'(/var/lib/libvirt/images/srv34.qcow2) removed.

Create a Rocky Linux KVM Guest

Provision a new KVM guest using PXE boot:

$ virt-install \
  --connect qemu+ssh://[email protected]/system \
  --name srv34-node \
  --network bridge=br0,model=virtio,mac=C0:FF:EE:D0:5E:34 \
  --disk path=/var/lib/libvirt/images/srv34.qcow2,size=16 \
  --pxe \
  --ram 8192 \
  --vcpus 2 \
  --os-type linux \
  --os-variant centos7.0 \
  --sound none \
  --rng /dev/urandom \
  --virt-type kvm \
  --wait 0

Once the server is up, set up passwordless root authentication and run the Ansible playbook to configure the Kubernetes homelab environment:

$ cd kubernetes-homelab/ansible
$ ssh-copy-id -f -i ./roles/hl.users/files/id_rsa_root.pub [email protected]
$ ansible-playbook playbooks/main-k8s-hosts.yml

SSH into the newly created worker node srv34 and join the Kubernetes cluster:

$ ssh [email protected] \
  "kubeadm join kubelb.hl.test:6443 --token 1spj1c.zzzeydqo3yhvvaoy \
  --discovery-token-ca-cert-hash sha256:f2e8bdc45d591d475c84a7cf69d56ba056ba034febe1561e7f77641d869ab0c5"

Restart kubelet:

$ ssh [email protected] "systemctl restart kubelet"

Check cluster status:

$ kubectl get nodes -o wide
NAME    STATUS   ROLES                  AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION                CONTAINER-RUNTIME
srv31   Ready    control-plane,master   14h   v1.21.1   10.11.1.31    <none>        Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv32   Ready    control-plane,master   12h   v1.21.1   10.11.1.32    <none>        Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv33   Ready    control-plane,master   12h   v1.21.1   10.11.1.33    <none>        Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv34   Ready    <none>                 20m   v1.21.1   10.11.1.34    <none>        Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64   docker://20.10.7
srv35   Ready    <none>                 95d   v1.21.1   10.11.1.35    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7
srv36   Ready    <none>                 95d   v1.21.1   10.11.1.36    <none>        CentOS Linux 7 (Core)              3.10.0-1160.el7.x86_64        docker://20.10.7

Repeat the process for the other two worker nodes, srv35 and srv36.

The end result should be all nodes running Rocky Linux 8:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,OS-IMAGE:.status.nodeInfo.osImage,KERNEL:.status.nodeInfo.kernelVersion
NAME    VERSION   OS-IMAGE                           KERNEL
srv31   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
srv32   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
srv33   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
srv34   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
srv35   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
srv36   v1.21.1   Rocky Linux 8.4 (Green Obsidian)   4.18.0-305.3.1.el8_4.x86_64
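
As a final sanity check, the following should return no pods stuck in a non-Running phase after the rebuilds (pods support field selectors on status.phase; completed Jobs, if any, would also be listed):

$ kubectl get pods --all-namespaces --field-selector=status.phase!=Running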

4 thoughts on “Migrating HA Kubernetes Cluster from CentOS 7 to Rocky Linux 8”

  1. Phew! Seems like a lot of work!
    So, for your home lab you decided to stick with IPTables instead of adopting nftables based on personal preference, did I read it right? If pod-to-pod communication is disrupted by nftables but production environments require it, how do you think the enterprise addresses that new issue?

    • No strong preference between iptables and nftables. Rocky 8 uses nftables by default; I tried it with the Calico CNI and it didn’t work for me. I’ll admit I didn’t spend much time debugging it, as it’s only a homelab environment. I then tried it with iptables and Calico worked without issues, so I switched to iptables (iptables or firewalld/nftables, I don’t mind as long as it works, as it’s managed by Ansible anyway). I don’t think Kubernetes is supported on RHEL 8 (I may be wrong); for production environments, I’d suggest RHEL 7.

      In terms of large enterprises and how they address this issue, well, the way I see it, the problem isn’t Kubernetes, it’s the CNI that has to work with whatever software vendors decide to use as a backend for firewall. If you use a CNI that does not support nftables, then you either have to stick to iptables, or move to a different CNI.

  2. I deployed Kubernetes with the Calico CNI on RHEL 8 on AWS, without firewalld/nftables, around September 2022. It’s supported by Red Hat.

    • Hi Oli, thanks, I wasn’t aware. Have you got any link to Red Hat’s documentation where it says it’s a supported deployment?
