We have a failed control plane node in our highly available multi-master Kubernetes cluster that we need to replace.
Before We Begin
We are using our Kubernetes homelab in this article.
We have a failed control plane node, srv31, that needs to be removed from the cluster and replaced with a new node.
Pre-check validation:
$ kubectl get no
NAME    STATUS     ROLES           AGE    VERSION
srv31   NotReady   control-plane   375d   v1.26.4
srv32   Ready      control-plane   327d   v1.26.4
srv33   Ready      control-plane   456d   v1.26.4
srv34   Ready      <none>          456d   v1.26.4
srv35   Ready      <none>          327d   v1.26.4
srv36   Ready      <none>          456d   v1.26.4
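On a larger cluster it can help to filter the node list down to the unhealthy entries. The following is a minimal sketch, not part of the original procedure: `not_ready_nodes` is a hypothetical helper name, and it assumes the default column layout of `kubectl get nodes`, where STATUS is the second column.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of nodes whose STATUS column is not
# exactly "Ready". Reads `kubectl get nodes` style output on stdin and skips
# the header line. Note that a cordoned node ("Ready,SchedulingDisabled")
# is also reported by this simple comparison.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Usage on a live cluster (assumes kubectl is already configured):
#   kubectl get nodes | not_ready_nodes
```

Run against the output above, this should print only srv31.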
We are also going to use the ETCD client (etcdctl). If you don't have it installed, download it using the commands below:
$ ETCD_VER=v3.5.9
$ GITHUB_URL=https://github.com/etcd-io/etcd/releases/download
$ DOWNLOAD_URL=${GITHUB_URL}
$ mkdir -p /tmp/etcd-download-test
$ curl -fsSL ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz -o /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
$ tar xzvf /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz -C /tmp/etcd-download-test --strip-components=1
$ rm -f /tmp/etcd-${ETCD_VER}-linux-amd64.tar.gz
$ sudo cp /tmp/etcd-download-test/etcdctl /usr/local/bin/
$ etcdctl version
Remove an Unhealthy ETCD Member
Check ETCD member status on a working control plane:
$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
df4ce5503d32478a, started, srv31, https://10.11.1.31:2380, https://10.11.1.31:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false
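The same connection flags are repeated for every etcdctl call in this article. etcdctl also accepts each flag as an environment variable with an `ETCDCTL_` prefix, so one option is to export them once per shell session. This is a sketch that reuses the endpoint and certificate paths from the command above; adjust them if your layout differs.

```shell
# Export the etcdctl connection settings once instead of passing flags each time.
# Values are copied from the command above (kubeadm's default etcd PKI paths).
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS=127.0.0.1:2379
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

# With these set, the short form should behave like the full command:
#   etcdctl member list
```

Do not combine an exported variable with the matching command-line flag in the same call, since etcdctl treats that as a conflict. If you run etcdctl through sudo to read the key file, the variables may need to be preserved explicitly, for example with `sudo -E`.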
Remove the srv31 member from ETCD using its member ID:
$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member remove df4ce5503d32478a
Member df4ce5503d32478a removed from cluster 53e3f96426ba03f3
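Copying a hexadecimal member ID by hand is easy to get wrong, and removing the wrong member from a three-node cluster would break quorum. As a hedged sketch, the ID can be looked up by member name instead. `member_id_for` is a hypothetical helper, and it assumes the default comma-separated output of `member list` shown earlier, with the name in the third field.

```shell
# Hypothetical helper: print the ID of the etcd member with the given name.
# Reads `etcdctl member list` output (default simple format) on stdin:
#   ID, STATUS, NAME, PEER-URLS, CLIENT-URLS, IS-LEARNER
member_id_for() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Usage on a working control plane (connection flags omitted for brevity):
#   etcdctl member list | member_id_for srv31
```

For the member list in this article, looking up srv31 should return df4ce5503d32478a, the ID passed to `member remove` above.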
Check ETCD member status again and make sure that srv31 no longer appears in the member list:
$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false
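With srv31 removed, ETCD is running with two members. etcd needs a majority of members to agree on every write, so it is worth seeing what that means for fault tolerance while the replacement is being built. The arithmetic can be sketched as follows; the function names are illustrative only.

```shell
# Members that must agree for a cluster of size $1 (a strict majority).
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# Members that can fail while a cluster of size $1 still keeps quorum.
failure_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_size 3        # prints 2
failure_tolerance 3  # prints 1
quorum_size 2        # prints 2
failure_tolerance 2  # prints 0
```

A two-member cluster tolerates no further failures, which is a good reason to join the replacement control plane node soon after removing the failed one.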
Replace Failed Control Plane
Drain and delete the failed control plane srv31:
$ kubectl drain srv31
$ kubectl delete node srv31
Now we are ready to add a new control plane node.
Use your deployment pipeline (Ansible/Packer/Terraform/etc) to replace the broken control plane server with a new one.
To create a new certificate key, run the following command on a working control plane (either srv32 or srv33):
$ sudo kubeadm init phase upload-certs --upload-certs
ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
Print the full kubeadm join command needed to join the cluster as a control plane (run this on a working control plane). Use the certificate key from above.
$ sudo kubeadm token create --print-join-command --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
kubeadm join kubelb.hl.test:6443 --token sqsh63.jw2p7kq6cy0cm7u5 --discovery-token-ca-cert-hash sha256:e98d5740c0ff6d5fd567cba755e27ea57fcc06fd694436a90ad632813351aae1 --control-plane --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
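A join command that lacks `--control-plane` and `--certificate-key` would add the new server as a worker instead of a control plane. Before pasting the command onto the new node, a quick sanity check along these lines may help. It is a sketch only: `is_control_plane_join` is a hypothetical helper that inspects the command text and does not contact the cluster.

```shell
# Hypothetical helper: succeed only if the given join command contains both
# flags that kubeadm needs to join a node as a control plane.
is_control_plane_join() {
  case "$1" in
    *--control-plane*) ;;
    *) return 1 ;;
  esac
  case "$1" in
    *--certificate-key*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumes JOIN_CMD holds the printed join command):
#   is_control_plane_join "$JOIN_CMD" && echo "control plane join" || echo "worker join"
```

Keep in mind that the uploaded certificates are short-lived (kubeadm documents a two-hour lifetime for them) and that bootstrap tokens expire after 24 hours by default, so the join should be run soon after generating the command.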
Join the new control plane srv31 to the cluster:
$ sudo kubeadm join kubelb.hl.test:6443 \
  --token sqsh63.jw2p7kq6cy0cm7u5 \
  --discovery-token-ca-cert-hash sha256:e98d5740c0ff6d5fd567cba755e27ea57fcc06fd694436a90ad632813351aae1 \
  --control-plane \
  --certificate-key ce34e277ab5b795e8b559d1aa8b2d243fd284acb193fb490b26ee9a695d0ccfe
Verify:
$ kubectl get no
NAME    STATUS   ROLES           AGE    VERSION
srv31   Ready    control-plane   84s    v1.26.4
srv32   Ready    control-plane   327d   v1.26.4
srv33   Ready    control-plane   456d   v1.26.4
srv34   Ready    <none>          456d   v1.26.4
srv35   Ready    <none>          327d   v1.26.4
srv36   Ready    <none>          456d   v1.26.4
Check ETCD membership:
$ ETCDCTL_API=3 etcdctl \
  --endpoints 127.0.0.1:2379 \
  --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  member list
c36952e9f5bf4f49, started, srv33, https://10.11.1.33:2380, https://10.11.1.33:2379, false
c44657d8f6e7dea5, started, srv31, https://10.11.1.31:2380, https://10.11.1.31:2379, false
e279a8288f4be237, started, srv32, https://10.11.1.32:2380, https://10.11.1.32:2379, false