Kubernetes etcd near-disaster


This post is mostly a warning to my future self, but hopefully it'll spare somebody else the same problem. I've been running a small Kubernetes cluster for a couple of years now, mostly as an experiment and to keep my skills current with new tooling. Part of that has been making sure I use reasonable tooling and automate as much as I can.

So far, I've been pretty happy with the combination of RKE for the cluster itself, Ansible for node automation, and Mend-based tooling for dependency updates.

Keeping Kubernetes up to date

I've been in the habit of keeping my k8s cluster up to date for some time, usually by redeploying the RKE cluster whenever there's a notable upgrade and keeping major dependencies current with Mend-based automation.
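
For a notable upgrade, this mostly amounts to bumping the kubernetes_version field in cluster.yml and letting RKE reconcile. Roughly (the exact flags depend on your rke version):

rke config --list-version --all    # list the Kubernetes versions this rke build supports
rke up --config cluster.yml        # reconcile the cluster after editing kubernetes_version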

This has worked well for the Kubernetes infrastructure itself, but since the "machines" that run my cluster are all VMs, I also want to replace them occasionally, both to update the underlying OS and to prove that I can rebuild the environment.

For worker nodes in the cluster, this has generally worked well. I have a basic Ansible-automated process:

  1. Set the node to unschedulable
  2. Replace the node
  3. Run rke to bring the node back into the cluster

This has worked fine, probably because the worker nodes, once quiesced, are not unique in any way.
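
For the curious, the worker replacement boils down to something like the following; the node name and drain flags are illustrative rather than lifted from my playbook:

kubectl cordon worker-1                                             # 1. mark the node unschedulable
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data   #    evict its workloads
# 2. rebuild the VM out of band, then:
rke up --config cluster.yml                                         # 3. bring the fresh node back into the cluster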

Updating my control plane nodes

The problem came when I followed the same procedure for my control plane nodes. I took the first node offline, built a new node, and ran rke to bring it back online. Everything seemed to be functioning well, so I went on to the next node, and everything came to a halt.

It took me a few minutes to figure out what I'd done, but the key is that, unlike the worker nodes, the nodes running etcd are not interchangeable. In particular, each etcd member carries a unique ID that is recorded in the cluster's membership data.

Running docker exec etcd etcdctl --write-out=table member list, you can see the ID on the left:

+------------------+---------+-------------+--------------------------+--------------------------+------------+
|        ID        | STATUS  |  NAME       | PEER ADDRS               | CLIENT ADDRS             | IS LEARNER |
+------------------+---------+-------------+--------------------------+--------------------------+------------+
| 51e442e065ed8da9 | started | etcd-node-1 | https://etcd-node-1:2380 | https://etcd-node-1:2379 |      false |
| 7c17ab818595f4fe | started | etcd-node-0 | https://etcd-node-0:2380 | https://etcd-node-0:2379 |      false |
| d085086f6d909371 | started | etcd-node-2 | https://etcd-node-2:2380 | https://etcd-node-2:2379 |      false |
+------------------+---------+-------------+--------------------------+--------------------------+------------+

Everything was fine when I deleted the first node: because the environment was configured for HA, 2 of the 3 etcd members were still available, which is enough for a voting majority. When the node was re-provisioned, it was registered as a new member rather than replacing the old entry, so the cluster now had 4 registered members with 3 of them reachable, which is still quorate. But when I took the second node down for replacement, only 2 of the 4 registered members were available, and etcd could no longer reach quorum, so the entire cluster failed.
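
For reference, etcd needs a simple majority of its registered members, floor(n/2) + 1, to be reachable:

members | quorum | failures tolerated
   3    |   2    |         1
   4    |   3    |         1
   5    |   3    |         2

A 4-member cluster therefore tolerates no more failures than a 3-member one, which is exactly the trap I walked into.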

Once I'd confirmed that the original node's stale entry was still in the member list, the solution was relatively simple: I deleted the errant member using:

docker exec etcd etcdctl member remove [id]
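
If it isn't obvious which member entry is stale, checking endpoint health across the cluster helps; depending on your etcdctl version this looks roughly like:

docker exec etcd etcdctl --write-out=table endpoint health --cluster

The dead peer shows up as unhealthy; matching its URL against the member list gives the ID to remove.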

For future work on the etcd nodes, I've made a few modifications to my MOP and my Ansible scripting (a combined pre-flight check is sketched after the list):

  1. Make sure to check the status of the etcd membership before replacing any node

    docker exec etcd etcdctl --write-out=table member list

  2. Make an explicit snapshot of the cluster before replacing the nodes:

    rke etcd snapshot-save --name extra_snapshot.db --config cluster.yml

  3. Remove the old etcd member as soon as possible so the stale entry doesn't count against quorum:

    docker exec etcd etcdctl member remove [id]
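
A pre-flight check along these lines makes the first two steps harder to skip; the expected member count and snapshot name here are illustrative, not lifted from my playbook:

EXPECTED=3                                              # a healthy 3-member etcd cluster
COUNT=$(docker exec etcd etcdctl member list | wc -l)   # one line per registered member
if [ "$COUNT" -ne "$EXPECTED" ]; then
    echo "unexpected etcd member count: $COUNT (expected $EXPECTED); aborting" >&2
    exit 1
fi
rke etcd snapshot-save --name "pre-replace-$(date +%s)" --config cluster.yml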

Recovery from failure

To get the cluster back into working shape once I figured out what had gone wrong, I did a restore and rebuild from the most recent etcd backup. (I also took a snapshot of the botched etcd state before restoring.)

rke etcd snapshot-save --name disaster.db --config cluster.yml
rke etcd snapshot-restore --name extra_snapshot.db --config cluster.yml
rke up
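
Afterwards, sanity checks along these lines are worth running to confirm the restore took; nothing here is RKE-specific:

docker exec etcd etcdctl --write-out=table member list   # three members, no stale IDs
kubectl get nodes                                        # every node reporting Ready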

Note that etcd's own documentation includes reasonable disaster recovery guidance.