Recover from an etcd failure

This document explains how to recover from an etcd failure. Etcd failures include data corruption, volume deletion and hardware malfunctions, any of which can cause a wide range of issues on Kubernetes master nodes. By following the instructions in this document you should be able to:

  • Identify a failed etcd instance.
  • Remove the failed etcd instance from the cluster.
  • Back up the etcd EBS volumes and take a local data snapshot.
  • Add a new master with a new etcd instance.
  • Validate the cluster following etcd replacement.

Pre-requisites

Before you begin, check the following pre-requisites:

  • You are targeting the right cluster.
$ kubectl config current-context

Expected outcome example: live-1.cloud-platform.service.justice.gov.uk

  • You are using the right AWS_PROFILE for this cluster.
$ export AWS_PROFILE=moj-cp
  • You have exported KOPS_STATE_STORE
$ export KOPS_STATE_STORE=s3://cloud-platform-kops-state

Identify the failed master node

Run kops validate cluster to identify the failed master node.

kops validate cluster <cluster_name>.cloud-platform.service.justice.gov.uk

When a master node has failed, you will see validation errors like this:

VALIDATION ERRORS
KIND    NAME            MESSAGE
Machine i-046ed8d94d65719a6 machine "i-046ed8d94d65719a6" has not yet joined cluster

This shows that one of the machines has not joined the cluster.

Make sure it is a master node by looking for the instance ID in the Amazon EC2 console, on the Instances page. The instance name should be something like:

master-<availability_zone>.masters.<cluster_name>.cloud-platform.service.justice.gov.uk
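
If you prefer the command line to the console, you can look up the instance's Name tag with the AWS CLI (an illustrative example using the instance ID from the validation output above; substitute your own, and make sure AWS_PROFILE is set as described in the pre-requisites):

aws ec2 describe-instances \
  --instance-ids i-046ed8d94d65719a6 \
  --query 'Reservations[].Instances[].Tags[?Key==`Name`].Value' \
  --output text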

Remove the unhealthy etcd member

As per this guidance, remove the unhealthy etcd member (i.e. the failed master node) first.

Open your terminal and execute the following command to exec onto an etcd pod:

CONTAINER=$(kubectl get pods -n kube-system | grep etcd-manager-main | head -n 1 | awk '{print $1}')
kubectl exec -it -n kube-system $CONTAINER -- bash

Once inside the container, find the directory of the running etcd binary (the path includes the etcd version):

DIRNAME=$(ps -ef | grep --color=never /opt/etcd | head -n 1 | awk '{print $8}' | xargs dirname)
echo $DIRNAME
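
The output should be the directory containing the etcd and etcdctl binaries, for example something like /opt/etcd-v3.4.13 (the exact path and etcd version will vary between clusters).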

Run etcdctl like this to get the member list of the etcd-main cluster. To get the member list of the etcd-events cluster, change the endpoint port from 4001 to 4002 and run the command again, as shown in the example further below.

ETCDCTL_API=3 $DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4001 \
member list

You will see a list of members like this:

26153d5a331dc13d, started, etcd-a, https://etcd-a.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:2380, https://etcd-a.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:4001
27aaafc750f8d2f7, started, etcd-c, https://etcd-c.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:2380, https://etcd-c.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:4001
8d4a23328d324c94, started, etcd-b, https://etcd-b.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:2380, https://etcd-b.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:4001

The first value in each line is the ID of the member. We will need this to remove the member from the etcd cluster.
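
For example, to get the member list of the etcd-events cluster, the same command is pointed at port 4002:

ETCDCTL_API=3 $DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4002 \
  member list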

Identify the unhealthy member (the old master node) and run the command below to remove the member.

NOTE: Kops maintains a separate etcd cluster for events, so we need to remove the old node from the etcd-events cluster as well. Change the endpoint port from 4001 to 4002 and run the command again to remove the old member from the etcd-events cluster (an example is shown after the output below).

ETCDCTL_API=3 $DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4001 \
member remove <member_id>

You will get a message:

Member 26153d5a331dc13d removed from cluster bdfd883e2bfa8594

Run the member list again and you will see the member has been removed:

27aaafc750f8d2f7, started, etcd-c, https://etcd-c.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:2380, https://etcd-c.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:4001
8d4a23328d324c94, started, etcd-b, https://etcd-b.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:2380, https://etcd-b.internal.test-etcd-1.cloud-platform.service.justice.gov.uk:4001
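
Then, as noted above, repeat the removal against the etcd-events cluster by pointing the same command at port 4002, using the member ID of the failed node taken from the etcd-events member list:

ETCDCTL_API=3 $DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4002 \
  member remove <member_id>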

Replace the failed master via kops

You need to terminate the failed master, remove the volumes attached to it, and launch a new master with kops.

1) Back up the volumes for both etcd-main and etcd-events by creating snapshots from the AWS console.
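
The snapshots can also be created with the AWS CLI rather than the console; a minimal sketch, assuming AWS_PROFILE is set as above and using a placeholder volume ID (repeat for both the main and events volumes):

aws ec2 create-snapshot \
  --volume-id <volume_id> \
  --description "etcd backup before replacing failed master"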

2) Take an etcdctl backup by using docker exec to launch a shell in the kopeio/etcd-manager container on any of the working master nodes, then run this command:

ETCDCTL_API=3 $DIRNAME/etcdctl \
  --cacert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
  --cert=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
  --key=/rootfs/etc/kubernetes/pki/kube-apiserver/etcd-client.key \
  --endpoints=https://127.0.0.1:4001 \
snapshot save /rootfs/mnt/snapshot.db

You will get a message like:

Snapshot saved at /rootfs/mnt/snapshot.db

The snapshot.db file is saved at /rootfs/mnt/ inside the container, which is a volume mount of the node's local /mnt directory.
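
Optionally, you can sanity-check the backup from inside the same container with etcdctl's snapshot status subcommand (not required by this runbook, just a quick verification):

ETCDCTL_API=3 $DIRNAME/etcdctl snapshot status /rootfs/mnt/snapshot.db --write-out=table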

Delete failed master node

3) Edit the kops instance group for the master you removed to scale it down, by setting the maxSize and minSize values to 0, then run a kops update on the cluster, e.g.:

kops edit instancegroup master-eu-west-2a
kops update cluster <cluster_name>.cloud-platform.service.justice.gov.uk --yes
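
Inside kops edit, the relevant part of the instance group spec should end up looking roughly like this (an illustrative fragment; leave the rest of the spec unchanged):

spec:
  maxSize: 0
  minSize: 0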

4) In the Amazon EC2 console, on the Instances page, search using <cluster_name> to locate the failed master instance:

 master-<availability_zone>.masters.<cluster_name>.cloud-platform.service.justice.gov.uk

Terminate the failed master instance following this guide.
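
If you are working from the AWS CLI rather than the console, terminating by instance ID looks like this (shown with the example instance ID from the validation output; substitute the ID of your failed master):

aws ec2 terminate-instances --instance-ids i-046ed8d94d65719a6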

Delete etcd volumes

5) In the Amazon EC2 console, on the Volumes page, search using <cluster_name> to locate the volumes associated with the broken master. Delete both the etcd-main and etcd-events volumes for the failed master.

For example, for the master node instance

master-eu-west-2a.masters.live-1.cloud-platform.service.justice.gov.uk

…you should see volumes called:

a.etcd-main.live-1.cloud-platform.service.justice.gov.uk
a.etcd-events.live-1.cloud-platform.service.justice.gov.uk
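
This step can also be done with the AWS CLI if you prefer; an illustrative sketch that first finds a volume ID by its Name tag and then deletes it (the volume must already be detached, which it will be once the failed instance has been terminated):

aws ec2 describe-volumes \
  --filters Name=tag:Name,Values=a.etcd-main.live-1.cloud-platform.service.justice.gov.uk \
  --query 'Volumes[].VolumeId' --output text

aws ec2 delete-volume --volume-id <volume_id>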

Add new master node

6) Create the new master instance with kops.

Edit the kops instance group and scale back up by setting the maxSize and minSize values back to 1, then run a kops update on the cluster.

kops edit instancegroup master-eu-west-2a
kops update cluster <cluster_name>.cloud-platform.service.justice.gov.uk --yes

A new master will be launched along with new etcd volumes attached to it, and should join the etcd cluster. This can take 10-15 minutes. After this, validate the cluster:

kops validate cluster <cluster_name>.cloud-platform.service.justice.gov.uk

You should see the new master in the list, and the cluster should be in ready status.

Your cluster <cluster_name>.cloud-platform.service.justice.gov.uk is ready
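
As an extra check, you can confirm that the etcd-manager pods on all masters are running, and re-run the earlier member list command (against ports 4001 and 4002) to confirm the new member has joined and is started:

kubectl get pods -n kube-system | grep etcd-manager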

If the cluster is still in a ‘not ready’ status, restart etcd on all masters, following the procedure described in restart etcd.

This page was last reviewed on 13 October 2021. It needs to be reviewed again on 13 January 2022 by the page owner #cloud-platform.