
Cloud Platform Disaster Recovery Scenarios

This section documents various scenarios that would create an incident. Each scenario includes its impact and the restore process.

Losing a Namespace

Impact

The loss of a namespace results in the loss of all resources within that namespace, including ingresses.

Possible Cause

User error - Executing kubectl delete ns my-namespace-dev

kubectl get all -n my-namespace-dev
No resources found in my-namespace-dev namespace.

Restore process

As long as the namespace is over 3 hours old, it can be recovered using Velero.

The first step is to find the name of the most recent backup taken by the allnamespacebackup schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-allnamespacebackup-00000000000000   Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a Velero restore command to restore the deleted namespace.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait

Example:

velero restore create my-namespace-dev-restore --from-backup velero-allnamespacebackup-00000000000000 --include-namespaces my-namespace-dev --wait

Once completed, you should be able to see the namespace and all resources recovered:

kubectl get all -n my-namespace-dev

NAME                       READY   STATUS      RESTARTS   AGE
pod/pod1                   2/2     Running     0          1m

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)      AGE
service/service1           ClusterIP   None             None          443/TCP      1m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/deployment1          1/1     1            1           1m

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/replicaset1          1         1         1       1m

Losing a Kubernetes Component or Object

Impact

The loss of a Kubernetes object such as a deployment, secret or namespace can result in an application not working correctly. The loss of a Kubernetes component such as kube-scheduler or kube-controller-manager can result in the cluster not working correctly.

Possible Cause

User error - Executing kubectl delete deployment/my-deployment-name -n my-namespace-dev

kubectl get deployment/my-deployment-name -n my-namespace-dev
Error from server (NotFound): deployments.extensions "my-deployment-name" not found.

Restore process

As long as the resource is over 3 hours old, it can be recovered using Velero.

The first step is to find the name of the most recent backup taken by the schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-backup-00000000000000               Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a Velero restore command to restore the deleted resource.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait

Example:

velero restore create my-namespace-dev-restore --from-backup velero-backup-00000000000000 --include-namespaces my-namespace-dev --wait

Once completed, you should be able to see the resources recovered:

kubectl get deployment/my-deployment-name -n my-namespace-dev

NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-deployment-name        1/1       1            1         1m

Losing the whole cluster

Impact

Severity : HIGH

Likelihood: LOW

Losing the whole Kubernetes cluster will result in all services becoming unavailable. This includes all the resources within all namespaces, as well as the networking, monitoring and logging systems.

Possible Cause

Losing the whole cluster is less likely, and the possible causes may vary.

User error - Deleting all EC2 instances of <cluster-name> from console by accident

The first sign of this happening is likely to be PagerDuty alarms in the high/low-priority Slack channels.

How this plan is tested:

This way of restoring the whole cluster has been tested with the procedure below:

  1. Create a cluster using the script
  2. Deploy the starter pack to replicate user namespaces
  3. Take a backup of the whole cluster using Velero (see the example command after this list)
  4. Destroy the cluster, excluding the VPC, using the script
  5. Create a cluster with the same name, so that Velero can match the backup storage location
  6. Restore the cluster from the backup taken in step 3
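
As a minimal sketch of step 3 (the backup name pre-dr-test-backup is hypothetical): with no --include-namespaces flag, Velero includes all namespaces in the backup.

velero backup create pre-dr-test-backup --wait

velero backup describe pre-dr-test-backup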

Assumptions

  • The VPC of the cluster is not deleted
  • Not all Kubernetes resources are tested for recovery
  • Not all AWS resources are tested for reconnecting to applications after restore
  • Smoke tests are run, excluding the one for live-1
  • The Terraform state has been recreated

Restore process

Any namespaces over 3 hours old can be recovered using Velero (newer namespaces might not have been backed up before the incident occurred).

Create the cluster with the same name from the source code and provide the existing vpc-name. This links the new cluster to the Velero backup locations of the lost cluster.

Find the name of the most recent backup of the allnamespacebackup schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-allnamespacebackup-00000000000000   Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a velero restore command to restore the deleted namespace.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --wait

Example:

velero restore create my-namespace-dev-restore --from-backup velero-allnamespacebackup-00000000000000 --wait

Once completed, you should be able to see resources in all namespaces recovered.
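
If you did not note the generated restore name, list the restores to find it:

velero restore get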

Check the status of the restore and look for errors and warnings:

velero restore describe velero-allnamespacebackup-00000000000000-00000000000000

Check the logs of the restore:

velero restore logs velero-allnamespacebackup-00000000000000-00000000000000

Deleted/corrupted the kops state

Impact

Severity : LOW

Likelihood: LOW/MEDIUM

When the kops state is corrupted, in most cases the Kubernetes cluster will still be operational.

This means services will not be impacted, provided the issue is resolved quickly.

Leaving the cluster with a corrupted kops state for a prolonged period reduces its high availability and its ability to handle failover if one or more masters goes down.

The Cloud Platform team will also not be able to perform any infrastructure changes, and teams will not be able to communicate with the cluster to deploy applications or schedule any jobs.

Possible Cause

The kops state may get corrupted when the kops template is changed for:

  • upgrading kubernetes
  • changing an instancegroup
  • any other config changes

It could also be corrupted or lost if a user accidentally deletes the S3 bucket where the kops state is stored.

The first sign of this happening can be seen when running kops validate cluster <cluster name>.

If the kops state is corrupted, some of the worker nodes or masters machines will not join the cluster.

The above command will show something similar to


VALIDATION ERRORS
KIND      NAME                                          MESSAGE
Machine   i-04195929decb7a43b                           machine "i-04195929decb7a43b" has not yet joined cluster
Node      ip-172-20-74-189.eu-west-2.compute.internal   node "ip-172-20-74-189.eu-west-2.compute.internal" is not ready
Pod       kube-system/calico-node-kp6m7                 kube-system pod "calico-node-kp6m7" is pending

Validation Failed

If the S3 bucket where the kops state is stored has been deleted, then kops validate ... shows:


error reading cluster configuration: Cluster.kops.k8s.io "<cluster name>.cloud-platform.service.justice.gov.uk" not found

Restore process for corrupted kops state

The cluster's kops state is stored in the generated yaml file kops/<CLUSTER_NAME>.yaml.

For live-1, the generated kops yaml file is stored in GitHub, so check out the previous working commit and update the remote kops state with that file.

If the cluster state is corrupted due to the latest changes to the yaml file, revert the changes and update the remote kops state.

Below are the steps to update and restore the kops state:

Revert the kops yaml file kops/<CLUSTER_NAME>.yaml to the previous working version.
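
For example, a sketch of reverting the file with git; the commit hash is a placeholder, and the path assumes the yaml is versioned under kops/ as described above:

git log --oneline -- kops/<CLUSTER_NAME>.yaml
git checkout <commit-sha> -- kops/<CLUSTER_NAME>.yaml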

Authenticate to the cluster

kops export kubecfg XXXXXXX.cloud-platform.service.justice.gov.uk

Update the remote state with the reverted changes

kops replace -f kops/<CLUSTER_NAME>.yaml
kops update cluster

Update the cluster with --yes to apply the changes

kops update cluster --yes

If the above command reports that the cluster needs to be rolled, perform the rolling update with the --yes option:

kops rolling-update cluster --yes

For more information on kops update, rolling-update and debugging instructions, refer to the runbook for kops update.

If the restore process requires creating a new instancegroup for the worker nodes, follow the procedure described in Upgrade a cluster.

The differences from the upgrade procedure are:

  • reverting the kops yaml file to the old working version of Kubernetes, and

  • if a new instancegroup for worker nodes is needed, creating one with the working Kubernetes version and moving the workload back to those worker nodes.

Restore process for deleted S3 bucket where kops state is stored

The kops state of all clusters is backed up to a separate S3 bucket, kops-backup-replication, in the eu-west-2 region.

If the entire folder where the cluster state is stored is lost or corrupted, copy the folder from the backup bucket by running the command below:

aws s3 cp s3://kops-backup-replication/<cluster_name>.cloud-platform.service.justice.gov.uk/ s3://cloud-platform-kops-state/ --recursive

Validate the cluster

kops validate cluster <cluster name>

The validation should succeed and report:

“Your cluster <cluster name>.cloud-platform.service.justice.gov.uk is ready”

Deleted terraform state

Severity : low

Likelihood: medium

Impact

Items removed from the Terraform state are not physically destroyed, but are no longer managed by Terraform. This means the infrastructure is orphaned, with no state.

For example, if you remove an AWS instance from the state, the AWS instance will continue running, but terraform plan will no longer see that instance.
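
You can confirm whether a resource is still tracked by listing the state (the resource name is a placeholder):

terraform state list | grep <resource-name>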

Possible Cause

User error - Executing terraform state rm module.starter_pack.kubernetes_namespace.starter_pack will remove the kubernetes starter_pack namespace from the Terraform state. When the state entry is removed, running terraform apply will raise an error, because Terraform can no longer track the existing resource and tries to create a new one.

terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack

Error: namespaces "starter-pack" already exists

Restore process

In order to regain control of a resource removed from the Terraform state, we use the import command to recover the metadata for that resource. Import requires a resource address and the resource ID. The resource address is a combination of the resource type and the resource name.

terraform import module.starter_pack.kubernetes_namespace.starter_pack starter-pack

The import command updates the state, pulling in information about the specified kubernetes_namespace.

module.starter_pack.kubernetes_namespace.starter_pack: Importing from ID "starter-pack"...
module.starter_pack.kubernetes_namespace.starter_pack: Refreshing state... [id=starter-pack]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.

Running terraform apply after the import will no longer raise an error.

terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Recovering more complex scenarios

There are scenarios where the Terraform state is corrupted, or where the state is removed for a whole module. Executing terraform state rm module.starter_pack will remove all resources in the starter_pack module from the Terraform state.

terraform plan -target=module.starter_pack

Plan: 7 to add, 0 to change, 0 to destroy.

In this scenario, the Terraform state can be restored from the remote state stored in the Terraform backend S3 bucket.

For example, the cloud-platform-components state is stored in the cloud-platform-terraform-state/cloud-platform-components S3 bucket, as defined here.

Access the S3 bucket where the affected Terraform state is stored. From the list of terraform.tfstate file versions, identify the version from before the state was removed and download it as terraform.tfstate. Upload the file again; this sets the uploaded file as the latest version.
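
A sketch of doing this with the AWS CLI, assuming the bucket and key from the example above (verify the exact key path against the backend configuration):

aws s3api list-object-versions --bucket cloud-platform-terraform-state --prefix cloud-platform-components/terraform.tfstate

aws s3api get-object --bucket cloud-platform-terraform-state --key cloud-platform-components/terraform.tfstate --version-id <VERSION_ID> terraform.tfstate

aws s3 cp terraform.tfstate s3://cloud-platform-terraform-state/cloud-platform-components/terraform.tfstate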

Running terraform plan will now show that the infrastructure is up to date.

terraform plan -target=module.starter_pack

No changes. Infrastructure is up-to-date.

Lost or corrupt etcd volume

Severity : low

Likelihood: medium

Impact

The Kubernetes live-1 cluster deployed with kops stores the etcd state in two different AWS EBS volumes per master node. One volume stores the main Kubernetes data, the other stores events. A cluster with three masters therefore has six volumes for etcd data (one main and one events volume in each AZ).

A lost or corrupt etcd volume results in an unhealthy etcd cluster. The master node associated with the corrupt or lost etcd volume will enter a NotReady state and fail to join the cluster.
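
To identify the etcd volumes for a cluster, you can filter EBS volumes by the cluster tag; the KubernetesCluster tag key below is assumed from kops conventions, so verify it against your volumes in the console:

aws ec2 describe-volumes --filters "Name=tag:KubernetesCluster,Values=<cluster-name>.cloud-platform.service.justice.gov.uk"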

Possible Cause

  • Lost volume: deleting one of the etcd volumes associated with a master node from the AWS console by accident
  • Corrupt volume: accidentally restoring an etcd volume from the wrong snapshot and attaching it to a working master node

You will see warnings like this in the master's etcd-manager log:

kubectl -n kube-system logs etcd-manager-main-ip-xxx-xx-xx-xxx.eu-west-2.compute.internal

etcdserver: failed to reach the peerURL(https://etcd-x.internal.<cluster-name>.cloud-platform.service.justice.gov.uk:2380) of member <member-id> 
rafthttp: health check for peer <member-id> connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
rafthttp: health check for peer <member-id> connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")

kops validate cluster

VALIDATION ERRORS
KIND    NAME            MESSAGE
Node    <node-name>     master "<node-name>" is not ready

Validation Failed

Restore process

Restoring a corrupt or lost etcd volume involves two steps: remove the unhealthy etcd member and replace the affected master node via kops. We already have a runbook covering both steps; follow it step by step to restore the unhealthy etcd.

Overwriting KOPS cluster’s components with EKS components

Impact

Certain components that are unique to EKS, such as concourse, will be introduced to the KOPS cluster. A number of components will have specific annotations applied to them that are only applicable to an EKS cluster.

Possible Cause

The EKS components are applied whilst the kubectl context is set to a kops cluster.

Restore process

Use terraform destroy to delete all existing components from the cloud-platform-eks/components folder and re-apply using terraform apply from the cloud-platform-components folder.

Only those resources that are part of the state can be re-applied; however, certain resources created by service teams, such as certificates and resources created from CRDs, will not be restored by terraform apply. For these resources, a Velero restore will have to be executed.

Reproducing Incident Steps

To test the recovery of the CRD resources, first ensure that they exist, so that after recovery we can verify they have been restored by Velero. Below are templates for each of the CRDs used.

Service Monitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dr-test-service
spec:
  selector:
    matchLabels:
      app: dr-test-service
  endpoints:
  - port: http
    interval: 15s

Save the above yaml and apply it to the desired namespace, e.g. kubectl create -f <FILE_NAME>.yaml -n <NAMESPACE>

Prometheus Rule

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    role: alert-rules
  name: prometheus-custom-rules-data-dev
spec:
  groups:
  - name: node.rules
    rules:
    - alert: cica--Quota-Exceeded--dev
      expr: 100 * kube_resourcequota{job="kube-state-metrics",type="used",namespace="<NAMESPACE>"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 90
      for: 5m
      labels:
        severity: cica-dev-dev
      annotations:
        message: Namespace {{ $labels.namespace }} is using {{ printf "%0.0f" $value}}% of its {{ $labels.resource }} quota.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree
    - alert: cica--KubeJobFailed--dev
      annotations:
        message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed
      expr: kube_job_status_failed{job="kube-state-metrics",namespace="<NAMESPACE>"}  > 0
      for: 1m
      labels:
        severity: cica-dev-dev
    - alert: cica--KubeletTooManyPods--dev
      expr: kubelet_running_pod_count{job="kubelet",namespace="<NAMESPACE>"}
        > 0 * 0.9
      for: 5m
      labels:
        severity: cica-dev-team
      annotations:
        message: Kubelet {{ $labels.instance }} is running {{ $value }} Pods, close to the
          limit of 1 Test.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods
    - alert: cica-dev-Node-Disk-Space-Low
      annotations:
        message: A node is reporting low disk space below 10%. Action is required on a node.
        runbook_url: https://github.com/ministryofjustice/cloud-platform-infrastructure/blob/master/terraform/cloud-platform-components/resources/prometheusrule-alerts/Custom-Alerts.md#node-disk-space-low
      expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes) < 10
      for: 30m
      labels:
        severity: cica-dev-team
    - alert: cica--KubeDeploymentReplicasMismatch--dev
      annotations:
        message: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than an hour.
        runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch
      expr: kube_deployment_spec_replicas{job="kube-state-metrics",namespace="<NAMESPACE>"}
        != kube_deployment_status_replicas_available{job="kube-state-metrics",namespace="<NAMESPACE>"}
      for: 1h
      labels:
        severity: cica-dev-dev

Replace the placeholder <NAMESPACE> with the namespace where this will be created.

Save the above yaml and apply it to the desired namespace, e.g. kubectl create -f <FILE_NAME>.yaml -n <NAMESPACE>
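
For example, a one-liner to substitute the placeholder and create the rule in one step (the file name and namespace are placeholders):

NAMESPACE=my-namespace-dev
sed "s/<NAMESPACE>/${NAMESPACE}/g" prometheus-custom-rules.yaml | kubectl create -f - -n "${NAMESPACE}"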

Certificate

apiVersion: cert-manager.io/v1alpha3
kind: Certificate
metadata:
  name: cert-dr-test
spec:
  secretName: dr-test-tls-certificate
  issuerRef:
    name: letsencrypt-production
    kind: ClusterIssuer
  dnsNames:
  - helloworld-app-starter-pack.apps.<CLUSTER_NAME>.cloud-platform.service.justice.gov.uk

Save the above yaml and apply it to the desired namespace, e.g. kubectl create -f <FILE_NAME>.yaml -n <NAMESPACE>

Velero Backup

Before destroying, ensure you have taken a backup using Velero:

velero backup create my-app-backup01
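
Confirm the backup completed before continuing:

velero backup describe my-app-backup01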

Apply the EKS components to the KOPS cluster

1) Ensure you have access to both a test KOPS cluster and a test EKS cluster

2) Ensure the kubectl context is set to the KOPS cluster

3) Browse to cloud-platform-infrastructure/terraform/cloud-platform-eks/components

4) Ensure you are on the terraform workspace of the kops cluster (see the sketch after these steps)

5) Execute terraform apply (enter 'yes' when prompted to confirm the apply)
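
A minimal sketch of steps 2, 4 and 5, assuming the terraform workspace is named after the kops cluster (the workspace name is a placeholder):

kubectl config current-context
cd cloud-platform-infrastructure/terraform/cloud-platform-eks/components
terraform workspace list
terraform workspace select <kops-cluster-workspace>
terraform apply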

Observations of Incident

Once the EKS components have been applied to the KOPS cluster, the following changes to the cluster should be observed:

  • The following components are newly created in the kops cluster (unique to EKS):

    • concourse
  • The following components are modified:

Cert-manager

(1)

The following annotation is applied to the deployment:

iam.amazonaws.com/permitted : aws_iam_role.cert_manager.0.name

(2)

In the ClusterIssuer resource, the IAM role is applied to route53 within the solvers block:

    solvers:
    - selector: {}
      dns01:
        cnameStrategy: "Follow"
        # Here we define a list of DNS-01 providers that can solve DNS challenges
        route53:
          region: eu-west-2
          role: "${iam_role}"

External DNS

(1)

The following annotations are applied to the external-dns pods:

podAnnotations:
  iam.amazonaws.com/role: "${iam_role}"

The following annotations are added to the external-dns deployment:

  annotations:
    eks.amazonaws.com/role-arn: "${eks_service_account}"  

Destroy the EKS components

The first step for recovery is to destroy what was applied from the same directory.

Ensure you are in the cloud-platform-infrastructure/terraform/cloud-platform-eks/components directory.

Execute terraform destroy -auto-approve

Re-apply the KOPS components

Re-create the components, but this time run terraform apply -auto-approve from the ...terraform/cloud-platform-components directory.

After re-creating the components, you may notice that the CRD resources and ingress certificates are still missing. To restore these, run velero restore... as follows:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-backup-00000000000000               Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a velero restore... command to restore the deleted resource.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> 

Example:

velero restore create my-namespace-dev-restore --from-backup velero-backup-00000000000000

Once completed, you should be able to see the CRD resources that were created before the incident:

kubectl get prometheusrules -n <NAMESPACE>
kubectl get servicemonitors -n <NAMESPACE>
kubectl get certificates -n <NAMESPACE>

This page was last reviewed on 22 June 2021. It needs to be reviewed again on 22 September 2021 by the page owner #cloud-platform .