
Cloud Platform Disaster Recovery Scenarios

This section documents scenarios that would create an incident. Each scenario describes its impact and the restore process.

Losing a Namespace

Impact

Losing a namespace results in the loss of all resources within it, including ingress.

Possible Cause

User error - Executing kubectl delete ns my-namespace-dev

kubectl get all -n my-namespace-dev
No resources found in my-namespace-dev namespace.

Restore process

As long as the namespace is over 3 hours old, it can be recovered using Velero.
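
To check when the allnamespacebackup schedule last ran, you can list the Velero schedules first:

velero schedule get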

Next, find the name of the most recent backup of the allnamespacebackup schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-allnamespacebackup-00000000000000   Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a Velero restore command to restore the deleted namespace.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait

Example:

velero restore create my-namespace-dev-restore --from-backup velero-allnamespacebackup-00000000000000 --include-namespaces my-namespace-dev --wait

Once completed, you should be able to see the namespace and all resources recovered:

kubectl get all -n my-namespace-dev

NAME                       READY   STATUS      RESTARTS   AGE
pod/pod1                   2/2     Running     0          1m

NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)      AGE
service/service1           ClusterIP   None             <none>        443/TCP      1m

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/deployment1          1/1     1            1           1m

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/replicaset1          1         1         1       1m
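
To verify the restore itself completed without errors, you can also inspect it with the Velero CLI (using the example restore name from above):

velero restore describe my-namespace-dev-restore
velero restore logs my-namespace-dev-restore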

Losing a Kubernetes Component or Object

Impact

The loss of a Kubernetes object such as a deployment, secret, or namespace can result in an application not working correctly. The loss of a Kubernetes component such as kube-scheduler or kube-controller-manager can result in the cluster not working correctly.

Possible Cause

User error - Executing kubectl delete deployment/my-deployment-name -n my-namespace-dev

kubectl get deployment/my-deployment-name -n my-namespace-dev
Error from server (NotFound): deployments.extensions "my-deployment-name" not found.

Restore process

As long as the resource is over 3 hours old (and was therefore captured in a backup), it can be recovered using Velero.

The first step is to find the name of the most recent backup of the schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-backup-00000000000000               Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a Velero restore command to restore the deleted resource.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait

Example:

velero restore create my-namespace-dev-restore --from-backup velero-backup-00000000000000 --include-namespaces my-namespace-dev --wait
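
If you want to restore only the deleted deployment rather than everything in the namespace, the restore can be narrowed with Velero's resource filters (a sketch; the restore name is illustrative):

velero restore create my-deployment-restore --from-backup velero-backup-00000000000000 --include-namespaces my-namespace-dev --include-resources deployments --wait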

Once completed, you should be able to see the resources recovered:

kubectl get deployment/my-deployment-name -n my-namespace-dev

NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/my-deployment-name   1/1     1            1           1m

Losing the whole cluster

Impact

Severity: HIGH

Likelihood: LOW

Losing the whole Kubernetes cluster will make all services unavailable. This includes all the resources within all namespaces, as well as the networking, monitoring, and logging systems.

Possible Cause

Losing the whole cluster is less likely than the scenarios above, and the possible causes vary.

User error - Deleting all EC2 instances of <cluster-name> from the console by accident

The first sign of this happening is likely to be PagerDuty alarms in the high/low priority Slack channels.

How this plan is tested:

This way of restoring the whole cluster has been tested with the following procedure:

  1. Create cluster using the script
  2. Deploy starter pack to replicate user namespaces
  3. Take a backup of the whole cluster using Velero (see the example after this list)
  4. Destroy the cluster, excluding the VPC, using the script
  5. Create a cluster with the same name, so that Velero can match the backup storage location
  6. Restore the cluster from the backup taken in step 3
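
For step 3, a one-off backup of the whole cluster can be taken with the Velero CLI (a sketch; the backup name is illustrative, and with no filters Velero backs up all namespaces):

velero backup create whole-cluster-backup --wait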

Assumptions

  • VPC of the cluster is not deleted
  • Not all Kubernetes resources are tested for recovery
  • Not all AWS resources are tested for reconnecting to the application after restore
  • Smoke tests are run, excluding the one for live
  • Terraform state has been recreated

Restore process

Any namespaces over 3 hours old can be recovered using Velero (newer namespaces might not have been backed up before the incident occurred).

Create the cluster with the same name from the source code and provide the existing vpc-name. This links the new cluster to the lost cluster's Velero backup locations.
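
Once the new cluster is up, you can confirm that Velero has picked up the existing backup storage location:

velero backup-location get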

Find the name of the most recent backup of the allnamespacebackup schedule:

velero backup get

NAME                                       STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
velero-allnamespacebackup-00000000000000   Completed   2020-03-30 12:00:37 +0100 BST   29d       default            <none>

Using the backup name, you can now run a Velero restore command to restore all namespaces.

Restore command:

velero restore create <restore-name> --from-backup <backup-name> --wait

Example:

velero restore create whole-cluster-restore --from-backup velero-allnamespacebackup-00000000000000 --wait

Once completed, you should be able to see resources in all namespaces recovered.

Check the status of the restore and look for errors and warnings:

velero restore describe whole-cluster-restore

Check the logs of the restore:

velero restore logs whole-cluster-restore

Deleted Terraform state

Severity: LOW

Likelihood: MEDIUM

Impact

Items removed from the Terraform state are not physically destroyed, but they are no longer managed by Terraform. This means the infrastructure is orphaned, with no state tracking it.

For example, if you remove an AWS instance from the state, the instance will continue running, but terraform plan will no longer see it.

Possible Cause

User error - Executing terraform state rm module.starter_pack.kubernetes_namespace.starter_pack will remove the starter_pack Kubernetes namespace from the Terraform state. Once the state entry is removed, running terraform apply will raise an error, because Terraform no longer tracks the existing resource and will try to create a new one.

terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack

Error: namespaces "starter-pack" already exists

Restore process

To regain control of a resource removed from the Terraform state, use the terraform import command to recover the metadata for the resource. Import requires a resource address and the resource ID; the resource address is a combination of the resource type and resource name.

terraform import module.starter_pack.kubernetes_namespace.starter_pack starter-pack

The import command updates the state, pulling in information about the specified kubernetes_namespace.

module.starter_pack.kubernetes_namespace.starter_pack: Importing from ID "starter-pack"...
module.starter_pack.kubernetes_namespace.starter_pack: Refreshing state... [id=starter-pack]

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.

Running terraform apply after the import will no longer raise an error.

terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
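
You can also confirm the resource is back under management by listing the state for the module (module path taken from the example above):

terraform state list module.starter_pack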

Recovering more complex scenarios

There are scenarios where the Terraform state is corrupted, or where state is removed for a whole module. Executing terraform state rm module.starter_pack will remove all resources in the starter_pack module from the Terraform state.

terraform plan -target=module.starter_pack

Plan: 7 to add, 0 to change, 0 to destroy.

In this scenario, the Terraform state can be restored from the remote state stored in the Terraform backend S3 bucket.

For example, the eks/components state is stored in the S3 bucket under "aws-accounts/cloud-platform-aws/vpc/eks/components", as defined in the eks backend configuration.

Access the S3 bucket where the affected Terraform state is stored. From the list of terraform.tfstate file versions, identify the version from before the state was removed and download it as terraform.tfstate. Upload the file again; this sets the uploaded file as the latest version.
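
A minimal sketch of the same steps with the AWS CLI, assuming a hypothetical bucket name (substitute your own backend bucket and key):

# List the versions of the state file
aws s3api list-object-versions --bucket my-terraform-state-bucket --prefix aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfstate

# Download the last-known-good version by its VersionId
aws s3api get-object --bucket my-terraform-state-bucket --key aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfstate --version-id <version-id> terraform.tfstate

# Re-upload it so it becomes the latest version
aws s3 cp terraform.tfstate s3://my-terraform-state-bucket/aws-accounts/cloud-platform-aws/vpc/eks/components/terraform.tfstate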

Running terraform plan now will show that the infrastructure is up-to-date.

terraform plan -target=module.starter_pack

No changes. Infrastructure is up-to-date.

Resolving a PartiallyFailed backup alert

A backup may fail and trigger an alert in lower-priority-alerts. Inspect the backup job:

kubectl get backup -n velero | grep -C 30 YYYYMMDD

The failed backup will show the phase PartiallyFailed, and there will also be an errors field with a count.
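
You can also inspect the failed backup directly with the Velero CLI (the backup name here is illustrative):

velero backup describe velero-allnamespacebackup-00000000000000 --details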

To understand the cause of the alert, pull out the error messages from the Velero pod in Kibana:

kubernetes.pod_name: velero-<container-id> and log: "level=error"
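
The same errors are available from Velero itself if you prefer the CLI to Kibana (again, the backup name is illustrative):

velero backup logs velero-allnamespacebackup-00000000000000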

Sometimes the cause of the alert is genuine; for instance, a volume may have been removed (a pod restarted during a backup):

level=error msg="Error backing up item" backup=velero/velero-allnamespacebackup-20231120090023 error="error getting volume info: rpc error: code = Unknown desc = InvalidVolume.NotFound: The volume 'vol-08d317558ab5bd46b' does not exist.\n\tstatus code: 400
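
To confirm the volume really is gone, you can query EC2 directly using the volume ID from the error message; a deleted volume returns the same InvalidVolume.NotFound error:

aws ec2 describe-volumes --volume-ids vol-08d317558ab5bd46b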