Cloud Platform Disaster Recovery Scenarios
This section documents scenarios that could cause an incident. Each scenario describes the impact and the restore process.
Losing a Namespace
Impact
A loss of a namespace will result in losing all resources within the namespace including ingress.
Possible Cause
User error - Executing kubectl delete ns my-namespace-dev
kubectl get all -n my-namespace-dev
No resources found in my-namespace-dev namespace.
Restore process
As long as the namespace is over 3 hours old, it can be recovered using Velero.
The first step is to find the name of the most recent backup of the allnamespacebackup schedule:
velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-allnamespacebackup-00000000000000 Completed 2020-03-30 12:00:37 +0100 BST 29d default <none>
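When the backup list is long, picking the newest entry by eye is error-prone. The following is a minimal sketch of one way to select it, assuming that backups created by a schedule are named <schedule>-<YYYYMMDDHHMMSS>, so that sorting by name also sorts by time. latest_backup is a hypothetical helper, not part of Velero.

```shell
# Print the newest Completed backup for a schedule.
# Reads the output of `velero backup get` on stdin.
# Assumption: backup names end in a sortable timestamp (YYYYMMDDHHMMSS).
latest_backup() {
  schedule="$1"
  awk -v prefix="${schedule}-" \
    '$2 == "Completed" && index($1, prefix) == 1 { print $1 }' \
    | sort | tail -n 1
}

# Usage against a live cluster:
#   velero backup get | latest_backup velero-allnamespacebackup
```

Filtering on the STATUS column means a PartiallyFailed backup is never selected, even if it is the most recent one.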
Using the backup name, you can now run a Velero restore command to restore the deleted namespace.
Restore command:
velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait
Example:
velero restore create my-namespace-dev-restore --from-backup velero-allnamespacebackup-00000000000000 --include-namespaces my-namespace-dev --wait
Once completed, you should be able to see the namespace and all resources recovered:
kubectl get all -n my-namespace-dev
NAME READY STATUS RESTARTS AGE
pod/pod1 2/2 Running 0 1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/service1 ClusterIP None None 443/TCP 1m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/deployment1 1/1 1 1 1m
NAME DESIRED CURRENT READY AGE
replicaset.apps/replicaset1 1 1 1 1m
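The restore command above can be wrapped so that the namespace and backup name are checked before anything is run. This is a sketch only: restore_namespace and the VELERO variable (used to swap in a different binary, or echo for a dry run) are hypothetical names introduced here.

```shell
# Restore a single namespace from a named backup.
# VELERO is an assumed override for the velero binary (e.g. VELERO=echo for a dry run).
restore_namespace() {
  namespace="$1"
  backup="$2"
  if [ -z "$namespace" ] || [ -z "$backup" ]; then
    echo "usage: restore_namespace <namespace> <backup-name>" >&2
    return 2
  fi
  # Same flags as the documented restore command; the restore is named after the namespace.
  "${VELERO:-velero}" restore create "${namespace}-restore" \
    --from-backup "$backup" \
    --include-namespaces "$namespace" \
    --wait
}

# Usage:
#   restore_namespace my-namespace-dev velero-allnamespacebackup-00000000000000
```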
Losing a Kubernetes Component or Object
Impact
A loss of a Kubernetes object such as a deployment, secret or namespace can result in an application not working correctly. A loss of a Kubernetes component such as kube-scheduler or kube-controller-manager can result in the cluster not working correctly.
Possible Cause
User error - Executing kubectl delete deployment/my-deployment-name -n my-namespace-dev
kubectl get deployment/my-deployment-name -n my-namespace-dev
Error from server (NotFound): deployments.extensions "my-deployment-name" not found.
Restore process
As long as the namespace is over 3 hours old, it can be recovered using Velero.
The first step is to find the name of the most recent backup of the schedule:
velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-backup-00000000000000 Completed 2020-03-30 12:00:37 +0100 BST 29d default <none>
Using the backup name, you can now run a Velero restore command to restore the deleted resource.
Restore command:
velero restore create <restore-name> --from-backup <backup-name> --include-namespaces <namespace> --wait
Example:
velero restore create my-namespace-dev-restore --from-backup velero-backup-00000000000000 --include-namespaces my-namespace-dev --wait
Once completed, you should be able to see the resources recovered:
kubectl get deployment/my-deployment-name -n my-namespace-dev
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/my-deployment-name 1/1 1 1 1m
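Restoring the whole namespace to recover one object is safe because Velero, by default, skips resources that already exist in the cluster. If a narrower restore is preferred, velero restore create also accepts --include-resources. The sketch below limits the restore to one resource type; restore_resource_type and the VELERO override are hypothetical names, and this variant is not part of the tested procedure in this document.

```shell
# Restore only one resource type (e.g. deployments) within a namespace.
# VELERO is an assumed override for the velero binary (e.g. VELERO=echo for a dry run).
restore_resource_type() {
  namespace="$1"
  backup="$2"
  resource="$3"
  if [ -z "$namespace" ] || [ -z "$backup" ] || [ -z "$resource" ]; then
    echo "usage: restore_resource_type <namespace> <backup-name> <resource-type>" >&2
    return 2
  fi
  "${VELERO:-velero}" restore create "${namespace}-${resource}-restore" \
    --from-backup "$backup" \
    --include-namespaces "$namespace" \
    --include-resources "$resource" \
    --wait
}

# Usage:
#   restore_resource_type my-namespace-dev velero-backup-00000000000000 deployments
```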
Losing the whole cluster
Impact
Severity: HIGH
Likelihood: LOW
Losing the whole Kubernetes cluster will result in all services not being available. This includes all the resources within all namespaces, as well as the networking, monitoring and logging systems.
Possible Cause
Losing the whole cluster is less likely and the possible causes may vary.
User error - Deleting all EC2 instances of <cluster-name> from the console by accident
The first sign of this happening is likely to be PagerDuty alarms to the high/low priority Slack channels.
How this plan is tested:
This way of restoring the whole cluster has been tested with the procedure below:
- Create cluster using the script
- Deploy starter pack to replicate user namespaces
- Take backup of the whole cluster using velero
- Destroy the cluster excluding the VPC using the script
- Create a cluster with the same name. That way velero can match the backup storage location
- Restore the cluster from the backup taken in Step 3
Assumptions
- VPC of the cluster is not deleted
- Not all Kubernetes resources are tested for Recovery
- Not all AWS resources are tested for reconnecting to application after restore
- Smoke tests are run excluding one for live
- Terraform state has been recreated
Restore process
Any namespaces over 3 hours old can be recovered using Velero (newer namespaces might not have been backed up before the incident occurred).
Create the cluster with the same name from the source code and provide the existing vpc-name. This will link the velero backup locations to the lost cluster.
Find the name of the most recent backup of the allnamespacebackup schedule:
velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-allnamespacebackup-00000000000000 Completed 2020-03-30 12:00:37 +0100 BST 29d default <none>
Using the backup name, you can now run a velero restore command to restore the resources in all namespaces.
Restore command:
velero restore create <restore-name> --from-backup <backup-name> --wait
Example:
velero restore create my-namespace-dev-restore --from-backup velero-allnamespacebackup-00000000000000 --wait
Once completed, you should be able to see resources in all namespaces recovered.
Check the status of the restore and look for errors and warnings:
velero restore describe my-namespace-dev-restore
Check the logs of the restore:
velero restore logs my-namespace-dev-restore
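The status check can be scripted so that a restore that did not complete cleanly is flagged automatically. This is a minimal sketch, assuming the output of velero restore describe contains a line of the form "Phase: <value>"; restore_phase and restore_completed are hypothetical helpers.

```shell
# Print the phase reported by `velero restore describe` (read from stdin).
# Assumption: the describe output contains a line starting with "Phase:".
restore_phase() {
  awk '$1 == "Phase:" { print $2; exit }'
}

# Succeed only if the restore phase is exactly "Completed".
restore_completed() {
  [ "$(restore_phase)" = "Completed" ]
}

# Usage:
#   velero restore describe <restore-name> | restore_completed \
#     || echo "restore did not complete cleanly, check the restore logs"
```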
Deleted terraform state
Severity: LOW
Likelihood: MEDIUM
Impact
Items removed from the terraform state are not physically destroyed but no longer managed by terraform. This means infrastructure is orphaned with no state.
For example, if you remove an AWS instance from the state, the AWS instance will continue running, but terraform plan will no longer see that instance.
Possible Cause
User error - Executing terraform state rm module.starter_pack.kubernetes_namespace.starter_pack will remove the kubernetes starter_pack namespace from terraform state. Once the resource is removed from state, running “terraform apply” will raise an error, because terraform no longer tracks the existing resource and tries to create a new one.
terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack
Error: namespaces "starter-pack" already exists
Restore process
To regain control of a resource removed from terraform state, use the import command to recover the metadata for this resource. Import requires a resource address and the resource ID. The resource address is a combination of the resource type and resource name.
terraform import module.starter_pack.kubernetes_namespace.starter_pack starter-pack
The import command writes the resource back into the state, pulling in information about the specified kubernetes_namespace.
module.starter_pack.kubernetes_namespace.starter_pack: Importing from ID "starter-pack"...
module.starter_pack.kubernetes_namespace.starter_pack: Refreshing state... [id=starter-pack]
Import successful!
The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform.
Running terraform apply after the import will no longer raise an error.
terraform apply -target=module.starter_pack.kubernetes_namespace.starter_pack
Apply complete! Resources: 0 added, 0 changed, 0 destroyed.
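Before and after an import, it can be useful to confirm whether a resource address is actually tracked in the state. The sketch below checks the output of terraform state list for an exact match; in_state is a hypothetical helper name.

```shell
# Succeed if the given resource address appears in `terraform state list` output (stdin).
# -F treats the address as a literal string, -x requires a whole-line match,
# so a longer address that merely contains this one does not count.
in_state() {
  grep -Fxq -- "$1"
}

# Usage:
#   terraform state list \
#     | in_state module.starter_pack.kubernetes_namespace.starter_pack \
#     && echo "managed by terraform" || echo "not in state, import required"
```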
Recovering more complex scenarios
There are scenarios where terraform state is corrupted or terraform state is removed for the whole module.
Executing terraform state rm module.starter_pack will remove all resources in the “starter_pack” module from terraform state.
terraform plan -target=module.starter_pack
Plan: 7 to add, 0 to change, 0 to destroy.
In this scenario, terraform state can be restored from the remote_state stored in the terraform backend S3 bucket.
For example, the eks/components state is stored in the “aws-accounts/cloud-platform-aws/vpc/eks/components” S3 bucket, as defined here-eks.
Access the S3 bucket where the affected terraform state is stored. From the list of terraform.tfstate file versions, identify the version from before the state was removed and download it as terraform.tfstate. Upload the file again; this sets the uploaded file as the latest version.
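The same steps can be performed with the AWS CLI instead of the console. This is a sketch under assumptions: the bucket name, object key and version ID are placeholders you must supply, bucket versioning is assumed to be enabled, and restore_state_version and the AWS override variable are hypothetical names.

```shell
# List the available versions first and pick the one from before the state was removed:
#   aws s3api list-object-versions --bucket <bucket> --prefix <key>

# Download a historical version of the state file and upload it again as the latest version.
# AWS is an assumed override for the aws binary (e.g. AWS=echo for a dry run).
restore_state_version() {
  bucket="$1"
  key="$2"
  version_id="$3"
  # Fetch the chosen version into a local terraform.tfstate file.
  "${AWS:-aws}" s3api get-object --bucket "$bucket" --key "$key" \
    --version-id "$version_id" terraform.tfstate || return 1
  # Re-upload it so that it becomes the newest version of the object.
  "${AWS:-aws}" s3 cp terraform.tfstate "s3://${bucket}/${key}"
}

# Usage:
#   restore_state_version <bucket> <key> <version-id>
```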
Running terraform plan will now show that the infrastructure is up-to-date.
terraform plan -target=module.starter_pack
No changes. Infrastructure is up-to-date.
Resolving a PartiallyFailed backup alert
A backup may fail and trigger an alert in lower-priority-alerts. Inspect the backup job:
kubectl get backup -n velero | grep -C 30 YYYYMMDD
You can identify the failed backup by its phase, PartiallyFailed, and there will also be an errors field with a count.
To understand the cause of the alert, pull out the error messages from the velero pod in Kibana:
kubernetes.pod_name: velero-<container-id> and log: "level=error"
Sometimes the cause of the alert can be genuine, for instance when a volume has been removed (for example, by a pod restart during a backup):
level=error msg="Error backing up item" backup=velero/velero-allnamespacebackup-20231120090023 error="error getting volume info: rpc error: code = Unknown desc = InvalidVolume.NotFound: The volume 'vol-08d317558ab5bd46b' does not exist.\n\tstatus code: 400
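When there are many error lines, it helps to reduce them to the list of volumes that could not be found. A minimal sketch, assuming the error lines have been exported from Kibana to a text file or piped in, and that volume IDs have the form vol-<hex>; missing_volumes is a hypothetical helper.

```shell
# Print the unique EBS volume IDs mentioned in InvalidVolume.NotFound errors.
# Reads velero log lines on stdin.
missing_volumes() {
  grep -F 'InvalidVolume.NotFound' \
    | grep -o 'vol-[0-9a-f]*' \
    | sort -u
}

# Usage:
#   missing_volumes < velero-errors.log
# Each ID can then be checked with the AWS CLI, assuming you have access:
#   aws ec2 describe-volumes --volume-ids <volume-id>
```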