Incident on 2020-09-21 - Some cloud-platform components destroyed
Key events
- First detected 2020-09-21 18:27
- Incident declared 2020-09-21 18:40
- Repaired 2020-09-21 19:05
- Resolved 2020-09-21 21:41
Time to repair: 0h 38m
Time to resolve: 3h 14m
Identified: Some components of our production Kubernetes cluster (live-1) were accidentally deleted, which caused some services running on the Cloud Platform to go down.
Impact:
- Some users could not access services running on the Cloud Platform.
- Prometheus, Alertmanager and Grafana were not accessible.
- Kibana was not accessible.
- New certificates could not be created.
Context:
- The test cluster deletion script was triggered to delete a test cluster, but the kube context incorrectly targeted the live-1 cluster, so the script deleted some cloud-platform components.
- The deleted components included the default ingress-controller, prometheus-operator, logging, cert-manager, kiam and external-dns. Because the ingress-controller went down, some users could not access services running on the Cloud Platform.
- Formbuilder services remained inaccessible even after the ingress-controller was restored.
- Timeline: Timeline for the incident.
- Slack thread: Slack thread for the incident.
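
The root cause described above (a deletion script running against the wrong kube context) is the kind of mistake a pre-flight check can catch. The following is a minimal sketch of such a check, not the script that was actually involved in this incident. The names `PROTECTED_CLUSTERS`, `UnsafeContextError` and `check_deletion_target` are hypothetical, and matching cluster names as substrings of the context name is an assumption about how contexts are named.

```python
import subprocess

# Assumption: clusters that the test-cluster deletion script must never touch.
PROTECTED_CLUSTERS = {"live-1"}


class UnsafeContextError(RuntimeError):
    """Raised when the active kube context is not a safe deletion target."""


def current_kube_context() -> str:
    """Return the active context name as reported by kubectl."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def check_deletion_target(context: str, intended_cluster: str) -> None:
    """Raise unless the context points at the intended, unprotected cluster."""
    # Refuse outright if the context mentions any protected cluster.
    for name in PROTECTED_CLUSTERS:
        if name in context:
            raise UnsafeContextError(
                f"context {context!r} targets protected cluster {name!r}"
            )
    # Refuse if the context does not mention the cluster we meant to delete.
    if intended_cluster not in context:
        raise UnsafeContextError(
            f"context {context!r} does not match intended cluster "
            f"{intended_cluster!r}"
        )
```

A deletion script could call `check_deletion_target(current_kube_context(), cluster_name)` as its first step and abort on `UnsafeContextError`, so that a stale context fails loudly before anything is deleted.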
Resolution:
- The team prioritised restoring the default ingress-controller. The ingress-controller depends on external-dns to update the Route 53 records with the new NLB, and external-dns depends on kiam to provide AWS AssumeRole access. These three components (ingress-controller, external-dns and kiam) were restored successfully, and services started to come back up.
- Formbuilder services were still pointing to the old NLB (the network load balancer in use before the ingress-controller was replaced). The Route 53 TXT records had an incorrect owner field, so external-dns couldn't update the A records with the new NLB information. The team fixed the owner information in the TXT records, after which external-dns updated the Formbuilder Route 53 records to point to the new NLB and the Formbuilder services came back up.
- The team ran a targeted apply to restore the remaining components.
- The apply pipeline was run to restore all the certificates, servicemonitors and prometheus-rules from the environment repository.
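
The Formbuilder issue above came down to an owner mismatch in the TXT records that external-dns uses to track which DNS records it manages. As a rough illustration, the sketch below parses a TXT value in the comma-separated `heritage=...,external-dns/owner=...` form used by external-dns's TXT registry and checks the owner. The record values and owner IDs shown are made up for the example, since the incident notes do not quote the actual records, and the function names are hypothetical.

```python
def parse_registry_txt(value: str) -> dict:
    """Parse an external-dns style TXT registry value into key/value pairs."""
    fields = {}
    # TXT values are often shown wrapped in double quotes; strip them first.
    for part in value.strip().strip('"').split(","):
        key, sep, val = part.partition("=")
        if sep:  # skip fragments that are not key=value
            fields[key.strip()] = val.strip()
    return fields


def owned_by(value: str, expected_owner: str) -> bool:
    """Return True if the TXT value marks the record as owned by expected_owner."""
    fields = parse_registry_txt(value)
    return (
        fields.get("heritage") == "external-dns"
        and fields.get("external-dns/owner") == expected_owner
    )


# Illustrative values only; the owner IDs and resource path are assumptions.
example = (
    '"heritage=external-dns,external-dns/owner=old-owner,'
    'external-dns/resource=ingress/formbuilder/app"'
)
print(owned_by(example, "old-owner"))
print(owned_by(example, "new-owner"))
```

When the owner in the TXT record does not match the owner ID the running external-dns instance is configured with, external-dns treats the record as not its own and leaves the corresponding A record untouched, which matches the behaviour seen during this incident.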