Incident on 2022-01-22 - some DNS records got deleted at the weekend
Key events
- First detected 2022-01-22 11:57
- Incident declared 2022-01-22 14:41
- Repaired 2022-01-22 13:59
- Resolved 2022-01-22 14:38
Time to repair: 2h 2m
Time to resolve: 2h 41m
Identified: Pingdom alerted an LAA developer that some of their sites had become unavailable. They reported this to the CP team via Slack #ask-cloud-platform, and the messages were spotted by on-call engineers.
Impact:
- Sites affected:
  - 2 production sites were unavailable:
    - laa-fee-calculator-production.apps.live-1.cloud-platform.service.justice.gov.uk
    - legal-framework-api.apps.live-1.cloud-platform.service.justice.gov.uk
  - 3 production sites had minor issues - unavailable on domains that only MOJ staff use
  - 46 non-production sites were unavailable on domains that only MOJ staff use
- Impact on users was negligible. The 2 sites whose unavailability external users would have experienced are typically used by office staff for generally non-urgent work, and this incident occurred during the weekend.
Context:
- Timeline: Timeline for the incident
- Slack thread: Slack thread for the incident.
Resolution:
- external-dns was trying to restore the DNS records, but it was receiving errors when writing, due to missing annotations (external-dns.alpha.kubernetes.io/aws-weight) in an unrelated ingress. Manually adding the annotations restored the DNS.
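A check like the following could help spot ingresses that lack the annotation before external-dns hits write errors. This is only a minimal sketch: it assumes the ingress list has been exported as JSON (for example with `kubectl get ingress --all-namespaces -o json`), and it assumes every ingress in the cluster is expected to carry the `aws-weight` annotation, which may not hold for every setup. The function name and the sample data are hypothetical.

```python
import json

# Annotation named in the resolution above.
AWS_WEIGHT = "external-dns.alpha.kubernetes.io/aws-weight"


def ingresses_missing_annotation(ingress_list, annotation=AWS_WEIGHT):
    """Return "namespace/name" for each ingress that lacks the annotation."""
    missing = []
    for item in ingress_list.get("items", []):
        metadata = item.get("metadata", {})
        # "annotations" may be absent or null on an ingress with none set.
        annotations = metadata.get("annotations") or {}
        if annotation not in annotations:
            missing.append(f'{metadata.get("namespace")}/{metadata.get("name")}')
    return missing


# Hypothetical sample, shaped like `kubectl get ingress --all-namespaces -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"namespace": "team-a", "name": "web",
                "annotations": {"external-dns.alpha.kubernetes.io/aws-weight": "100"}}},
  {"metadata": {"namespace": "team-b", "name": "api", "annotations": {}}},
  {"metadata": {"namespace": "team-c", "name": "docs"}}
]}
""")

print(ingresses_missing_annotation(sample))  # ['team-b/api', 'team-c/docs']
```

In practice the JSON would come from the live cluster rather than an inline sample, and the result could feed an alert or a pre-merge check.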
Review actions:
- Create guidance about internal traffic and domain names, and advertise it to users in Slack #3497
- Create Pingdom alerts for test helloworld apps #3498
- Investigate if external-dns sync functionality is enough for the DNS cleanup #3499
- Change the ErrorsInExternalDNS alarm to high priority #3500
- Create a runbook to handle ErrorsInExternalDNS alarm #3501
- Assign someone to be the ‘hammer’ on Fridays