Incident Log
Q3 2024 (July-September)
Mean Time to Repair: 1h 39m
Mean Time to Resolve: 2h 14m
Incident on 2024-09-20 - EKS Subnet Route Table Associations destroyed
Incident on 2024-07-25 - Elasticsearch no longer receiving logs
Q1 2024 (January-March)
Mean Time to Repair: 3h 21m
Mean Time to Resolve: 21h 20m
Q4 2023 (October-December)
Mean Time to Repair: 35h 36m
Mean Time to Resolve: 35h 36m
Incident on 2023-11-01 - Prometheus restarted several times which resulted in missing metrics
Q3 2023 (July-September)
Mean Time to Repair: 10h 55m
Mean Time to Resolve: 19h 21m
Incident on 2023-07-21 - VPC CNI not allocating IP addresses
Q2 2023 (April-June)
Mean Time to Repair: 0h 55m
Mean Time to Resolve: 0h 55m
Q1 2023 (January-March)
Mean Time to Repair: 225h 10m
Mean Time to Resolve: 225h 28m
Incident on 2023-01-11 - Cluster image pull failure due to DockerHub password rotation
Q4 2022 (October-December)
Mean Time to Repair: 27m
Mean Time to Resolve: 27m
Q3 2022 (July-September)
Mean Time to Repair: 6h 27m
Mean Time to Resolve: 6h 27m
Incident on 2022-07-11 - Slow performance for 25% of ingress traffic
Q1 2022 (January to March)
Mean Time to Repair: 1h 05m
Mean Time to Resolve: 1h 24m
Incident on 2022-01-22 - some DNS records got deleted at the weekend
Q4 2021 (October to December)
Mean Time to Repair: 1h 17m
Mean Time to Resolve: 1h 17m
Incident on 2021-11-05 - ModSec ingress controller is erroring
Q3 2021 (July-September)
Mean Time to Repair: 3h 28m
Mean Time to Resolve: 11h 4m
Incident on 2021-09-04 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN
Incident on 2021-07-12 - All ingress resources using *apps.live-1 domain names stop working
Q2 2021 (April-June)
Mean Time to Repair: 2h 32m
Mean Time to Resolve: 2h 44m
Incident on 2021-05-10 - Apply Pipeline downtime due to accidental destroy of Manager cluster
Q1 2021 (January - March)
Mean Time to Repair: N/A
Mean Time to Resolve: N/A
No incidents declared
Q4 2020 (October - December)
Mean Time to Repair: 2h 8m
Mean Time to Resolve: 8h 46m
Q3 2020 (July - September)
Mean Time To Repair: 59m
Mean Time To Resolve: 7h 13m
Incident on 2020-09-28 - Termination of nodes updating kops Instance Group
Incident on 2020-09-21 - Some cloud-platform components destroyed
Incident on 2020-09-07 - All users are unable to create new ingress rules
Incident on 2020-08-25 - Connectivity issues with eu-west-2a
Q2 2020 (April - June)
Mean Time To Repair: 2h 49m
Mean Time To Resolve: 7h 12m
Q1 2020 (January - March)
Mean Time To Repair: 1h 22m
Mean Time To Resolve: 2h 36m
About this incident log
The purpose of publishing this incident log:
- for the Cloud Platform team to learn from incidents
- for the Cloud Platform team and its stakeholders to track incident trends and performance
- because we operate in the open
Definitions:
- The words used in the timeline of an incident: fault occurs, team becomes aware (of something bad), incident declared (the team acknowledges and has an idea of the impact), repaired (system is fully functional), resolved (fully functional and future failures are prevented)
- Incident time - The start of the failure (before March 2020, it was the time the incident was declared)
- Time to Repair - The time between the incident being declared (or the team becoming aware of the fault) and service being fully restored. Only includes Hours of Support.
- Time to Resolve - The time between when the fault occurs and when the system is fully functional, including any immediate work done to prevent future failures. Only includes Hours of Support. This is a broader measure of incident response performance than Time to Repair.
Source: Atlassian
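As a rough illustration of the definitions above, the following Python sketch derives both metrics from an incident's key events and averages them over a set of incidents. The function names and the timestamps are hypothetical. It also makes a simplifying assumption: it measures plain elapsed time and does not restrict the interval to Hours of Support, which the real figures do.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # datestamp format used throughout this log


def format_duration(delta: timedelta) -> str:
    """Render a duration as 'Xh Ym', the style used in this log."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


def incident_times(fault_occurred: str, declared: str,
                   repaired: str, resolved: str) -> tuple[timedelta, timedelta]:
    """Return (time to repair, time to resolve) as elapsed time.

    Assumption: Hours of Support are ignored, so an incident that spans
    a night or a weekend would be overstated by this sketch.
    """
    def parse(text: str) -> datetime:
        return datetime.strptime(text, FMT)

    time_to_repair = parse(repaired) - parse(declared)
    time_to_resolve = parse(resolved) - parse(fault_occurred)
    return time_to_repair, time_to_resolve


def mean_duration(durations: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incident, not taken from the log.
repair, resolve = incident_times(
    fault_occurred="2024-01-10 09:30",
    declared="2024-01-10 10:00",
    repaired="2024-01-10 11:39",
    resolved="2024-01-10 12:15",
)
print(format_duration(repair))   # 1h 39m
print(format_duration(resolve))  # 2h 45m
```

Time to Resolve starts from the fault rather than the declaration, so it is never shorter than Time to Repair for the same incident, provided the fault precedes the declaration and resolution does not precede repair.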
Datestamps: please use YYYY-MM-DD HH:MM (almost ISO 8601, but more readable), for the London timezone
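For anyone processing these datestamps programmatically, a minimal Python sketch of parsing them as London local time follows. The helper name `parse_datestamp` is hypothetical, the example dates are illustrative, and the sketch assumes the standard-library `zoneinfo` module (Python 3.9+) with time zone data available on the system.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

LONDON = ZoneInfo("Europe/London")


def parse_datestamp(text: str) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM' and attach the London timezone."""
    naive = datetime.strptime(text, "%Y-%m-%d %H:%M")
    # zoneinfo resolves GMT or BST from the date itself.
    return naive.replace(tzinfo=LONDON)


summer = parse_datestamp("2024-07-25 12:10")
winter = parse_datestamp("2024-01-15 09:00")
print(summer.isoformat())  # 2024-07-25T12:10:00+01:00
print(winter.isoformat())  # 2024-01-15T09:00:00+00:00
```

Attaching the timezone matters when an incident spans a clock change, because subtracting two timezone-aware values that have first been converted to UTC gives the true elapsed time.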
Template
Incident on YYYY-MM-DD - [Brief description]
Key events
- First detected YYYY-MM-DD HH:MM
- Incident declared YYYY-MM-DD HH:MM
- Repaired YYYY-MM-DD HH:MM
- Resolved YYYY-MM-DD HH:MM
Time to repair: Xh Xm
Time to resolve: Xh Xm
Identified:
Impact:
Context:
- Timeline: [Timeline](url of google document) for the incident
- Slack thread: [Slack thread](url of primary incident thread) for the incident
Timeline:
Resolution:
Review actions: