Incident Log

Q4 2025 (October-December)

Incident on 2025-10-13 - RDS t4g instances unavailable

Q3 2025 (July-September)

Incident on 2025-07-23 - Multiple secrets leaked to public

Q1 2025 (January-March)

Incident on 2025-03-18 - AWS EKS Upgrade to 1.30 - Descheduler attempted to reschedule all pods on cordoned nodes simultaneously
Incident on 2025-01-13 - Auth0 Terraform provider credentials exposed

Q3 2024 (July-September)

Mean Time to Repair: 1h 39m
Mean Time to Resolve: 2h 14m
Incident on 2024-09-20 - EKS Subnet Route Table Associations destroyed
Incident on 2024-07-25 - Elasticsearch no longer receiving logs

Q2 2024 (April-June)

Mean Time to Repair: 3h 21m
Mean Time to Resolve: 21h 20m
Incident on 2024-04-15 - Prometheus restarted during WAL reload several times which resulted in missing metrics

Q4 2023 (October-December)

Mean Time to Repair: 35h 36m
Mean Time to Resolve: 35h 36m
Incident on 2023-11-01 - Prometheus restarted several times which resulted in missing metrics

Q3 2023 (July-September)

Mean Time to Repair: 10h 55m
Mean Time to Resolve: 19h 21m
Incident on 2023-09-18 - Lack of Disk space on nodes
Incident on 2023-08-04 - Dropped logging in Kibana
Incident on 2023-07-25 - Prometheus on live cluster DOWN
Incident on 2023-07-21 - VPC CNI not allocating IP addresses

Q2 2023 (April-June)

Mean Time to Repair: 0h 55m
Mean Time to Resolve: 0h 55m
Incident on 2023-06-06 - User services down

Q1 2023 (January-March)

Mean Time to Repair: 225h 10m
Mean Time to Resolve: 225h 28m
Incident on 2023-02-02 - CJS Dashboard Performance
Incident on 2023-01-11 - Cluster image pull failure due to DockerHub password rotation
Incident on 2023-01-05 - CircleCI Security Incident

Q4 2022 (October-December)

Mean Time to Repair: 27m
Mean Time to Resolve: 27m
Incident on 2022-11-15 - Prometheus eks-live DOWN

Q3 2022 (July-September)

Mean Time to Repair: 6h 27m
Mean Time to Resolve: 6h 27m
Incident on 2022-07-11 - Slow performance for 25% of ingress traffic

Q1 2022 (January-March)

Mean Time to Repair: 1h 05m
Mean Time to Resolve: 1h 24m
Incident on 2022-03-10 - All ingress resources using *.apps.live.cloud-platform urls showing certificate issue
Incident on 2022-01-22 - Some DNS records got deleted at the weekend

Q4 2021 (October-December)

Mean Time to Repair: 1h 17m
Mean Time to Resolve: 1h 17m
Incident on 2021-11-05 - ModSec ingress controller is erroring

Q3 2021 (July-September)

Mean Time to Repair: 3h 28m
Mean Time to Resolve: 11h 4m
Incident on 2021-09-30 - SSL Certificate Issue in browsers
Incident on 2021-09-04 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN
Incident on 2021-07-12 - All ingress resources using *apps.live-1 domain names stop working

Q2 2021 (April-June)

Mean Time to Repair: 2h 32m
Mean Time to Resolve: 2h 44m
Incident on 2021-06-09 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade
Incident on 2021-05-10 - Apply Pipeline downtime due to accidental destroy of Manager cluster

Q1 2021 (January-March)

Mean Time to Repair: N/A
Mean Time to Resolve: N/A

No incidents declared

Q4 2020 (October-December)

Mean Time to Repair: 2h 8m
Mean Time to Resolve: 8h 46m
Incident on 2020-10-06 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers

Q3 2020 (July-September)

Mean Time to Repair: 59m
Mean Time to Resolve: 7h 13m
Incident on 2020-09-28 - Termination of nodes updating kops Instance Group
Incident on 2020-09-21 - Some cloud-platform components destroyed
Incident on 2020-09-07 - All users are unable to create new ingress rules
Incident on 2020-08-25 - Connectivity issues with eu-west-2a
Incident on 2020-08-14 - Ingress-controllers crashlooping
Incident on 2020-08-07 - Master node provisioning failure

Q2 2020 (April-June)

Mean Time to Repair: 2h 49m
Mean Time to Resolve: 7h 12m
Incident on 2020-08-04
Incident on 2020-04-15 - Nginx/TLS

Q1 2020 (January-March)

Mean Time to Repair: 1h 22m
Mean Time to Resolve: 2h 36m
Incident on 2020-02-25
Incident on 2020-02-18
Incident on 2020-02-12

About this incident log

The purpose of publishing this incident log:

- for the Cloud Platform team to learn from incidents
- for the Cloud Platform team and its stakeholders to track incident trends and performance
- because we operate in the open

Definitions:

- The words used in the timeline of an incident: fault occurs, team becomes aware (of something bad), incident declared (the team acknowledges the incident and has an idea of the impact), repaired (the system is fully functional), resolved (the system is fully functional and future failures are prevented)
- Incident time - The start of the failure (before March 2020, it was the time the incident was declared)
- Time to Repair - The time between the incident being declared (or when the team became aware of the fault) and when service is fully restored. Only includes Hours of Support.
- Time to Resolve - The time between when the fault occurs and when the system is fully functional (including any immediate work done to prevent future failures). Only includes Hours of Support. This is a broader metric of incident response performance than Time to Repair.

Source: Atlassian

Datestamps: please use YYYY-MM-DD HH:MM (almost ISO 8601, but more readable), for the London timezone
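
As a rough illustration of the definitions and datestamp format above, the following Python sketch computes Time to Repair and Time to Resolve from a set of key events and averages durations for a quarter. It is not the team's actual tooling: the function names, the dictionary keys and the example incident are all hypothetical. It also makes two simplifying assumptions: the datestamps are treated as plain London local times (so an interval that spans a clock change would be out by an hour), and the "Only includes Hours of Support" rule is not applied, because the support hours are not defined in this log.

```python
from datetime import datetime, timedelta

# The log's datestamp format: YYYY-MM-DD HH:MM (London local time).
STAMP_FORMAT = "%Y-%m-%d %H:%M"


def parse_stamp(text: str) -> datetime:
    """Parse a datestamp such as '2025-01-06 10:05' into a naive datetime."""
    return datetime.strptime(text, STAMP_FORMAT)


def elapsed(start: str, end: str) -> timedelta:
    """Wall-clock time between two datestamps (no Hours of Support clipping)."""
    delta = parse_stamp(end) - parse_stamp(start)
    if delta < timedelta(0):
        raise ValueError(f"end {end!r} is before start {start!r}")
    return delta


def format_duration(delta: timedelta) -> str:
    """Render a duration in the 'Xh Xm' style used in this log."""
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes}m"


def mean_duration(deltas: list[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty list of durations."""
    return sum(deltas, timedelta(0)) / len(deltas)


# Hypothetical incident, invented purely for illustration.
incident = {
    "fault": "2025-01-06 09:40",
    "declared": "2025-01-06 10:05",
    "repaired": "2025-01-06 11:35",
    "resolved": "2025-01-06 14:20",
}

# Time to Repair: incident declared -> service fully restored.
time_to_repair = elapsed(incident["declared"], incident["repaired"])
# Time to Resolve: fault occurs -> fully functional and future failures prevented.
time_to_resolve = elapsed(incident["fault"], incident["resolved"])

print("Time to repair:", format_duration(time_to_repair))    # 1h 30m
print("Time to resolve:", format_duration(time_to_resolve))  # 4h 40m
```

A quarterly mean would then be `format_duration(mean_duration([...]))` over the per-incident durations. Applying the Hours of Support rule would mean clipping each interval to the supported hours before summing, which this sketch leaves out.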

Template

Incident on YYYY-MM-DD - [Brief description]

Key events
- First detected YYYY-MM-DD HH:MM
- Incident declared YYYY-MM-DD HH:MM
- Repaired YYYY-MM-DD HH:MM
- Resolved YYYY-MM-DD HH:MM
Time to repair: Xh Xm
Time to resolve: Xh Xm
Identified:
Impact:
Context:
- Timeline: [Timeline](url of google document) for the incident
- Slack thread: [Slack thread](url of primary incident thread) for the incident.
Resolution:
Review actions: