
Incident Log

Use the mean-time-to-repair Go script to view performance metrics
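
The script itself is not reproduced here; as a rough illustration of the calculation it reports (this is a sketch, not the actual script), a minimal Go example that averages the per-incident repair times recorded in this log for Q3 2024:

```go
package main

import (
	"fmt"
	"time"
)

// meanDuration returns the arithmetic mean of a set of incident durations.
func meanDuration(durations []time.Duration) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	var total time.Duration
	for _, d := range durations {
		total += d
	}
	return total / time.Duration(len(durations))
}

func main() {
	// Per-incident "time to repair" figures for Q3 2024, taken from this log.
	timeToRepair := []time.Duration{
		11 * time.Minute,            // 2024-09-20 incident
		3*time.Hour + 8*time.Minute, // 2024-07-25 incident
	}
	mttr := meanDuration(timeToRepair).Truncate(time.Minute)
	fmt.Println("Mean Time to Repair:", mttr) // 1h39m0s
}
```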


Q3 2024 (July-September)

  • Mean Time to Repair: 1h 39m
  • Mean Time to Resolve: 2h 14m

Incident on 2024-09-20 - EKS Subnet Route Table Associations destroyed

  • Key events

    • First detected: 2024-09-20 11:24
    • Incident declared: 2024-09-20 11:30
    • Repaired: 2024-09-20 11:33
    • Resolved: 2024-09-20 11:40
  • Time to repair: 11m

  • Time to resolve: 20m

  • Identified: High priority pingdom alerts for live cluster services and users reporting that services could not be resolved.

  • Impact: Cloud Platform services were not available for a period of time.

  • Context:

    • 2024-09-20 11:21: infrastructure-vpc-live-1 pipeline unpaused
    • 2024-09-20 11:22: EKS subnet route table associations are destroyed by a queued PR in the infrastructure pipeline
    • 2024-09-20 11:24: Cloud platform team alerted via High priority alarm
    • 2024-09-20 11:26: teams begin reporting in #ask channel that services are unavailable
    • 2024-09-20 11:32: CP team re-run local terraform apply to rebuild route table associations
    • 2024-09-20 11:33: CP team communicate to users that service availability is restored
    • 2024-09-20 11:40: Incident declared as resolved
  • Resolution:

    • Cloud Platform infrastructure pipelines had been paused for an extended period of time in order to carry out required manual updates to Terraform remote state. Upon resuming the infrastructure pipeline, a PR which had not been identified by the team during this time was queued up to run. This PR executed automatically and destroyed subnet route table configurations, disabling internet routing to Cloud Platform services.
    • Route table associations were rebuilt by running Terraform apply manually, restoring service availability.
  • Review actions:

    • Review and update the process for pausing and resuming infrastructure pipelines to ensure that all team members are aware of the implications of doing so.
    • Investigate options for suspending the execution of queued PRs during periods of ongoing manual updates to infrastructure.
    • Investigate options for improving isolation of infrastructure plan and apply pipeline tasks.

Incident on 2024-07-25

  • Key events

    • First detected: 2024-07-25 12:10
    • Incident declared: 2024-07-25 14:54
    • Repaired: 2024-07-25 15:18
    • Resolved 2024-07-25 16:19
  • Time to repair: 3h 8m

  • Time to resolve: 4h 9m

  • Identified: User reported that Elasticsearch was no longer receiving logs

  • Impact: Elasticsearch and Opensearch did not receive logs, which meant that we lost users' logs for the period of the incident. These logs have not been recovered.

  • Context:

    • 2024-07-25 12:10: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 12:30: cp-live-app-logs - ClusterIndexWritesBlocked recovers
    • 2024-07-25 12:35: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 12:50: cp-live-app-logs - ClusterIndexWritesBlocked recovers
    • 2024-07-25 12:55: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 13:15: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
    • 2024-07-25 13:40: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
    • 2024-07-25 13:45: Kibana no longer receiving any logs
    • 2024-07-25 14:27: User notifies team via #ask-cloud-platform that Kibana has not been receiving logs since 13:45.
    • 2024-07-25 14:32: Initial investigation shows no problems in live monitoring namespace
    • 2024-07-25 14:42: Google meet call started to triage
    • 2024-07-25 14:54: Incident declared
    • 2024-07-25 14:55: Logs from fluent-bit containers show “could not enqueue into the ring buffer”
    • 2024-07-25 14:59: rollout restart of all fluent-bit containers, logs partially start flowing but after a few minutes show the same error message
    • 2024-07-25 15:18: It is noted that Opensearch is out of disk space; this is increased from 8000 to 12000
    • 2024-07-25 15:58: Disk space increase is complete and we start seeing fluent-bit processing logs
    • 2024-07-25 16:15: Remediation tasks are defined and started to action
    • 2024-07-25 16:19: Incident declared resolved
  • Resolution:

    • Opensearch disk space is increased from 8000 to 12000
    • Fluentbit is configured to not log to Opensearch as a temporary measure whilst follow-up investigation work into root cause is carried out.
  • Review actions:


Q1 2024 (January-April)

  • Mean Time to Repair: 3h 21m

  • Mean Time to Resolve: 21h 20m

Incident on 2024-04-15 - Prometheus restarted during WAL reload several times which resulted in missing metrics

  • Key events

    • First detected: 2024-04-15 12:32
    • Incident declared: 2024-04-15 14:43
    • Repaired: 2024-04-15 15:53
    • Resolved 2024-04-18 16:13
  • Time to repair: 3h 21m

  • Time to resolve: 21h 20m

  • Identified: Team observed that the prometheus pod was restarted several times after a planned prometheus change

  • Impact: Prometheus was not available. The Cloud Platform lost monitoring for a period of time.

  • Context:

    • 2024-04-15 12:32: Prometheus was not available after a planned change
    • 2024-04-15 12:52: Found that the WAL reload was not completing before a restart was triggered
    • 2024-04-15 12:57: Planned change reverted to rule it out as the root cause, but that did not help
    • 2024-04-15 13:00: Update sent to users about the issue with Prometheus
    • 2024-04-15 13:46: Debugging the logs showed a startupProbe failed event
    • 2024-04-15 15:21: Increased the startupProbe to a higher value of 30 mins (the default is 15 mins)
    • 2024-04-15 15:53: Applied the change to increase the startupProbe; Prometheus became available. Incident repaired
    • 2024-04-15 16:00: Users updated with the Prometheus Status
    • 2024-04-18 16:13: Team identified the reason for the longer WAL reload and recorded findings, Incident Resolved.
  • Resolution:

    • During the planned restart, the Prometheus WAL count was higher than usual, so the WAL reload took longer than the default startupProbe allowed
    • Increasing the startupProbe threshold allowed the WAL reload to complete (see the probe-budget sketch after this incident's review actions)
  • Review actions:

    • Team discussed performing planned Prometheus restarts when the WAL count is lower, to reduce the restart time
    • The default CPU and Memory requests were set to meet the maximum usage
    • Create a test setup to recreate live WAL count
    • Explore memory-snapshot-on-shutdown and auto-gomaxprocs feature flag options
    • Explore remote storage of WAL files to a different location
    • Look into creating a blue-green prometheus to have live like setup to test changes before applying to live
    • Spike into Amazon Managed Prometheus
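
The fix above relies on the Kubernetes rule that a container gets failureThreshold × periodSeconds to start before its startup probe gives up and the kubelet restarts it. A minimal Go sketch of that arithmetic, using illustrative threshold/period values (assumed for this example; the incident notes only record the resulting budgets of 15 and 30 minutes):

```go
package main

import (
	"fmt"
	"time"
)

// startupBudget returns how long Kubernetes waits for a container to start:
// failureThreshold consecutive probe failures, one every periodSeconds,
// before the kubelet restarts the container.
func startupBudget(failureThreshold, periodSeconds int) time.Duration {
	return time.Duration(failureThreshold*periodSeconds) * time.Second
}

func main() {
	// Illustrative probe settings only; the incident notes record the budgets
	// ("15 mins" default, "30 mins" after the fix), not these exact values.
	fmt.Println("default budget:  ", startupBudget(60, 15))  // 15m0s
	fmt.Println("increased budget:", startupBudget(120, 15)) // 30m0s
}
```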

Q4 2023 (October-December)

  • Mean Time to Repair: 35h 36m

  • Mean Time to Resolve: 35h 36m

Incident on 2023-11-01 10:41 - Prometheus restarted several times which resulted in missing metrics

  • Key events

    • First detected: 2023-11-01 10:15
    • Incident declared: 2023-11-01 10:41
    • Repaired: 2023-11-03 14:38
    • Resolved 2023-11-03 14:38
  • Time to repair: 35h 36m

  • Time to resolve: 35h 36m

  • Identified: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN

  • Impact: Prometheus was not available. The Cloud Platform lost monitoring for a period of time.

  • Context:

    • 2023-11-01 10:15: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN. Team acknowledged and checked the state of the Prometheus server.
    • 2023-11-01 10:41: PagerDuty for Prometheus alerted a third time in a row within a few minutes. Incident declared
    • 2023-11-01 10:41: Prometheus pod has restarted and the prometheus container is starting
    • 2023-11-01 10:41: Prometheus logs show numerous "Evaluation rule failed" errors
    • 2023-11-01 10:41: Events in the monitoring namespace recorded Readiness Probe failures for Prometheus
    • 2023-11-01 12:35: Team enabled debug log level for prometheus to understand the issue
    • 2023-11-03 16:01: After investigating the logs, the team found that one possible root cause was the readiness probe failing prior to the restart of prometheus, so the readiness probe timeout was increased
    • 2023-11-03 16:01: Incident repaired and resolved.
  • Resolution:

    • Team identified that the readiness probe was failing, causing prometheus to be restarted.
    • Increased the readiness probe timeout from 3 to 6 seconds to avoid the restart of prometheus
  • Review actions:

    • Team discussed inspecting more closely and trying to identify these kinds of failures earlier
    • Investigate whether the ingestion of data into the database is too big or takes too long
    • Investigate whether executing some queries makes prometheus work harder and stop responding to the readiness probe
    • Investigate whether any other service probing prometheus triggers the restart
    • Investigate whether taking regular velero backups disturbs the EBS read/write and causes the restart

Q3 2023 (July-September)

  • Mean Time to Repair: 10h 55m

  • Mean Time to Resolve: 19h 21m

Incident on 2023-09-18 15:12 - Lack of Disk space on nodes

  • Key events

    • First detected: 2023-09-18 13:42
    • Incident declared: 2023-09-18 15:12
    • Repaired: 2023-09-18 17:54
    • Resolved 2023-09-20 19:18
  • Time to repair: 4h 12m

  • Time to resolve: 35h 36m

  • Identified: User reported that they were seeing ImagePull errors: "no space left on device"

  • Impact: Several nodes are experiencing a lack of disk space within the cluster. The deployments might not be scheduled consistently and may fail.

  • Context:

    • 2023-09-18 13:42 Team noticed RootVolUtilisation-Critical in High-priority-alert channel
    • 2023-09-18 14:03 User reported that they are seeing ImagePull errors no space left on device error
    • 2023-09-18 14:27 Team were doing the EKS module upgrade to version 18 and draining the nodes. They were seeing numerous pods in Evicted and ContainerStateUnknown states
    • 2023-09-18 15:12 Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1695046332665969
    • 2023-09-18 15:26 Compared the disk size allocated in the old and new nodes and identified that the new node was allocated only 20Gb of disk space
    • 2023-09-18 15:34 Old default node group uncordoned
    • 2023-09-18 15:35 New nodes drain started to shift workload back to old nodegroup
    • 2023-09-18 17:54 Incident repaired
    • 2023-09-19 10:30 Team started validating the fix and understanding the launch_template changes
    • 2023-09-20 10:00 Team updated the fix on manager and later on live cluster
    • 2023-09-20 12:30 Started draining the old node group
    • 2023-09-20 15:04 There was some increased pod state of “ContainerCreating”
    • 2023-09-20 15:25 There was an increased number of "failed to assign an IP address to container" ENI errors. The CNI logs showed "Unable to get IP address from CIDR: no free IP available in the prefix". Understood that this might be because of IP prefix starvation, with some prefixes freed as the old nodes were drained.
    • 2023-09-20 19:18 All nodes drained and no pods in an errored state. The initial disk space issue is resolved
  • Resolution:

    • Team identified that the disk space was reduced from 100Gb to 20Gb as part of EKS Module version 18 change
    • Identified the code changes to launch template and applied the fix
  • Review actions:

    • Update runbook to compare launch template changes during EKS module upgrade
    • Create Test setup to pull images similar to live with different sizes
    • Update RootVolUtilisation alert runbook to check disk space config
    • Scale coreDNS dynamically based on the number of nodes
    • Investigate if we can use ipv6 to solve the IP Prefix starvation problem
    • Add drift testing to identify when a terraform plan shows a change to the launch template
    • Setup logging to view cni and ipamd logs and setup alerts to notify when there are errors related to IP Prefix starvation

Incident on 2023-08-04 10:09 - Dropped logging in kibana

  • Key events

    • First detected: 2023-08-04 09:14
    • Incident declared: 2023-08-04 10:09
    • Repaired: 2023-08-10 12:28
    • Resolved 2023-08-10 14:47
  • Time to repair: 33h 14m

  • Time to resolve: 35h 33m

  • Identified: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.

  • Impact: The Cloud Platform lost application logs for a period of time.

  • Context:

    • 2023-08-04 09:14: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.
    • 2023-08-04 10:03: Cloud Platform team started investigating the issue and restarted the fluent-bit pods
    • 2023-08-04 10:09: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1691140153374179
    • 2023-08-04 12:03: Identified that the newer version of fluent-bit has changes to the chunk drop strategy
    • 2023-08-04 16:00: Team bumped the fluent-bit version to see if there were any improvements
    • 2023-08-07 10:30: Team regrouped and discussed troubleshooting steps
    • 2023-08-07 12:05: Increased the fluent-bit memory buffer
    • 2023-08-08 16:10: Implemented a fix to handle memory buffer overflow
    • 2023-08-09 09:00: Merged the fix and deployed in Live
    • 2023-08-10 11:42: Implemented a change to flush logs in smaller chunks
    • 2023-08-10 12:28: Incident repaired
    • 2023-08-10 14:47: Incident resolved
  • Resolution:

    • Team identified that the latest version of fluent-bit has changes to the chunk drop strategy
    • Implemented a fix to handle memory buffer overflow by writing to the filesystem and flushing logs in smaller chunks
  • Review actions:

    • Push notifications from logging clusters to #lower-priority-alerts #4704
    • Add integration test to check that logs are being sent to the logging cluster

Incident on 2023-07-25 15:21 - Prometheus on live cluster DOWN

  • Key events

    • First detected: 2023-07-25 14:05
    • Incident declared: 2023-07-25 15:21
    • Repaired: 2023-07-25 15:55
    • Resolved 2023-07-25 15:55
  • Time to repair: 1h 50m

  • Time to resolve: 1h 50m

  • Identified: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN

  • Impact: Prometheus was not available. The Cloud Platform lost monitoring for a period of time.

  • Context:

    • 2023-07-25 14:05 - PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN. Team acknowledged and checked the state of the Prometheus server. Prometheus errored for Rule evaluation and Exit code 137
    • 2023-07-25 14:09: Prometheus pod is in terminating state
    • 2023-07-25 14:17: The node where prometheus is running went to Not Ready state
    • 2023-07-25 14:22: Drained the monitoring node, which moved prometheus to another monitoring node
    • 2023-07-25 14:56: After moving to the new node, prometheus restarted just after coming back up and put the node into Node Ready state
    • 2023-07-25 15:11: Comms went out to #cloud-platform-update that Prometheus was DOWN
    • 2023-07-25 15:20: Team found that the node memory is spiking to 89% and decided to go for a bigger instance size
    • 2023-07-25 15:21: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1690294889724869
    • 2023-07-25 15:31: Changed the instance size to r6i.4xlarge
    • 2023-07-25 15:50: Prometheus still restarted after running. Team found the recent prometheus pod was terminated with OOMKilled. Increased the memory limit to 100Gi
    • 2023-07-25 16:18: Updated the prometheus container limits to 12 CPU cores and 110Gi memory to accommodate the resource needs of prometheus
    • 2023-07-25 16:18: Incident repaired
    • 2023-07-25 16:18: Incident resolved
  • Resolution:

    • Due to the increased number of namespaces and prometheus rules, the prometheus server needed more memory. The instance size was not enough to keep prometheus running.
    • Updating the node type to double the CPU and memory, and increasing the container resource limits of the prometheus server, resolved the issue
  • Review actions:

    • Add alert to monitor the node memory usage and if a pod is using up most of the node memory #4538

Incident on 2023-07-21 09:31 - VPC CNI not allocating IP addresses

  • Key events

    • First detected: 2023-07-21 08:15
    • Incident declared: 2023-07-21 09:31
    • Repaired: 2023-07-21 12:42
    • Resolved 2023-07-21 12:42
  • Time to repair: 4h 27m

  • Time to resolve: 4h 27m

  • Identified: User reported seeing issues with new deployments in #ask-cloud-platform

  • Impact: The service availability for CP applications may be degraded/at increased risk of failure.

  • Context:

    • 2023-07-21 08:15 - User reported seeing issues with new deployments (stuck in ContainerCreating)
    • 2023-07-21 09:00 - Team started to put together a list of all affected namespaces
    • 2023-07-21 09:31 - Incident declared
    • 2023-07-21 09:45 - Team identified that the issue affected 6 nodes, added new nodes, and began to cordon/drain the affected nodes
    • 2023-07-21 12:35 - Compared CNI settings on a 1.23 test cluster with live and found a setting was different
    • 2023-07-21 12:42 - Applied the setting to enable Prefix Delegation on the live cluster
    • 2023-07-21 12:42 - Incident repaired
    • 2023-07-21 12:42 - Incident resolved
  • Resolution:

    • The issue was caused by a missing setting on the live cluster. The team added the setting to the live cluster and the issue was resolved
  • Review actions:

    • Add a test/check to ensure the IP address allocation is working as expected #4669

Q2 2023 (April-June)

  • Mean Time to Repair: 0h 55m

  • Mean Time to Resolve: 0h 55m

Incident on 2023-06-06 11:00 - User services down

  • Key events

    • First detected: 2023-06-06 10:26
    • Incident declared: 2023-06-06 11:00
    • Repaired: 2023-06-06 11:21
    • Resolved 2023-06-06 11:21
  • Time to repair: 0h 55m

  • Time to resolve: 0h 55m

  • Identified: Several users reported that their production pods were deleted all at once, and that they were receiving Pingdom alerts that their applications were down for a few minutes

  • Impact: User services were down for a few minutes

  • Context:

    • 2023-06-06 10:23 - User reported that their production pods are deleted all at once
    • 2023-06-06 10:30 - Users reported that their services were back up and running.
    • 2023-06-06 10:30 - Team found that the nodes were being recycled all at the same time during the node instance type change
    • 2023-06-06 10:50 - User reported that the DPS service was down because they could not authenticate into the service
    • 2023-06-06 11:00 - Incident declared
    • 2023-06-06 11:21 - User reported that the DPS service is back up and running
    • 2023-06-06 11:21 - Incident repaired
    • 2023-06-06 13:11 - Incident resolved
  • Resolution:

    • When the node instance type is changed, the nodes are recycled all at the same time. This caused the pods to be deleted all at once.
    • Raised a ticket with AWS asking for the steps to update the node instance type without causing an outage to services.
    • The instance type update is performed through terraform, so the team will have to come up with a plan and update the runbook to perform these changes without downtime.
  • Review actions:

    • Add a runbook for the steps to perform when changing the node instance type

Q1 2023 (January-March)

  • Mean Time to Repair: 225h 10m

  • Mean Time to Resolve: 225h 28m

Incident on 2023-02-02 10:21 - CJS Dashboard Performance

  • Key events

    • First detected: 2023-02-02 10:14
    • Incident declared: 2023-02-02 10:20
    • Repaired: 2023-02-02 10:20
    • Resolved 2023-02-02 11:36
  • Time to repair: 0h 30m

  • Time to resolve: 1h 22m

  • Identified: CPU-Critical alert

  • Impact: Cluster is reaching max capacity. Multiple services might be affected.

  • Context:

    • 2023-02-02 10:14: CPU-Critical alert
    • 2023-02-02 10:21: Cloud Platform team, supporting the CJS deployment, noticed that the CJS team had increased the pod count and requested more resources, causing the CPU critical alert.
    • 2023-02-02 10:21 Incident is declared.
    • 2023-02-02 10:22 War room started.
    • 2023-02-02 10:25 Cloud Platform noticed that the CJS team have 100 replicas for their deployment and many CJS pods started crash looping. This is due to the Descheduler service's RemoveDuplicates strategy plugin, which makes sure that only one pod associated with a ReplicaSet runs on the same node; if there are more, those duplicate pods are evicted for better spreading of pods in a cluster.
    • The live cluster has 60 nodes as desired capacity. As CJS have 100 replicas for their deployment, Descheduler started terminating the duplicate CJS pods scheduled on the same node. The restart of multiple CJS pods caused the CPU hike.
    • 2023-02-02 10:30 Cloud Platform team scaled down Descheduler to stop terminating CJS pods.
    • 2023-02-02 10:37 CJS Dash team planned to roll back a caching change they made around 10am that appears to have generated the spike.
    • 2023-02-02 10:38 Decision made to increase the node count from 60 to 80, to support the CJS team with more pods and resources.
    • 2023-02-02 10:40 Autoscaling group bumped up to 80 - to resolve the CPU critical. Descheduler is scaled down to 0 to accommodate multiple pods on a node.
    • 2023-02-02 10:44 Resolved status for CPU-Critical high-priority alert.
    • 2023-02-02 11:30 Performance has steadied.
    • 2023-02-02 11:36 Incident is resolved.
  • Resolution:

    • Cloud Platform team scaled down Descheduler to let the CJS team have 100 replicas in their deployment.
    • CJS Dash team rolled back a change that appears to have generated the spike.
    • Cloud Platform team increased the desired node count to 80.
  • Review actions:

    • Create an OPA policy to not allow deployment ReplicaSet greater than an agreed number by the cloud-platform team.
    • Update the user guide to mention the OPA policy.
    • Update the user guide to ask teams to speak to the Cloud Platform team before applying deployments which need large resources (pod count, memory and CPU), so the Cloud Platform team is aware and can provide the necessary support.

Incident on 2023-01-11 14:22 - Cluster image pull failure due to DockerHub password rotation

  • Key events

    • First detected: 2023-01-11 14:22
    • Incident declared: 2023-01-11 15:17
    • Repaired: 2023-01-11 15:50
    • Resolved 2023-01-11 15:51
  • Time to repair: 1h 28m

  • Time to resolve: 1h 29m

  • Identified: Cloud Platform team member observed failed DockerHub login attempts error at 2023-01-11 14:22:

failed to fetch manifest: Head "https://registry-1.docker.io/v2/ministryofjustice/cloud-platform-tools/manifests/2.1": toomanyrequests: too many failed login attempts for username or IP address
  • Impact: Concourse and EKS cluster nodes unable to pull images from DockerHub for 1h 28m. ErrImagePull error reported by one user in #ask-cloud-platform at 2023-01-11 14:54.

  • Context:

    • 2023-01-11 14:22: Cloud Platform team member observed failed DockerHub login attempts error:
failed to fetch manifest: Head "https://registry-1.docker.io/v2/ministryofjustice/cloud-platform-tools/manifests/2.1": toomanyrequests: too many failed login attempts for username or IP address
    • 2023-01-11 14:34: Discovered that cluster DockerHub passwords do not match the value stored in LastPass.
    • 2023-01-11 14:40: Concourse DockerHub password updated in cloud-platform-infrastructure terraform.tfvars repository.
    • 2023-01-11 14:51: Explanation revealed. DockerHub password was changed as part of LastPass remediation activities.
    • 2023-01-11 14:52: Kuberhealthy daemonset check reveals the cluster is also unable to pull images https://mojdt.slack.com/archives/C8QR5FQRX/p1673448593904699 with error: Check execution error: kuberhealthy/daemonset: error when waiting for pod to start: ErrImagePull
    • 2023-01-11 14:53: dockerconfig node update requirement identified
    • 2023-01-11 14:54: User reports ErrImagePull when creating port-forward pods, affecting at least two namespaces.
    • 2023-01-11 14:56: EKS cluster DockerHub password updated in cloud-platform-infrastructure
    • 2023-01-11 15:01: Concourse plan of the password update reveals the launch-template will be updated, suggesting a node recycle.
    • 2023-01-11 15:02: Decision made to update the password in the live-2 cluster to determine whether a node recycle will be required
    • 2023-01-11 15:11: Comms distributed in #cloud-platform-update and #ask-cloud-platform.
    • 2023-01-11 15:17: Incident is declared.
    • 2023-01-11 15:17: J Birchall assumes incident lead and scribe roles.
    • 2023-01-11 15:19: War room started
    • 2023-01-11 15:28: Confirmation that the password update will force node recycles across the live & manager clusters.
    • 2023-01-11 15:36: Decision made to restore the previous DockerHub password, to allow the team to manage a clean rotation OOH.
    • 2023-01-11 15:40: DockerHub password changed back to the previous value.
    • 2023-01-11 15:46: Check-in with the reporting user that their pod is now deploying - answer is yes.
    • 2023-01-11 15:50: Cluster image pulling observed to be working again.
    • 2023-01-11 15:51: Incident is resolved.
    • 2023-01-11 15:51: Noted that live-2 is now set with an invalid dockerconfig; no impact on users.
    • 2023-01-11 16:50: Comms distributed in #cloud-platform-update.

  • Resolution: DockerHub password was restored back to the value used by EKS cluster nodes & Concourse, to allow an update and graceful recycle of nodes OOH.

  • Review actions: As part of remediation, we have switched from a DockerHub username and password to a DockerHub token created specifically for Cloud Platform. (Done)

Incident on 2023-01-05 08:56 - CircleCI Security Incident

  • Key events

    • First detected 2023-01-04 (Time TBC)
    • Incident declared: 2023-01-05 08:56
    • Repaired 2023-02-01 10:30
    • Resolved 2023-02-01 10:30
  • Time to repair: 673h 34m

  • Time to resolve: 673h 34m

  • Identified: CircleCI announced a security alert on 4th January 2023. Their advice was for any and all secrets stored in CircleCI to be rotated immediately as a cautionary measure.

  • Impact: Exposure of secrets stored within CircleCI for running various services associated with applications running on the Cloud Platform.

  • Context: Users of the Cloud Platform use CircleCI for CI/CD including deployments into the Cloud Platform. Access for CircleCI into the Cloud Platform is granted by generating a namespace enclosed service-account with required permission set by individual teams/users. As all service-account access/permissions were set based on user need, some service-accounts had access to all stored secrets within the namespace it was created in. As part of our preliminary investigation, it was also discovered service-accounts were shared between namespaces which exposed this incident wider than first anticipated. We made the decision that we need to rotate any and all secrets used within the cluster.

  • Resolution: Due to the unknown nature of some of the secrets that may have been exposed, a prioritised, phased approach was created:

    • Phase 1: Rotate the secret access key for all service-accounts named “circle-*”; rotate the secret access key for all other service-accounts; rotate all IRSA service-accounts
    • Phase 2: Rotate all AWS keys within namespaces which had a CircleCI service-account
    • Phase 3: Rotate all AWS keys within all other namespaces not in Phase 2
    • Phase 4: Create and publish guidance for users to rotate all other secrets within namespaces and AWS keys generated via a Cloud Platform Module
    • Phase 5: Clean up any other IAM/Access keys not managed via code within the AWS account.

Full detailed breakdown of events can be found in the postmortem notes.

  • Review actions:
    • Implement Trivy scanning for container vulnerability (Done)
    • Implement Secrets Manager
    • Propose more code to be managed in cloud-platform-environments repository
    • Look into a Terraform resource for CircleCI
    • Use IRSA instead of AWS Keys

Q4 2022 (October-December)

  • Mean Time to Repair: 27m

  • Mean Time to Resolve: 27m

Incident on 2022-11-15 16:03 - Prometheus eks-live DOWN.

  • Key events

    • First detected 2022-11-15 16:03
    • Incident declared: 2022-11-15 16:05
    • Repaired 2022-11-15 16:30
    • Resolved 2022-11-15 16:30
  • Time to repair: 27m

  • Time to resolve: 27m

  • Identified: High Priority Alarms - #347423 Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN / Resolved: #347424 Pingdom check cloud-platform monitoring Prometheus eks-live is DOWN.

  • Impact: Prometheus was unavailable for 27 minutes. It was not reported by any users in the #ask-cloud-platform Slack channel.

  • Context:

    • On the 1st of November at 14:49, AWS notifications sent an email advising that instance i-087e420c573463c08 (prometheus-operator) would be retired on the 15th of November 2022 at 16:00
    • On the 15th of November 2022, work was being carried out on a Kubernetes upgrade on the “manager” cluster. Cloud Platform advised in Slack in the morning that the instance on “manager” would be retired that very afternoon. It was therefore thought that this would have little impact on the upgrade work. However, the instance was in fact on the “live” cluster - not “manager”
    • The instance was retired by AWS at 16:00; Prometheus went down at approx 16:03.
    • Because the node was killed by AWS, and not gracefully by us, it got stuck: the EKS node stayed in a “not ready” status and the pod stayed as “terminated”
    • Users were notified in the “ask-cloud-platform” Slack channel at approx 16:25, once it was determined that it was NOT to do with the Kubernetes upgrade work on “manager” and would therefore indeed be having an impact on the live system.
  • Resolution:

    • The pod was killed by us at approx 16:12, which therefore made the node go too.
  • Review actions:

    • If we had picked up on this retirement in “live”, we could have recycled the node gracefully (cordon, drain and kill first), possibly straight away on the 1st of November (well in advance).
    • Therefore we need to find a way of not having these notifications buried in our email inbox.
    • First course of action: ask AWS if there is a recommended alternative way of sending these notifications to our Slack channel (an alert), be this by SNS to Slack or some other method
    • AWS Support Case ID 11297456601 raised
    • AWS advice received - ticket raised to investigate potential solutions: implementation of notifications of Scheduled Instance Retirements to Slack. Investigate 2 potential AWS solutions #4264.

Q3 2022 (July-September)

  • Mean Time to Repair: 6h 27m

  • Mean Time to Resolve: 6h 27m

Incident on 2022-07-11 09:33 - Slow performance for 25% of ingress traffic

  • Key events

    • First detected 2022-07-11 09:33
    • Incident declared 2022-07-11 10:11
    • Repaired 2022-07-11 16:07
    • Resolved 2022-07-11 16:07
  • Time to repair: 6h 27m

  • Time to resolve: 6h 27m

  • Identified: Users reported in #ask-cloud-platform they’re experiencing slow performance of their applications some of the time.

  • Impact: Slow performance of 25% of ingress traffic

  • Context:

    • Following an AWS incident the day before, one of the three network interfaces on the ‘default’ ingress controllers was experiencing slow performance.
    • AWS claim, “the health checking subsystem did not correctly detect some of your targets as unhealthy, which resulted in clients timing out when they attempted to connect to one of your Network Load Balancer (NLB) Elastic IP’s (EIPs)”.
    • AWS go on to say, “The Network Load Balancer (NLB) has a health checking subsystem that checks the health of each target, and if a target is detected as unhealthy it is removed from service. During this issue, the health checking subsystem was unaware of the health status of your targets in one of the Availability Zones (AZ)”.
    • Timeline: timeline for the incident
    • Slack thread: #cloud-platform-update for the incident
  • Resolution:

    • AWS internal components have been restarted. AWS say, “The root cause was a latent software race condition that was triggered when some of the health checking instances were restarted. Since the health checking subsystem was unaware of the targets, it did not return a health check status for a specific Availability Zone (AZ) of the NLB”.
    • They (AWS) go onto say, “We restarted the health checking subsystem, which caused it to refresh the list of targets, after this the NLB was recovered in the impacted AZ”.
  • Review actions:

    • Mitigation tickets raised following a post-incident review: https://github.com/ministryofjustice/cloud-platform/issues?q=is%3Aissue+is%3Aopen+post-aws-incident

Q1 2022 (January to March)

  • Mean Time to Repair: 1h 05m

  • Mean Time to Resolve: 1h 24m

Incident on 2022-03-10 11:48 - All ingress resources using *.apps.live.cloud-platform urls showing certificate issue

  • Key events

    • First detected 2022-03-10 11:48
    • Incident declared 2022-03-10 11:50
    • Repaired 2022-03-10 11:56
    • Resolved 2022-03-10 11:56
  • Time to repair: 8m

  • Time to resolve: 8m

  • Identified: Users reported in #ask-cloud-platform that they are seeing errors for CP domain urls. Hostname/IP does not match certificate's altnames

  • Impact: All ingress resources using the *apps.live.cloud-platform.service.justice.gov.uk have mismatched certificates.

  • Context:

    • Occurred immediately following a terraform apply to a test cluster
    • The change amended the default certificate of live cluster to *.apps.yy-1003-0100.cloud-platform.service.justice.gov.uk.
    • Timeline: timeline for the incident
    • Slack thread: #ask-cloud-platform for the incident
  • Resolution:

    • The immediate repair was to perform an inline edit of the default certificate in live, adding the wildcard dnsNames *.apps.live, *.live, *.apps.live-1 and *.live-1 back to the default certificate, i.e. reverting the faulty change.
    • Further investigation followed, finding that the cause of the incident was the environment variable KUBE_CONFIG being set to a config path which had the live context set
    • The terraform kubectl provider used to apply kubectl_manifest resources uses the environment variables KUBECONFIG and KUBE_CONFIG_PATH, but it has been found that it can also use the variable KUBE_CONFIG, causing the certificate to be applied to the wrong cluster.
  • Review actions:

    • Ticket raised to configure kubectl provider to use data source #3589

Incident on 2022-01-22 11:57 - some DNS records got deleted at the weekend

  • Key events

    • First detected 2022-01-22 11:57
    • Incident declared 2022-01-22 14:41
    • Repaired 2022-01-22 13:59
    • Resolved 2022-01-22 14:38
  • Time to repair: 2h 2m

  • Time to resolve: 2h 41m

  • Identified: Pingdom alerted an LAA developer to some of their sites becoming unavailable. They reported this to CP team via Slack #ask-cloud-platform, and the messages were spotted by on-call engineers

  • Impact:

    • Sites affected:
    • 2 production sites were unavailable:
      • laa-fee-calculator-production.apps.live-1.cloud-platform.service.justice.gov.uk
      • legal-framework-api.apps.live-1.cloud-platform.service.justice.gov.uk
    • 3 production sites had minor issues - unavailable on domains that only MOJ staff use
    • 46 non-production sites were unavailable on domains that only MOJ staff use
    • Impact on users was negligible. The 2 sites where external users would have experienced the unavailability are typically used by office staff for generally non-urgent work, and this incident occurred during the weekend.
  • Context:

  • Resolution:

    • external-dns was trying to restore the DNS records, but it was receiving errors when writing, due to missing annotations (external-dns.alpha.kubernetes.io/aws-weight) in an unrelated ingress. Manually adding the annotations restored the DNS.
  • Review actions:

    • Create guidance about internal traffic and domain names, and advertise to users in slack #3497
    • Create pingdom alerts for test helloworld apps #3498
    • Investigate if external-dns sync functionality is enough for the DNS cleanup #3499
    • Change the ErrorsInExternalDNS alarm to high priority #3500
    • Create a runbook to handle ErrorsInExternalDNS alarm #3501
    • Assign someone to be the ‘hammer’ on Fridays

Q4 2021 (October to December)

  • Mean Time to Repair: 1h 17m

  • Mean Time to Resolve: 1h 17m

Incident on 2021-11-05 - ModSec ingress controller is erroring

  • Key events

    • First detected 2021-11-05 9:29
    • Repaired 2021-11-05 10:46
    • Incident declared 2021-11-05 9:29
    • Resolved 2021-11-05 10:46
  • Time to repair: 1h 17m

  • Time to resolve: 1h 17m

  • Identified: Low priority alarms

  • Impact:

    • No users reported issues. Impacted only one pod.
  • Context:

  • Resolution:

    • Pod restarted.
  • Review actions:

    • N/A

Q3 2021 (July-September)

  • Mean Time to Repair: 3h 28m

  • Mean Time to Resolve: 11h 4m

Incident on 2021-09-30 - SSL Certificate Issue in browsers

  • Key events

    • First detected 2021-09-30 15:31
    • Repaired 2021-10-01 10:29
    • Incident declared 2021-09-30 17:26
    • Resolved 2021-10-01 13:09
  • Time to repair: 5h 3m

  • Time to resolve: 7h 43m

  • Identified: User reported that they are getting SSL certificate errors when browsing sites which are hosted on Cloud Platform

  • Impact:

    • 300 LAA caseworkers and thousands of DOM1 users using CP-based digital services would have been affected if it had been during office hours. They had Firefox as a fallback and there were no actual reports.
    • Public users - No reports.
  • Context:

  • Resolution:

    • The new certificate was pushed to DOM1 and Quantum machines by the engineers who have been contracted to manage these devices
  • Review actions:

    • How to get latest announcements/ releases of components used in CP stack? Ticket raised #3262
    • Can we use AWS Certificate Manager instead of Letsencrypt? Ticket raised #3263
    • How would the team escalate a major incident e.g. CP goes down. Runbook page here
    • How we can get visibility of ServiceNow service issues for CP-hosted services. Ticket raised 3264

Incident on 2021-09-04 22:05 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN

  • Key events

    • First detected 2021-09-04 22:05
    • Repaired 2021-09-05 12:16
    • Incident declared 2021-09-05 12:53
    • Resolved 2021-09-05 12:27
  • Time to repair: 5h 16m

  • Time to resolve: 5h 27m

  • Identified: Prometheus Pod restarted several times with error OOMKilled causing Prometheus Healthcheck to go down

  • Impact:

    • The monitoring system of the cluster was not available
    • All application metrics were lost during that time period
  • Context:

  • Resolution:

    • Increased the memory limit for Prometheus container from 25Gi to 50Gi
  • Review actions:

    • Created a ticket to configure Thanos querier to query data for longer period
    • Created a ticket to add an alert to check when prometheus container hit 90% resource limit set
    • Created a ticket to create a grafana dashboard to display queries that take more than 1 minute to complete
    • Increase the memory limit for the Prometheus container to 60Gi PR #105
    • Test PagerDuty settings so that the Cloud Platform on-call person receives the alarm immediately on their phone at weekends when a high priority alert is triggered

Incident on 2021-07-12 15:24 - All ingress resources using *apps.live-1 domain names stop working

  • Key events

    • First detected 2021-07-12 15:44
    • Repaired 2021-07-12 15:51
    • Incident declared 2021-07-12 16:09
    • Resolved 2021-07-13 11:49
  • Time to repair: 0h 07m

  • Time to resolve: 20h 03m

  • Identified: User reported in #ask-cloud-platform an error from the APM monitoring platform Sentry: Hostname/IP does not match certificate's altnames

  • Impact: All ingress resources using the *apps.live-1.cloud-platform.service.justice.gov.uk have mismatched certificates.

  • Context:

    • Occurred immediately following an upgrade to the default certificate of “live” clusters (PR here: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller/pull/20)
    • The change amended the default certificate in the live-1 cluster to *.apps.manager.cloud-platform.service.justice.gov.uk.
    • Timeline: timeline
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution:

    • The immediate repair was simple: perform an inline edit of the default certificate in live-1, replacing the word manager with live-1, i.e. reverting the faulty change.
    • Further investigation ensued, finding the cause of the incident was actually an underlying bug in the infrastructure apply pipeline used to perform a terraform apply against manager.
    • This bug had been around from the creation of the pipeline but had never surfaced.
    • The pipeline uses an environment variable named KUBE_CTX to context switch between clusters. This works for resources using the terraform provider, however, not for null_resources, causing the change in the above PR to apply to the wrong cluster.
  • Review actions:

    • Provide guidance on namespace to namespace traffic - using network policy not ingress (and advertise it to users). Ticket #3082
    • Monitoring the cert - Kuberhealthy monitors key things including the cert. Could replace several of the integration tests that take longer. Ticket #3044
    • Canary app should have #high-priority-alerts after 2 minutes if it goes down. DONE in PR #5126
    • Fix the pipeline: in the cloud-platform-cli, create an assertion to ensure the cluster name is equal to the terraform workspace name. To prevent the null-resources acting on the wrong cluster. PR exists
    • Created a ticket to migrate all terraform null_resources within our modules to terraform kubectl provider
    • Created a ticket to set terraform kubernetes credentials dynamically (at executing time)
    • Fix the pipeline: Before the creation of Terraform resources, add a function in the cli to perform a kubectl context switch to the correct cluster. PR exists

Q2 2021 (April-June)

  • Mean Time to Repair: 2h 32m

  • Mean Time to Resolve: 2h 44m

Incident on 2021-06-09 12:47 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade

  • Key events

    • First detected 2021-06-09 13:15
    • Repaired 2021-06-09 13:46
    • Incident declared 2021-06-09 13:54
    • Resolved 2021-06-09 13:58
  • Time to repair: 0h 31m

  • Time to resolve: 0h 43m

  • Identified: User reported in #ask-cloud-platform an error when deploying UAT application: kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://modsec01-nx-modsec-admission.ingress-controllers.svc:443/networking/v1beta1/ingresses?timeout=10s: x509: certificate is valid for modsec01-nx-controller-admission, modsec01-nx-controller-admission.ingress-controllers.svc, not modsec01-nx-modsec-admission.ingress-controllers.svc

  • Impact: It blocked all ingress API calls, so no new ingresses could be created, nor changes to current ingresses could be deployed, which included all user application deployments.

  • Context:

    • Occurred immediately following an upgrade to the ModSec Ingress-controller module v3.33.0, which apparently successfully deployed
    • It caused any new ingress or changes to current ingresses to be blocked by the ModSec Validation webhook
    • Timeline: Timeline for the incident.
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution: Rollback to ModSec Ingress-controller module v0.0.7

  • Review actions:

    • Find out why this issue didn’t get flagged in the test cluster - try to reproduce the issue - maybe need another test? Ticket #2972
    • Add test that checks the alerts in alertmanager in smoke tests. Ticket #2973
    • Add helloworld app that uses modsec controller, for the smoke tests to check traffic works. Ticket #2974
    • Modsec module, new version, needs to be working on EKS for live-1 and live (neither the old nor the new version works on live). Ticket #2975

Incident on 2021-05-10 12:15 - Apply Pipeline downtime due to accidental destroy of Manager cluster

  • Key events

    • First detected 2021-05-10 12:15
    • Incident not declared, but later agreed it was one
    • Repaired 2021-05-10 16:48
    • Resolved 2021-05-11 10:00
  • Time to repair: 4h 33m

  • Time to resolve: 4h 45m

  • Identified: A CP team member ran ‘terraform destroy components’, intending it to destroy a test cluster, but it was run against the Manager cluster by mistake. They were immediately aware of the error.

  • Impact:

    • Users couldn’t create or change their namespace definitions or AWS resources, due to Concourse being down
  • Context:

  • Resolution:

    • Manager cluster was recreated.
    • During this we encountered a certificate issue with Concourse, so it was restored manually. The terraform had got out of date for the Manager cluster.
    • Route53 zones were hard-coded and had to be changed manually.
  • Actions following review:

    • Spike ways to avoid applying to wrong cluster - see 3 options above. Ticket #3016
    • Try ‘Prevent destroy’ setting on R53 zone - Ticket #2899
    • Disband the cloud-platform-concourse repository. This includes Service accounts, and pipelines. We should split this repository up and move it to the infra/terraform-concourse repos. Ticket #3017
    • Manager needs to use our PSPs instead of eks-privilege - this has already been done.

Q1 2021 (January - March)

  • Mean Time to Repair: N/A

  • Mean Time to Resolve: N/A

No incidents declared


Q4 2020 (October - December)

  • Mean Time to Repair: 2h 8m

  • Mean Time to Resolve: 8h 46m

Incident on 2020-10-06 09:07 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers

  • Key events

    • First detected 2020-10-06 08:33
    • Incident declared 2020-10-06 09:07
    • Repaired 2020-10-06 10:41
    • Resolved 2020-10-06 17:19
  • Time to repair: 2h 8m

  • Time to resolve: 8h 46m

  • Identified: User reported service problems in #ask-cloud-platform. Confirmed by checking Pingdom

  • Impact:

    • Numerous brief and intermittent outages for multiple (but not all) services (production and non-production) which were using dedicated ingress controllers
  • Context:

    • Occurred immediately after upgrading live-1 to kubernetes 1.17
    • 1.17 creates 2 additional SecurityGroupRules per ingress-controller, which took us over a hard AWS limit
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • Migrate all ingresses back to the default ingress controller

Q3 2020 (July - September)

  • Mean Time To Repair: 59m

  • Mean Time To Resolve: 7h 13m

Incident on 2020-09-28 13:10 - Termination of nodes updating kops Instance Group.

  • Key events

    • First detected 2020-09-28 13:14
    • Incident declared 2020-09-28 14:05
    • Repaired 2020-09-28 14:20
    • Resolved 2020-09-28 14:44
  • Time to repair: 0h 15m

  • Time to resolve: 1h 30m

  • Identified: Periods of downtime while the cloud-platform team was applying a per-Availability-Zone instance group change for worker nodes in live-1. Failures were mainly caused by the termination of a group of 9 nodes and letting kops handle the cycling of pods, which took a very long time for the new containers to be created in the new node group.

  • Impact:

    • Some users noticed cycling of pods but taking a long time for the containers to be created.
    • Prometheus/alertmanager/kibana health check failures.
    • Users noticed short-lived pingdom alerts & health check failures.
  • Context:

    • The kops node group (nodes-1.16.13) minSize was updated from 25 to 18 nodes and kops update cluster --yes was run; this terminated 9 nodes from the existing worker node group (nodes-1.16.13).
    • Pods are in pending status for a long time waiting to be scheduled in the new nodes.
    • Teams using their own ingress-controller have 1 replica for non-prod namespaces, causing some pingdom alerts & health check failures.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • This was resolved by cordoning and draining nodes one by one before deleting the instance group.

Incident on 2020-09-21 18:27 - Some cloud-platform components destroyed.

  • Key events

    • First detected 2020-09-21 18:27
    • Incident declared 2020-09-21 18:40
    • Repaired 2020-09-21 19:05
    • Resolved 2020-09-21 21:41
  • Time to repair: 0h 38m

  • Time to resolve: 3h 14m

  • Identified: Some components of our production kubernetes cluster (live-1) were accidentally deleted, which caused some services running on the Cloud Platform to go down.

  • Impact:

    • Some users could not access services running on the Cloud Platform.
    • Prometheus/alertmanager/grafana is not accessible.
    • kibana is not accessible.
    • Cannot create new certificates.
  • Context:

    • A test cluster deletion script was triggered to delete a test cluster; the kube context incorrectly targeted the live-1 cluster and deleted some cloud-platform components.
    • Components included the default ingress-controller, prometheus-operator, logging, cert-manager, kiam and external-dns. As the ingress-controller had gone down, some users could not access services running on the Cloud Platform.
    • Formbuilder services were not accessible even after the ingress-controller was restored.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • The team prioritised restoring the default ingress controller. The ingress-controller depends on external-dns to update Route53 records with the new NLB, and on kiam to provide AWSAssumeRole for external-dns; these components (ingress-controller, external-dns and kiam) were restored successfully and services started to come back up.
    • Formbuilder services were still pointing to the old NLB (the network load balancer in place before the ingress got replaced). The reason was that the Route53 TXT records had an incorrect owner field set, so external-dns couldn't update the new NLB information in the A record. The team fixed the owner information in the TXT record, external-dns updated the formbuilder Route53 records to point to the new NLB, and the formbuilder services came back up and running.
    • The team did a targeted apply to restore the remaining components.
    • The apply pipeline was run to restore all the certificates, servicemonitors and prometheus-rules from the environment repository.

Incident on 2020-09-07 12:54 - All users are unable to create new ingress rules

  • Key events

    • First detected 2020-09-07 12:39
    • Incident declared 2020-09-07 12:54
    • Resolved 2020-09-07 15:56
  • Time to repair: 3h 02m

  • Time to resolve: 3h 17m

  • Identified: The Ingress API refused 100% of POST requests.

  • Impact:

    • If a user were to provision a new service, they would be unable to create an ingress into the cluster.
  • Context:

    • Version 0.1.0 of the team's ingress controller module enabled the creation of a ValidatingWebhookConfiguration resource.
    • By enabling this option we created a single point of failure for all ingress-controller pods in the ingress-controller namespace.
    • A new 0.1.0 ingress controller failed to create in the “live-1” cluster due to AWS resource limits.
    • Validation webhook stopped new rules from creating, with the error: Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post offender-categorisation-prod-nx-controller-admission.ingress-controllers.svc:443/extensions/v1beta1/ingresses?timeout=30s: x509: certificate signed by unknown authority
    • Initial investigation thread: https://mojdt.slack.com/archives/C514ETYJX/p1599478794246900
    • Incident declared: https://mojdt.slack.com/archives/C514ETYJX/p1599479640251900
  • Resolution: The team manually removed all the additional admission controllers created by 0.1.0. They then removed the admission webhook from the module and created a new release (0.1.1). All ingress modules currently on 0.1.0 were upgraded to the new release 0.1.1.

Incident on 2020-08-25 11:26 - Connectivity issues with eu-west-2a

  • Key events

    • First detected 2020-08-25 11:01
    • Incident declared 2020-08-25 11:26
    • Resolved 2020-08-25 12:11
  • Time to repair: 0h 45m

  • Time to resolve: 1h 10m

  • Identified: The AWS Availability Zone eu-west-2a, which contains some of our kubernetes nodes, had an outage. API latency was elevated, some EC2 instances became unreachable and overall connectivity was unstable.

  • Impact:

    • Two kubernetes nodes became unreachable
    • No new node could be launched in eu-west-2a
    • Kubernetes had issues talking to some of these nodes, preventing some API calls from succeeding (Pods were not terminating)
    • New pods were not able to pull their Docker images.
  • Context:

    • Pods and Nodes sitting in other Availability Zones (b & c) were not impacted
    • Slack threads: Issue detected, Incident Declared,
    • We now have 25 nodes in the cluster, instead of 21
  • Resolution: The incident was mitigated by deploying 2-4 more nodes in healthy Availability Zones, manually deleting the non-responding pods, and terminating the impacted nodes

Incident on 2020-08-14 11:01 - Ingress-controllers crashlooping

  • Key events

    • First detected 2020-08-14 10:43
    • Incident declared 2020-08-14 11:01
    • Resolved 2020-08-14 11:38
  • Time to repair: 0h 37m

  • Time to resolve: 0h 55m

  • Identified: There are 6 replicas of the ingress-controller pod and 2 out of the 6 were crashlooping. A restart of the pods did not resolve the issue. As per a normal runbook process, a recycle of all pods was required. However after restarting pods 4 and 5, they also started to crashloop. The risk was when restarting pods 5 and 6 - all 6 pods could be down and all ingresses down for the cluster.

  • Impact:

    • Increased risk for all ingresses failing in the cluster if all 6 ingress-controller pods are in a crashloop state.
  • Context:

  • Resolution: A restart of the leader ingress-controller pod was required so the other pods in the replica-set could connect and get the latest nginx.config file.

Incident on 2020-08-07 16:39 - Master node provisioning failure

  • Key events

    • First detected 2020-08-07 15:51
    • Repaired 2020-08-07 16:29
    • Incident declared 2020-08-07 16:39
    • Resolved 2020-08-14 10:06
  • Time to repair: 0h 38m

  • Time to resolve: 33h 15m (during support hours 10:00-17:00 M-F)

  • Identified: Routine replacement of a master node failed because AWS did not have any c4.4xlarge instances available in the relevant availability zone.

  • Impact:

    • Increased risk because the cluster was running on 2 out of 3 master nodes, for a brief period
  • Context:

  • Resolution:

    • A new c4.4xlarge node was successfully (and automatically) launched approx. 40 minutes after we saw the problem
    • We replaced all our master nodes with c5.4xlarge instances, which (currently) have better availability
    • We and AWS are still investigating longer-term and more reliable fixes

Q2 2020 (April - June)

  • Mean Time To Repair: 2h 49m

  • Mean Time To Resolve: 7h 12m

Incident on 2020-08-04 17:13

  • Key events

    • Fault occurs 2020-08-04 13:30
    • Fault detected 2020-08-04 18:13
    • Incident declared 2020-08-05 11:04
    • Resolved 2020-08-05 16:16
  • Time to repair: 5h 8m

  • Time to resolve: 9h 16m (during support hours 10:00-17:00)

  • Identified: Integration tests failed for cert-manager, the apply pipeline failed showing it does not have permissions, and the divergence pipeline showed drift for live-1 components

  • Impact:

    • Increased risk for cluster failure because some of the components do not have the correct configuration needed for the live-1 production cluster
  • Context:

  • Resolution: Compared each resource configuration with the terraform state and applied the correct configuration from the code specific to the kops cluster

Incident on 2020-04-15 10:58 Nginx/TLS

  • Key events

    • Fault occurs 2020-04-15 07:15
    • Fault detected 2020-04-15 13:45
    • Incident declared 2020-04-15 14:39
    • Resolved 2020-04-15 15:09
  • Status: Resolved at 2020-04-15 15:09 UTC

  • Time to repair: 0h 30m

  • Time to resolve: 5h 09m (during support hours 10:00-17:00)

  • Identified: After an upgrade of the Nginx ingresses, support for legacy TLS was dropped.

  • Impact:

    • IE11 users could not access any services running on the Cloud Platform
    • A few teams came forward with the issue:
    • LAA
    • Correspondence Tool
    • Prisoner Money
  • Context:

  • Resolution: The Nginx configuration was modified to enable TLSv1, TLSv1.1 and TLSv1.2


Q1 2020 (January - March)

  • Mean Time To Repair: 1h 22m

  • Mean Time To Resolve: 2h 36m

Incident on 2020-02-25 10:58

  • Key events

    • Fault occurs 2020-02-25 07:32
    • Team aware 2020-02-25 07:36
    • Incident declared 2020-02-25 10:58
    • Resolved 2020-02-25 17:07
  • Time to repair: 4h 9m

  • Time to resolve: 7h (during support hours 10:00-17:00)

  • Identified: During an upgrade, new masters were not coming up correctly (missing calico networking and other pods)

  • Impact:

    • Degraded kubernetes API performance (because some API calls were being directed to non-functioning masters)
    • Increased risk of cluster failure, because we were running on a single master during the incident
  • Context:

    • Upgrading from kubernetes 1.13.12 to 1.14.10, kops 1.13.2 to 1.14.1
    • The first master was replaced fine, but the second didn’t have calico and some other essential pods, and was not functioning correctly
    • Attempting to roll back the upgrade, every new master exhibited the same problem
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1582628309085600
  • Resolution: The kube-system namespace has a label, openpolicyagent.org/webhook: ignore. This label tells the Open Policy Agent (OPA) that pods are allowed to run in this namespace on the master nodes. Somehow, this label got removed, so the OPA was preventing pods from running on each new master node as it came up, meaning the new masters were unable to launch essential pods such as calico and fluentd.

Incident on 2020-02-18 14:13 UTC

  • Key events

    • Fault occurs 2020-02-18 14:13
    • Incident declared 2020-02-18 14:23
    • Resolved 2020-02-18 14:59
  • Time to repair: 0h 36m

  • Time to resolve: 0h 46m

  • Identified: Pingdom reported that Prometheus was down (prometheus.cloud-platform.service.justice.gov.uk).

  • Impact:

    • The prometheus dashboard was unavailable for everyone, for the whole duration of the incident.
    • Between 2020-02-18 14:22 and 2020-02-18 14:26, prometheus could not receive metrics.
  • Context:

    • Although the Prometheus URL was unreachable, Grafana and Alertmanager were resolving.
    • There seemed to be an issue preventing requests to reach the prometheus pods.
    • Disk space and other resources, the usual suspects, were ruled out as the cause.
    • The domain name and ingress were both valid.
    • Slack thread:
  • Resolution: We suspect an intermittent & external networking issue to be the cause of this outage.

Incident on 2020-02-12 11:45 UTC

  • Key events

    • Fault occurs 2020-02-12 11:45
    • Incident declared 2020-02-12 11:51
    • Resolved 2020-02-12 12:07
  • Time to repair: 0h 16m

  • Time to resolve: 0h 22m

  • Identified: Pingdom reported Concourse (concourse.cloud-platform.service.justice.gov.uk) down.

  • Context:

    • One of the engineers was deleting old clusters (he ran terraform destroy) and wasn't fully aware of which terraform workspace he was working in. Using terraform destroy, EKS nodes/workers were deleted from the manager cluster.
    • Slack thread:
  • Resolution: Using terraform (terraform apply -var-file vars/manager.tfvars specifically), the cluster nodes were recreated and the infrastructure aligned to the desired terraform state

About this incident log

The purpose of publishing this incident log:

  • for the Cloud Platform team to learn from incidents
  • for the Cloud Platform team and its stakeholders to track incident trends and performance
  • because we operate in the open

Definitions:

  • The words used in the timeline of an incident: fault occurs, team becomes aware (of something bad), incident declared (the team acknowledges and has an idea of the impact), repaired (system is fully functional), resolved (fully functional and future failures are prevented)
  • Incident time - The start of the failure (Before March 2020 it was the time the incident was declared)
  • Time to Repair - The time between the incident being declared (or when the team became aware of the fault) and when the service is fully restored. Only includes Hours of Support.
  • Time to Resolve - The time between when the fault occurs and when the system is fully functional (and includes any immediate work done to prevent future failures). Only includes Hours of Support. This is a broader metric of incident response performance, compared to Time to Repair.

Source: Atlassian
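
As a worked example of these definitions, here is a minimal Go sketch that derives the two durations from an incident's key events, using the datestamp format noted below; the timestamps are illustrative (not from a specific incident above) and clipping to Hours of Support is omitted for brevity:

```go
package main

import (
	"fmt"
	"time"
)

// layout matches the YYYY-MM-DD HH:MM datestamp format used in this log.
const layout = "2006-01-02 15:04"

func mustParse(s string) time.Time {
	t, err := time.Parse(layout, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// Illustrative key events only.
	faultOccurs := mustParse("2024-01-10 08:30")
	teamAware := mustParse("2024-01-10 09:00")
	repaired := mustParse("2024-01-10 10:15")
	resolved := mustParse("2024-01-10 11:00")

	// Time to Repair: from the team becoming aware (or the incident being
	// declared) to the repair. Time to Resolve: from the fault occurring to
	// the resolution. Hours-of-Support clipping is left out of this sketch.
	fmt.Println("Time to repair: ", repaired.Sub(teamAware))   // 1h15m0s
	fmt.Println("Time to resolve:", resolved.Sub(faultOccurs)) // 2h30m0s
}
```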

Datestamps: please use YYYY-MM-DD HH:MM (almost ISO 8601, but more readable), for the London timezone

Template

Incident on YYYY-MM-DD HH:MM - [Brief description]

  • Key events

    • First detected YYYY-MM-DD HH:MM
    • Incident declared YYYY-MM-DD HH:MM
    • Repaired YYYY-MM-DD HH:MM
    • Resolved YYYY-MM-DD HH:MM
  • Time to repair: Xh Xm

  • Time to resolve: Xh Xm

  • Identified:

  • Impact:

  • Context:

    • Timeline: [Timeline](url of google document) for the incident
    • Slack thread: [Slack thread](url of primary incident thread) for the incident.
  • Resolution:

  • Review actions: