Incident Log

Use the mean-time-to-repair.rb script to view performance metrics

Q1 2021 (January - March)

  • Mean Time To Repair: N/A

  • Mean Time To Resolve: N/A

No incidents declared

Q4 2020 (October - December)

  • Mean Time To Repair: 2h 8m

  • Mean Time To Resolve: 8h 46m

Incident on 2020-10-06 09:07 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers

  • Key events

    • First detected 2020-10-06 08:33
    • Incident declared 2020-10-06 09:07
    • Repaired 2020-10-06 10:41
    • Resolved 2020-10-06 17:19
  • Time to repair: 2h 8m

  • Time to resolve: 8h 46m

  • Identified: User reported service problems in #ask-cloud-platform. Confirmed by checking Pingdom

  • Impact:

    • Numerous brief and intermittent outages for multiple (but not all) services (production and non-production) which were using dedicated ingress controllers
  • Context:

    • Occurred immediately after upgrading live-1 to kubernetes 1.17
    • Kubernetes 1.17 creates 2 additional SecurityGroupRules per ingress-controller, which took us over a hard AWS limit
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • Migrate all ingresses back to the default ingress controller
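
A minimal sketch of moving a single ingress back to the default controller (namespace, ingress name and class value are illustrative, not taken from the incident):

    # point the ingress at the default nginx ingress controller again
    kubectl -n <namespace> annotate ingress <ingress-name> \
      kubernetes.io/ingress.class=nginx --overwrite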

Q3 2020 (July - September)

  • Mean Time To Repair: 1h 9m

  • Mean Time To Resolve: 7h 26m

Incident on 2020-09-28 13:10 - Termination of nodes while updating a kops Instance Group

  • Key events

    • First detected 2020-09-28 13:14
    • Incident declared 2020-09-28 14:05
    • Repaired 2020-09-28 14:20
    • Resolved 2020-09-28 14:44
  • Time to repair: 0h 15m

  • Time to resolve: 1h 30m

  • Identified: Periods of downtime while the cloud-platform team was applying the per-Availability-Zone instance group change for worker nodes in live-1. The failures were mainly caused by terminating a group of 9 nodes and letting kops handle the cycling of pods, which meant new containers took a very long time to be created in the new node group.

  • Impact:

    • Some users noticed pods being cycled, with the containers taking a long time to be created.
    • Prometheus/alertmanager/kibana health check failures.
    • Users noticed short-lived pingdom alerts & health check failures.
  • Context:

    • The kops node group (nodes-1.16.13) had its minSize updated from 25 to 18 nodes and kops update cluster --yes was run; this terminated 9 nodes from the existing worker node group (nodes-1.16.13).
    • Pods were in pending status for a long time, waiting to be scheduled on the new nodes.
    • Teams using their own ingress-controller run only 1 replica in non-prod namespaces, causing some pingdom alerts & health check failures.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • This was resolved by cordoning and draining nodes one by one before deleting the instance group, as sketched below.
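
A rough sketch of that approach (the instance group name is taken from the notes above; the commands and flags reflect the kubectl/kops versions in use at the time and are illustrative, not the exact runbook):

    for node in $(kubectl get nodes -l kops.k8s.io/instancegroup=nodes-1.16.13 -o name); do
      kubectl cordon "$node"                                         # stop new pods landing on the node
      kubectl drain "$node" --ignore-daemonsets --delete-local-data  # evict workloads gracefully
    done
    kops delete instancegroup nodes-1.16.13 --yes                    # remove the old instance group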

Incident on 2020-09-21 18:27 - Some cloud-platform components destroyed.

  • Key events

    • First detected 2020-09-21 18:27
    • Incident declared 2020-09-21 18:40
    • Repaired 2020-09-21 19:05
    • Resolved 2020-09-21 21:41
  • Time to repair: 0h 38m

  • Time to resolve: 3h 14m

  • Identified: Some components of our production kubernetes cluster (live-1) were accidentally deleted, which caused some services running on the Cloud Platform to go down.

  • Impact:

    • Some users could not access services running on the Cloud Platform.
    • Prometheus/alertmanager/grafana were not accessible.
    • Kibana was not accessible.
    • New certificates could not be created.
  • Context:

    • A test cluster deletion script was triggered to delete a test cluster, but the kube context incorrectly targeted the live-1 cluster and deleted some cloud-platform components.
    • The components included the default ingress-controller, prometheus-operator, logging, cert-manager, kiam and external-dns. As the ingress-controller went down, some users could not access services running on the Cloud Platform.
    • Formbuilder services were not accessible even after the ingress-controller was restored.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • The team prioritised restoring the default ingress controller. The ingress-controller depends on external-dns to update route53 records with the new NLB, and on kiam to provide the AWSAssumeRole for external-dns; these components (ingress-controller, external-dns and kiam) were restored successfully and services started to come back up.
    • Formbuilder services were still pointing to the old NLB (the network load balancer in place before the ingress was replaced). This was because the route53 TXT records had an incorrect owner field, so external-dns could not update the new NLB information in the A records. The team fixed the owner information in the TXT records, external-dns updated the formbuilder route53 records to point to the new NLB, and the formbuilder services came back up.
    • The team did a targeted apply to restore the remaining components (see the sketch below).
    • The apply pipeline was run to restore all the certificates, servicemonitors and prometheus-rules from the environment repository.
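
For illustration, a targeted apply of individual components looks roughly like this (the module addresses are hypothetical, not the actual cloud-platform-infrastructure layout):

    # re-create only the named components, leaving the rest of the state untouched
    terraform apply \
      -target=module.ingress_controllers \
      -target=module.external_dns \
      -target=module.kiam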

Incident on 2020-09-07 12:54 - All users are unable to create new ingress rules

  • Key events

    • First detected 2020-09-07 12:39
    • Incident declared 2020-09-07 12:54
    • Resolved 2020-09-07 15:56
  • Time to repair: 3h 02m

  • Time to resolve: 3h 17m

  • Identified: The Ingress API refused 100% of POST requests.

  • Impact:

    • If a user were to provision a new service, they would be unable to create an ingress into the cluster.
  • Context:

    • Version 0.1.0 of the team's ingress controller module enabled the creation of a validatingwebhookconfiguration resource.
    • By enabling this option we created a single point of failure for all ingress-controller pods in the ingress-controller namespace.
    • A new 0.1.0 ingress controller failed to create in the “live-1” cluster due to AWS resource limits.
    • Validation webhook stopped new rules from creating, with the error: Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://offender-categorisation-prod-nx-controller-admission.ingress-controllers.svc:443/extensions/v1beta1/ingresses?timeout=30s: x509: certificate signed by unknown authority
    • Initial investigation thread: https://mojdt.slack.com/archives/C514ETYJX/p1599478794246900
    • Incident declared: https://mojdt.slack.com/archives/C514ETYJX/p1599479640251900
  • Resolution: The team manually removed all the additional admission controllers created by 0.1.0. They then removed the admission webhook from the module and created a new release (0.1.1). All ingress modules currently on 0.1.0 were upgraded to the new release, 0.1.1.
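
A hedged sketch of how the stray webhook configurations could be located and removed (the name shown is a placeholder):

    # list the validating webhooks registered in the cluster
    kubectl get validatingwebhookconfigurations
    # delete each per-team admission webhook created by the 0.1.0 module
    kubectl delete validatingwebhookconfiguration <webhook-name>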

Incident on 2020-08-25 11:26 - Connectivity issues with eu-west-2a

  • Key events

    • First detected 2020-08-25 11:01
    • Incident declared 2020-08-25 11:26
    • Resolved 2020-08-25 12:11
  • Time to repair: 0h 45m

  • Time to resolve: 1h 10m

  • Identified: The AWS Availability Zone eu-west-2a, which contains some of our kubernetes nodes, had an outage. API latency was elevated, some EC2 instances became unreachable and overall connectivity was unstable.

  • Impact:

    • Two kubernetes nodes became unreachable
    • No new node could be launched in eu-west-2a
    • Kubernetes had issues talking to some of these nodes, preventing some API calls from succeeding (pods were not terminating)
    • New pods were not able to pull their Docker images.
  • Context:

    • Pods and Nodes sitting in other Availability Zones (b & c) were not impacted
    • Slack threads: Issue detected, Incident Declared
    • We now have 25 nodes in the cluster, instead of 21
  • Resolution: The incident was mitigated by deploying 2-4 more nodes in healthy Availability Zones, manually deleting the non-responding pods, and terminating the impacted nodes
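
Roughly, those mitigation steps map to commands like the following (pod, node and instance identifiers are placeholders; the exact commands used were not recorded here):

    # force-remove pods stuck on the unreachable nodes
    kubectl delete pod <pod-name> --grace-period=0 --force
    # remove the impacted nodes from the cluster and terminate the instances
    kubectl delete node <node-name>
    aws ec2 terminate-instances --instance-ids <instance-id>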

Incident on 2020-08-14 11:01 - Ingress-controllers crashlooping

  • Key events

    • First detected 2020-08-14 10:43
    • Incident declared 2020-08-14 11:01
    • Resolved 2020-08-14 11:38
  • Time to repair: 0h 37m

  • Time to resolve: 0h 55m

  • Identified: There were 6 replicas of the ingress-controller pod and 2 out of the 6 were crashlooping. A restart of the pods did not resolve the issue. As per the normal runbook process, a recycle of all pods was required. However, after restarting pods 4 and 5, they also started to crashloop. The risk was that, when restarting pods 5 and 6, all 6 pods could be down, taking down all ingresses for the cluster.

  • Impact:

    • Increased risk of all ingresses failing in the cluster if all 6 ingress-controller pods entered a crashloop state.
  • Context:

  • Resolution: A restart of the leader ingress-controller pod was required so the other pods in the replica set could connect to it and get the latest nginx config.
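
A sketch of that step (namespace, ConfigMap and pod names are illustrative; at the time the nginx ingress controller recorded its leader election state in a ConfigMap):

    # the ConfigMap's leader-election annotation names the current leader pod
    kubectl -n ingress-controllers get configmap ingress-controller-leader-nginx -o yaml
    # restarting the leader lets the remaining replicas re-sync their nginx config
    kubectl -n ingress-controllers delete pod <leader-pod-name>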

Incident on 2020-08-07 16:39 - Master node provisioning failure

  • Key events

    • First detected 2020-08-07 15:51
    • Repaired 2020-08-07 16:29
    • Incident declared 2020-08-07 16:39
    • Resolved 2020-08-14 10:06
  • Time to repair: 0h 38m

  • Time to resolve: 33h 15m (during support hours 10:00-17:00 M-F)

  • Identified: Routine replacement of a master node failed because AWS did not have any c4.4xlarge instances available in the relevant availability zone.

  • Impact:

    • Increased risk because the cluster was running on 2 out of 3 master nodes, for a brief period
  • Context:

  • Resolution:

    • A new c4.4xlarge node was successfully (and automatically) launched approx. 40 minutes after we saw the problem
    • We replaced all our master nodes with c5.4xlarge instances, which (currently) have better availability
    • We and AWS are still investigating longer-term and more reliable fixes

Q2 2020 (April - June)

  • Mean Time To Repair: 2h 5m

  • Mean Time To Resolve: 15h 53m

Incident on 2020-08-04 17:13

  • Key events

    • Fault occurs 2020-08-04 13:30
    • Fault detected 2020-08-04 18:13
    • Incident declared 2020-08-05 11:04
    • Resolved 2020-08-05 16:16
  • Time to repair: 5h 8m

  • Time to resolve: 9h 16m (during support hours 10:00-17:00)

  • Identified: Integration tests failed for cert-manager, the apply pipeline failed because it did not have the required permissions, and the divergence pipeline showed drift for the live-1 components

  • Impact:

    • Increased risk of cluster failure because some of the components did not have the correct configuration needed for the live-1 production cluster
  • Context:

  • Resolution: Compared each resource's configuration with the terraform state and applied the correct configuration from the code specific to the kops cluster
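
Sketched with generic terraform commands (the component address is hypothetical, not the actual module layout):

    # show drift between the code/state and the live configuration for one component
    terraform plan -target=module.cert_manager
    # inspect what the state currently records for it
    terraform state list | grep cert_manager
    # re-apply the configuration held in code
    terraform apply -target=module.cert_manager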

Incident on 2020-04-15 10:58 - Nginx/TLS

  • Key events

    • Fault occurs 2020-04-15 07:15
    • Fault detected 2020-04-15 13:45
    • Incident declared 2020-04-15 14:39
    • Resolved 2020-04-15 15:09
  • Status: Resolved at 2020-04-15 15:09 UTC

  • Time to repair: 0h 30m

  • Time to resolve: 5h 09m (during support hours 10:00-17:00)

  • Identified: After an upgrade of the Nginx ingresses, support for legacy TLS was dropped.

  • Impact:

    • IE11 users could not access any services running on the Cloud Platform
    • A few teams came forward with the issue:
      • LAA
      • Correspondence Tool
      • Prisoner Money
  • Context:

  • Resolution: The Nginx configuration was modified to enable TLSv1, TLSv1.1 and TLSv1.2
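
On the nginx ingress controller this setting is driven by the ssl-protocols key in its ConfigMap; a hedged sketch (namespace and ConfigMap name are illustrative):

    # re-enable the legacy TLS versions alongside TLSv1.2
    kubectl -n ingress-controllers patch configmap nginx-ingress-controller \
      --type merge -p '{"data":{"ssl-protocols":"TLSv1 TLSv1.1 TLSv1.2"}}'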

Q1 2020 (January - March)

  • Mean Time To Repair: 1h 22m

  • Mean Time To Resolve: 2h 36m

Incident on 2020-02-25 10:58

  • Key events

    • Fault occurs 2020-02-25 07:32
    • Team aware 2020-02-25 07:36
    • Incident declared 2020-02-25 10:58
    • Resolved 2020-02-25 17:07
  • Time to repair: 4h 9m

  • Time to resolve: 7h (during support hours 10:00-17:00)

  • Identified: During an upgrade, new masters were not coming up correctly (missing calico networking and other pods)

  • Impact:

    • Degraded kubernetes API performance (because some API calls were being directed to non-functioning masters)
    • Increased risk of cluster failure, because we were running on a single master during the incident
  • Context:

    • Upgrading from kubernetes 1.13.12 to 1.14.10, kops 1.13.2 to 1.14.1
    • The first master was replaced fine, but the second didn’t have calico and some other essential pods, and was not functioning correctly
    • Attempting to roll back the upgrade, every new master exhibited the same problem
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1582628309085600
  • Resolution: The kube-system namespace has a label, openpolicyagent.org/webhook: ignore. This label tells the Open Policy Agent (OPA) that pods are allowed to run in this namespace on the master nodes. Somehow, this label had been removed, so the OPA was preventing pods from running on each new master node as it came up, and the new masters were unable to launch essential pods such as calico and fluentd.
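
Re-applying that label (a sketch; not necessarily the exact command used at the time) looks like:

    # allow kube-system pods to schedule without OPA admission checks
    kubectl label namespace kube-system openpolicyagent.org/webhook=ignore --overwrite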

Incident on 2020-02-18 14:13 UTC

  • Key events

    • Fault occurs 2020-02-18 14:13
    • Incident declared 2020-02-18 14:23
    • Resolved 2020-02-18 14:59
  • Time to repair: 0h 36m

  • Time to resolve: 0h 46m

  • Identified: Pingdom reported that Prometheus was down (prometheus.cloud-platform.service.justice.gov.uk).

  • Impact:

    • The prometheus dashboard was unavailable for everyone, for the whole duration of the incident.
    • Between 2020-02-18 14:22 and 2020-02-18 14:26, prometheus could not receive metrics.
  • Context:

    • Although the Prometheus URL was unreachable, Grafana and Alertmanager were resolving.
    • There seemed to be an issue preventing requests from reaching the prometheus pods.
    • Disk space and other resources, the usual suspects, were ruled out as the cause.
    • The domain name and ingress were both valid.
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1582035803248800
  • Resolution: We suspect an intermittent & external networking issue to be the cause of this outage.

Incident on 2020-02-12 11:45 UTC

  • Key events

    • Fault occurs 2020-02-12 11:45
    • Incident declared 2020-02-12 11:51
    • Resolved 2020-02-12 12:07
  • Time to repair: 0h 16m

  • Time to resolve: 0h 22m

  • Identified: Pingdom reported Concourse (concourse.cloud-platform.service.justice.gov.uk) down.

  • Context:

    • One of the engineers was deleting old clusters (running terraform destroy) and was not fully aware of which terraform workspace he was working in. The terraform destroy deleted the EKS nodes/workers from the manager cluster.
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1581508273080900
  • Resolution: Using terraform (specifically terraform apply -var-file vars/manager.tfvars), the cluster nodes were recreated and the infrastructure was aligned to the desired terraform state
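
For context, terraform's workspace commands show which workspace a destructive run would act on; a minimal sketch of checking before running destroy:

    terraform workspace show   # print the currently selected workspace
    terraform workspace list   # list all workspaces, the current one is starred
    terraform plan -destroy    # preview exactly what a destroy would remove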

About this incident log

The purpose of publishing this incident log:

  • for the Cloud Platform team to learn from incidents
  • for the Cloud Platform team and its stakeholders to track incident trends and performance
  • because we operate in the open

Definitions:

  • The words used in the timeline of an incident: fault occurs, team becomes aware (of something bad), incident declared (the team acknowledges and has an idea of the impact), repaired (system is fully functional), resolved (fully functional and future failures are prevented)
  • Incident time - The start of the failure (Before March 2020 it was the time the incident was declared)
  • Time to Repair - The time between the incident being declared and when service is fully restored. Only includes Hours of Support.
  • Time to Resolve - The time between when the fault occurs and when the system is fully functional (including any immediate work done to prevent future failures). Only includes Hours of Support. This is a broader metric of incident response performance, compared to Time to Repair.

Source: Atlassian
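
As a worked example, for the 2020-04-15 Nginx/TLS incident above (support hours 10:00-17:00):

    Time to repair  = 15:09 (resolved) - 14:39 (declared)      = 0h 30m
    Time to resolve = 15:09 (resolved) - 10:00 (support start) = 5h 09m
                      (the fault occurred at 07:15, before support hours began)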

Datestamps: please use YYYY-MM-DD HH:MM (almost ISO 8601, but more readable), for the London timezone

Template

Incident on YYYY-MM-DD HH:MM - [Brief description]

  • Key events

  • Time to repair: Xh Xm

  • Time to resolve: Xh Xm

  • Identified:

  • Impact:

  • Context:

  • Resolution: