Incident Log

Use the mean-time-to-repair.rb script to view performance metrics

Q3 2021 (July-September)

  • Mean Time to Repair: 3h 28m

  • Mean Time to Resolve: 11h 4m

Incident on 2021-09-30 - SSL Certificate Issue in browsers

  • Key events

    • First detected 2021-09-30 15:31
    • Repaired 2021-10-01 10:29
    • Incident declared 2021-09-30 17:26
    • Resolved 2021-10-01 13:09
  • Time to repair: 5h 3m

  • Time to resolve: 7h 43m

  • Identified: User reported that they were getting SSL certificate errors when browsing sites hosted on the Cloud Platform (a diagnostic sketch follows this entry)

  • Impact:

    • Potentially 300 LAA caseworkers and thousands of DOM1 users of CP-based digital services, had it occurred during office hours. They had Firefox as a fallback, and no reports were actually received.
    • Public users - No reports.
  • Context:

  • Resolution:

    • The new certificate was pushed to DOM1 and Quantum machines by the engineers who have been contracted to manage these devices
  • Review actions:

    • How do we get the latest announcements/releases of components used in the CP stack? Ticket raised: #3262 (https://github.com/ministryofjustice/cloud-platform/issues/3262)
    • Can we use AWS Certificate Manager instead of Letsencrypt? Ticket raised: #3263 (https://github.com/ministryofjustice/cloud-platform/issues/3263)
    • How would the team escalate a major incident, e.g. CP goes down? Runbook page: https://runbooks.cloud-platform.service.justice.gov.uk/incident-process.html#3-3-communications-lead
    • How can we get visibility of ServiceNow service issues for CP-hosted services? Ticket raised: #3264 (https://github.com/ministryofjustice/cloud-platform/issues/3264)
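
As a diagnostic starting point for this class of issue, the certificate chain an ingress actually serves can be inspected with openssl. This is a minimal sketch; the hostname is a placeholder, not a real affected service.

    # Hypothetical hostname - substitute an affected service.
    HOST=example-service.apps.live-1.cloud-platform.service.justice.gov.uk

    # Print the subject, issuer and validity dates of the certificate the
    # ingress presents, which is usually enough to spot an expired or
    # unexpected issuer in the chain.
    echo | openssl s_client -connect "${HOST}:443" -servername "${HOST}" 2>/dev/null \
      | openssl x509 -noout -subject -issuer -dates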

Incident on 2021-09-04 22:05 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN

  • Key events

    • First detected 2021-09-04 22:05
    • Repaired 2021-09-05 12:16
    • Incident declared 2021-09-05 12:53
    • Resolved 2021-09-05 12:27
  • Time to repair: 5h 16m

  • Time to resolve: 5h 27m

  • Identified: The Prometheus pod restarted several times with an OOMKilled error, causing the Prometheus healthcheck to go down

  • Impact:

    • The monitoring system of the cluster was not available
    • All application metrics were lost during that time period
  • Context:

  • Resolution:

    • Increased the memory limit for the Prometheus container from 25Gi to 50Gi (see the sketch after the review actions below)
  • Review actions:

    • Created a ticket to configure the Thanos querier to query data over a longer period
    • Created a ticket to add an alert for when the Prometheus container hits 90% of its resource limit
    • Created a ticket to create a Grafana dashboard displaying queries that take more than 1 minute to complete
    • Increase the memory limit for the Prometheus container to 60Gi (PR #105)
    • Test the PagerDuty weekend settings so that the Cloud Platform on-call person receives the alarm on their phone immediately when a high-priority alert is triggered
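
The memory limit change in the resolution above would normally go through the terraform module, but the same effect can be sketched with kubectl against the Prometheus custom resource managed by prometheus-operator. The namespace and object name below are assumptions; list the Prometheus resources first to find the real ones.

    # Find the Prometheus custom resource (namespace and name below are assumptions).
    kubectl get prometheus --all-namespaces

    # Raise the container memory limit; the operator rolls the change out to
    # the prometheus pods it manages.
    kubectl -n monitoring patch prometheus cloud-platform-prometheus \
      --type merge \
      -p '{"spec":{"resources":{"limits":{"memory":"50Gi"}}}}'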

Incident on 2021-07-12 15:24 - All ingress resources using *apps.live-1 domain names stop working

  • Key events

    • First detected 2021-07-12 15:44
    • Repaired 2021-07-12 15:51
    • Incident declared 2021-07-12 16:09
    • Resolved 2021-07-13 11:49
  • Time to repair: 0h 07m

  • Time to resolve: 20h 03m

  • Identified: User reported in #ask-cloud-platform an error from the APM monitoring platform Sentry: Hostname/IP does not match certificate's altnames

  • Impact: All ingress resources using the *apps.live-1.cloud-platform.service.justice.gov.uk domain had mismatched certificates.

  • Context:

    • Occurred immediately following an upgrade to the default certificate of “live” clusters (PR here: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller/pull/20)
    • The change amended the default certificate in the live-1 cluster to *.apps.manager.cloud-platform.service.justice.gov.uk.
    • Timeline: timeline
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution:

    • The immediate repair was simple: perform an inline edit of the default certificate in live-1, replacing the word manager with live-1, i.e. reverting the faulty change.
    • Further investigation found that the cause of the incident was an underlying bug in the infrastructure apply pipeline used to perform a terraform apply against manager.
    • This bug had been present since the creation of the pipeline but had never surfaced.
    • The pipeline uses an environment variable named KUBE_CTX to context switch between clusters. This works for resources using the terraform provider but not for null_resources, which caused the change in the above PR to be applied to the wrong cluster.
  • Review actions:

    • Provide guidance on namespace-to-namespace traffic - using network policy, not ingress (and advertise it to users). Ticket #3082
    • Monitor the certificate - have Kuberhealthy monitor key things, including the certificate; this could replace several of the longer-running integration tests. Ticket #3044
    • The canary app should alert #high-priority-alerts within 2 minutes if it goes down. Done in PR #5126
    • Fix the pipeline: in the cloud-platform-cli, create an assertion to ensure the cluster name is equal to the terraform workspace name, to prevent the null_resources acting on the wrong cluster (a sketch of this check follows this list). PR exists
    • Created a ticket to migrate all terraform null_resources within our modules to the terraform kubectl provider
    • Created a ticket to set terraform kubernetes credentials dynamically (at execution time)
    • Fix the pipeline: before creating terraform resources, add a function to the cli that performs a kubectl context switch to the correct cluster. PR exists
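
A minimal sketch of the assertion described above, assuming the convention that the terraform workspace name matches the target cluster (kube context) name; the real cloud-platform-cli check may differ.

    set -euo pipefail

    workspace="$(terraform workspace show)"
    context="$(kubectl config current-context)"

    # Refuse to run terraform if the kube context does not match the workspace,
    # so null_resources cannot act on the wrong cluster.
    if [ "${workspace}" != "${context}" ]; then
      echo "Aborting: terraform workspace '${workspace}' does not match kube context '${context}'" >&2
      exit 1
    fi

    terraform apply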

Q2 2021 (April-June)

  • Mean Time to Repair: 2h 32m

  • Mean Time to Resolve: 2h 44m

Incident on 2021-06-09 12:47 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade

  • Key events

    • First detected 2021-06-09 13:15
    • Repaired 2021-06-09 13:46
    • Incident declared 2021-06-09 13:54
    • Resolved 2021-06-09 13:58
  • Time to repair: 0h 31m

  • Time to resolve: 0h 43m

  • Identified: User reported in #ask-cloud-platform an error when deploying UAT application: kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://modsec01-nx-modsec-admission.ingress-controllers.svc:443/networking/v1beta1/ingresses?timeout=10s: x509: certificate is valid for modsec01-nx-controller-admission, modsec01-nx-controller-admission.ingress-controllers.svc, not modsec01-nx-modsec-admission.ingress-controllers.svc

  • Impact: It blocked all ingress API calls, so no new ingresses could be created and no changes to existing ingresses could be deployed, which included all user application deployments.

  • Context:

    • Occurred immediately following an upgrade to the ModSec Ingress-controller module v3.33.0, which appeared to deploy successfully
    • It caused any new ingresses, or changes to existing ingresses, to be blocked by the ModSec validation webhook
    • Timeline: Timeline for the incident.
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution: Rolled back to ModSec Ingress-controller module v0.0.7 (a sketch for inspecting the failing webhook follows the review actions below)

  • Review actions:

    • Find out why this issue didn’t get flagged in the test cluster - try to reproduce the issue - maybe we need another test? Ticket #2972
    • Add a test that checks the alerts in Alertmanager in the smoke tests. Ticket #2973
    • Add a helloworld app that uses the modsec controller, so the smoke tests can check traffic works. Ticket #2974
    • The new version of the modsec module needs to work on EKS for live-1 and live (neither the old nor the new version works on live). Ticket #2975
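
To see which service an admission webhook actually calls (and therefore which names its certificate must cover), the webhook configuration can be inspected directly. The configuration name below is illustrative; find the real one with "kubectl get validatingwebhookconfigurations".

    # Show which service each webhook in the configuration targets, to compare
    # against the names reported in the x509 error.
    kubectl get validatingwebhookconfiguration modsec01-nx-modsec-admission \
      -o jsonpath='{range .webhooks[*]}{.name}{" -> "}{.clientConfig.service.namespace}{"/"}{.clientConfig.service.name}{"\n"}{end}'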

Incident on 2021-05-10 12:15 - Apply Pipeline downtime due to accidental destroy of Manager cluster

  • Key events

    • First detected 2021-05-10 12:15
    • Incident not declared, but later agreed it was one
    • Repaired 2021-05-10 16:48
    • Resolved 2021-05-11 10:00
  • Time to repair: 4h 33m

  • Time to resolve: 4h 45m

  • Identified: A CP team member ran ‘terraform destroy components’, intending to destroy a test cluster, but it ran against the Manager cluster by mistake. They were immediately aware of the error.

  • Impact:

    • Users couldn’t create or change their namespace definitions or AWS resources, due to Concourse being down
  • Context:

  • Resolution:

    • Manager cluster was recreated.
    • During this we encountered a certificate issue with Concourse, so it was restored manually; the terraform for the Manager cluster had become out of date.
    • Route53 zones were hard-coded and had to be changed manually.
  • Review actions:

    • Spike ways to avoid applying to the wrong cluster - see the 3 options above. Ticket #3016
    • Try the ‘prevent destroy’ setting on the R53 zone. Ticket #2899
    • Disband the cloud-platform-concourse repository. This includes service accounts and pipelines. We should split this repository up and move it into the infra/terraform-concourse repos. Ticket #3017
    • Manager needs to use our PSPs instead of eks.privileged - this has already been done.

Q1 2021 (January - March)

  • Mean Time to Repair: N/A

  • Mean Time to Resolve: N/A

No incidents declared

Q4 2020 (October - December)

  • Mean Time to Repair: 2h 8m

  • Mean Time to Resolve: 8h 46m

Incident on 2020-10-06 09:07 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers

  • Key events

    • First detected 2020-10-06 08:33
    • Incident declared 2020-10-06 09:07
    • Repaired 2020-10-06 10:41
    • Resolved 2020-10-06 17:19
  • Time to repair: 2h 8m

  • Time to resolve: 8h 46m

  • Identified: User reported service problems in #ask-cloud-platform. Confirmed by checking Pingdom

  • Impact:

    • Numerous brief and intermittent outages for multiple (but not all) services (production and non-production) which were using dedicated ingress controllers
  • Context:

    • Occurred immediately after upgrading live-1 to kubernetes 1.17
    • 1.17 creates 2 additional SecurityGroupRules per ingress-controller, which took us over a hard AWS limit (a sketch for checking rule counts follows this entry)
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • Migrate all ingresses back to the default ingress controller
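
A rough way to check how close a cluster’s security groups are to the per-group rule limit, assuming the AWS CLI and jq are available; the tag value and group id below are placeholders.

    # List the security groups tagged for the cluster (tag value is an assumption).
    aws ec2 describe-security-groups \
      --filters "Name=tag:KubernetesCluster,Values=live-1.cloud-platform.service.justice.gov.uk" \
      --query 'SecurityGroups[].{id:GroupId,name:GroupName}' --output table

    # Rough count of inbound + outbound rule entries for one group (id is a placeholder).
    aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
      | jq '.SecurityGroups[0] | (.IpPermissions | length) + (.IpPermissionsEgress | length)'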

Q3 2020 (July - September)

  • Mean Time To Repair: 1h 9m

  • Mean Time To Resolve: 7h 26m

Incident on 2020-09-28 13:10 - Termination of nodes while updating a kops Instance Group.

  • Key events

    • First detected 2020-09-28 13:14
    • Incident declared 2020-09-28 14:05
    • Repaired 2020-09-28 14:20
    • Resolved 2020-09-28 14:44
  • Time to repair: 0h 15m

  • Time to resolve: 1h 30m

  • Identified: Periods of downtime while the cloud-platform team was applying the per-Availability-Zone instance groups change for worker nodes in live-1. The failures were mainly caused by terminating a group of 9 nodes and letting kops handle the cycling of pods, which meant it took a very long time for the new containers to be created in the new node group.

  • Impact:

    • Some users noticed pods being cycled and containers taking a long time to be created.
    • Prometheus/alertmanager/kibana health check failures.
    • Users noticed short-lived pingdom alerts & health check failures.
  • Context:

    • The kops node group (nodes-1.16.13) minSize was updated from 25 to 18 nodes and kops update cluster --yes was run; this terminated 9 nodes from the existing worker node group (nodes-1.16.13).
    • Pods were in pending status for a long time, waiting to be scheduled on the new nodes.
    • Teams running their own ingress-controller have only 1 replica in non-prod namespaces, which caused some pingdom alerts & health check failures.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • This was resolved by cordoning and draining the nodes one by one before deleting the instance group (see the sketch below).
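
A minimal sketch of the cordon-and-drain approach from the resolution, assuming the old worker nodes carry the standard kops instance-group label; the drain flags reflect kubectl of that era.

    # Drain the old instance group's nodes one at a time before deleting the group.
    for node in $(kubectl get nodes -l kops.k8s.io/instancegroup=nodes-1.16.13 -o name); do
      kubectl cordon "${node}"
      kubectl drain "${node}" --ignore-daemonsets --delete-local-data --timeout=5m
    done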

Incident on 2020-09-21 18:27 - Some cloud-platform components destroyed.

  • Key events

    • First detected 2020-09-21 18:27
    • Incident declared 2020-09-21 18:40
    • Repaired 2020-09-21 19:05
    • Resolved 2020-09-21 21:41
  • Time to repair: 0h 38m

  • Time to resolve: 3h 14m

  • Identified: Some components of our production kubernetes cluster (live-1) were accidentally deleted, which caused some services running on the Cloud Platform to go down.

  • Impact:

    • Some users could not access services running on the Cloud Platform.
    • Prometheus/alertmanager/grafana were not accessible.
    • kibana was not accessible.
    • New certificates could not be created.
  • Context:

    • The test cluster deletion script was triggered to delete a test cluster, but the kube context incorrectly targeted the live-1 cluster and deleted some cloud-platform components.
    • Components included the default ingress-controller, prometheus-operator, logging, cert-manager, kiam and external-dns. As the ingress-controller went down, some users could not access services running on the Cloud Platform.
    • Formbuilder services were not accessible even after the ingress-controller was restored.
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • The team prioritised restoring the default ingress-controller. The ingress-controller depends on external-dns to update route53 records with the new NLB, and on kiam to provide AWSAssumeRole for external-dns; these components (ingress-controller, external-dns and kiam) were restored successfully and services started to come back up.
    • Formbuilder services were still pointing to the old NLB (the network load balancer in use before the ingress was replaced). The reason was that the route53 TXT records had an incorrect owner field, so external-dns couldn’t write the new NLB information into the A record (a sketch for checking this follows this entry). The team fixed the owner information in the TXT record, external-dns updated the formbuilder route53 records to point to the new NLB, and the formbuilder services came back up.
    • The team did a targeted apply to restore the remaining components.
    • The apply pipeline was run to restore all the certificates, servicemonitors and prometheus-rules from the environment repository.
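
The TXT owner mismatch can be checked directly in Route53 and against the running external-dns flags. The hosted zone id, record filter and external-dns namespace below are placeholders.

    # external-dns writes a TXT registry record alongside each A record, e.g.
    #   "heritage=external-dns,external-dns/owner=<txt-owner-id>"
    # If that owner id differs from the controller's --txt-owner-id, the A record
    # is not updated.
    aws route53 list-resource-record-sets \
      --hosted-zone-id Z0123456789ABCDEFGHIJ \
      --query "ResourceRecordSets[?Type=='TXT' && contains(Name, 'formbuilder')]"

    # Confirm the owner id the running controller uses (namespace is an assumption).
    kubectl -n kube-system get deployment external-dns -o yaml | grep -- --txt-owner-id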

Incident on 2020-09-07 12:54 - All users are unable to create new ingress rules

  • Key events

    • First detected 2020-09-07 12:39
    • Incident declared 2020-09-07 12:54
    • Resolved 2020-09-07 15:56
  • Time to repair: 3h 02m

  • Time to resolve: 3h 17m

  • Identified: The Ingress API refused 100% of POST requests.

  • Impact:

    • If a user were to provision a new service, they would be unable to create an ingress into the cluster.
  • Context:

    • Version 0.1.0 of the team’s ingress-controller module enabled the creation of a ValidatingWebhookConfiguration resource.
    • By enabling this option we created a single point of failure for all ingress-controller pods in the ingress-controller namespace.
    • A new 0.1.0 ingress controller failed to create in the “live-1” cluster due to AWS resource limits.
    • The validation webhook stopped new ingress rules from being created, with the error: Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post offender-categorisation-prod-nx-controller-admission.ingress-controllers.svc:443/extensions/v1beta1/ingresses?timeout=30s: x509: certificate signed by unknown authority
    • Initial investigation thread: https://mojdt.slack.com/archives/C514ETYJX/p1599478794246900
    • Incident declared: https://mojdt.slack.com/archives/C514ETYJX/p1599479640251900
  • Resolution: The team manually removed all the additional admission controllers created by 0.1.0 (see the sketch below). They then removed the admission webhook from the module and created a new release (0.1.1). All ingress modules on 0.1.0 were upgraded to the new release, 0.1.1.
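
A sketch of the manual removal described in the resolution; the webhook configuration name is illustrative and the delete would be repeated for each affected team controller.

    # Find the admission webhook configurations created by the 0.1.0 module...
    kubectl get validatingwebhookconfigurations

    # ...and delete the broken one (name is a placeholder), unblocking ingress
    # creation until the 0.1.1 module release removes it permanently.
    kubectl delete validatingwebhookconfiguration offender-categorisation-prod-nx-admission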

Incident on 2020-08-25 11:26 - Connectivity issues with eu-west-2a

  • Key events

    • First detected 2020-08-25 11:01
    • Incident declared 2020-08-25 11:26
    • Resolved 2020-08-25 12:11
  • Time to repair: 0h 45m

  • Time to resolve: 1h 10m

  • Identified: The AWS Availability Zone eu-west-2a, which contains some of our kubernetes nodes, had an outage. API latency was elevated, some EC2 instances became unreachable and overall connectivity was unstable.

  • Impact:

    • Two kubernetes nodes became unreachable
    • No new node could be launched in eu-west-2a
    • Kubernetes had issues talking to some of these nodes, preventing some API calls from succeeding (pods were not terminating)
    • New pods were not able to pull their Docker images.
  • Context:

    • Pods and Nodes sitting in other Availability Zones (b & c) were not impacted
    • Slack threads: Issue detected, Incident declared.
    • We now have 25 nodes in the cluster, instead of 21
  • Resolution: The incident was mitigated by deploying 2-4 more nodes in healthy Availability Zones, manually deleting the non-responding pods, and terminating the impacted nodes (see the sketch below)
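
To see which nodes sit in the affected zone, and to clear pods stuck terminating on unreachable nodes, something like the following can be used. The zone label name depends on the kubernetes version of the time, and the pod/namespace names are placeholders.

    # Show nodes per Availability Zone (newer clusters use topology.kubernetes.io/zone).
    kubectl get nodes -L failure-domain.beta.kubernetes.io/zone

    # Force-delete a pod stuck terminating on an unreachable node.
    kubectl -n my-namespace delete pod my-stuck-pod --grace-period=0 --force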

Incident on 2020-08-14 11:01 - Ingress-controllers crashlooping

  • Key events

    • First detected 2020-08-14 10:43
    • Incident declared 2020-08-14 11:01
    • Resolved 2020-08-14 11:38
  • Time to repair: 0h 37m

  • Time to resolve: 0h 55m

  • Identified: There were 6 replicas of the ingress-controller pod and 2 out of the 6 were crashlooping. A restart of the pods did not resolve the issue. As per the normal runbook process, a recycle of all pods was required. However, after restarting pods 4 and 5, they also started to crashloop. The risk was that, when restarting pods 5 and 6, all 6 pods could be down, taking all ingresses for the cluster down with them.

  • Impact:

    • Increased risk of all ingresses in the cluster failing if all 6 ingress-controller pods entered a crashloop state.
  • Context:

  • Resolution: A restart of the leader ingress-controller pod was required so that the other pods in the replica set could connect and get the latest nginx.config file (see the sketch below).
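
A sketch of identifying and restarting the leader pod; the leader-election object and pod names depend on the controller version and release, so they are illustrative.

    # Older nginx ingress-controllers record the leader in a ConfigMap,
    # newer ones in a Lease; check whichever exists.
    kubectl -n ingress-controllers get configmap ingress-controller-leader-nginx -o yaml
    kubectl -n ingress-controllers get lease

    # Deleting the leader pod forces a new election so the remaining replicas
    # can reconnect and pick up the latest nginx config (pod name is a placeholder).
    kubectl -n ingress-controllers delete pod nginx-ingress-acme-controller-abc123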

Incident on 2020-08-07 16:39 - Master node provisioning failure

  • Key events

    • First detected 2020-08-07 15:51
    • Repaired 2020-08-07 16:29
    • Incident declared 2020-08-07 16:39
    • Resolved 2020-08-14 10:06
  • Time to repair: 0h 38m

  • Time to resolve: 33h 15m (during support hours 10:00-17:00 M-F)

  • Identified: Routine replacement of a master node failed because AWS did not have any c4.4xlarge instances available in the relevant availability zone.

  • Impact:

    • Increased risk because the cluster was running on 2 out of 3 master nodes, for a brief period
  • Context:

  • Resolution:

    • A new c4.4xlarge node was successfully (and automatically) launched approx. 40 minutes after we saw the problem
    • We replaced all our master nodes with c5.4xlarge instances, which (currently) have better availability (see the sketch below)
    • We and AWS are still investigating longer-term and more reliable fixes
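
The instance-type swap in the resolution is a standard kops change; a sketch, where the cluster and instance-group names are illustrative.

    export KOPS_CLUSTER_NAME=live-1.cloud-platform.service.justice.gov.uk

    # Change machineType from c4.4xlarge to c5.4xlarge in each master instance group...
    kops get instancegroups
    kops edit ig master-eu-west-2a

    # ...then roll the change out.
    kops update cluster --yes
    kops rolling-update cluster --instance-group master-eu-west-2a --yes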

Q2 2020 (April - June)

  • Mean Time To Repair: 2h 5m

  • Mean Time To Resolve: 15h 53m

Incident on 2020-08-04 17:13

  • Key events

    • Fault occurs 2020-08-04 13:30
    • Fault detected 2020-08-04 18:13
    • Incident declared 2020-08-05 11:04
    • Resolved 2020-08-05 16:16
  • Time to repair: 5h 8m

  • Time to resolve: 9h 16m (during support hours 10:00-17:00)

  • Identified: Integration tests failed for cert-manager, the apply pipeline failed showing that it did not have permissions, and the divergence pipeline showed drift for live-1 components

  • Impact:

    • Increased risk of cluster failure because some of the components did not have the correct configuration needed for the live-1 production cluster
  • Context:

  • Resolution: Compared each resource’s configuration with the terraform state and applied the correct configuration from the code specific to the kops cluster (see the sketch below)
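
Drift like this is usually confirmed and corrected with plain terraform commands; a sketch, with illustrative module addresses.

    # Show divergence between the code, the state and what is actually deployed.
    terraform plan

    # Inspect what terraform manages, then re-apply only the diverged components.
    terraform state list
    terraform apply -target=module.cert_manager -target=module.prometheus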

Incident on 2020-04-15 10:58 - Nginx/TLS

  • Key events

    • Fault occurs 2020-04-15 07:15
    • Fault detected 2020-04-15 13:45
    • Incident declared 2020-04-15 14:39
    • Resolved 2020-04-15 15:09
  • Status: Resolved at 2020-04-15 15:09 UTC

  • Time to repair: 0h 30m

  • Time to resolve: 5h 09m (during support hours 10:00-17:00)

  • Identified: After an upgrade of the Nginx ingresses, support for legacy TLS was dropped.

  • Impact:

    • IE11 users could not access any services running on the Cloud Platform
    • A few teams came forward with the issue:
      • LAA
      • Correspondence Tool
      • Prisoner Money
  • Context:

  • Resolution: The Nginx configuration was modified to enable TLSv1, TLSv1.1 and TLSv1.2 (see the sketch below)
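
The fix corresponds to the standard nginx-ingress ssl-protocols setting; a sketch, where the ConfigMap name and namespace are assumptions (re-enabling TLSv1/1.1 is a deliberate compatibility trade-off).

    # Re-enable the legacy protocol versions alongside TLSv1.2.
    kubectl -n ingress-controllers patch configmap nginx-ingress-controller \
      --type merge \
      -p '{"data":{"ssl-protocols":"TLSv1 TLSv1.1 TLSv1.2"}}'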

Q1 2020 (January - March)

  • Mean Time To Repair: 1h 22m

  • Mean Time To Resolve: 2h 36m

Incident on 2020-02-25 10:58

  • Key events

    • Fault occurs 2020-02-25 07:32
    • Team aware 2020-02-25 07:36
    • Incident declared 2020-02-25 10:58
    • Resolved 2020-02-25 17:07
  • Time to repair: 4h 9m

  • Time to resolve: 7h (during support hours 10:00-17:00)

  • Identified: During an upgrade, new masters were not coming up correctly (missing calico networking and other pods)

  • Impact:

    • Degraded kubernetes API performance (because some API calls were being directed to non-functioning masters)
    • Increased risk of cluster failure, because we were running on a single master during the incident
  • Context:

    • Upgrading from kubernetes 1.13.12 to 1.14.10, kops 1.13.2 to 1.14.1
    • The first master was replaced fine, but the second didn’t have calico and some other essential pods, and was not functioning correctly
    • Attempting to roll back the upgrade, every new master exhibited the same problem
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1582628309085600
  • Resolution: The kube-system namespace has a label, openpolicyagent.org/webhook: ignore. This label tells the Open Policy Agent (OPA) that pods are allowed to run in this namespace on the master nodes. Somehow this label had been removed, so OPA was preventing pods from running on each new master node as it came up, and the new masters were unable to launch essential pods such as calico and fluentd (see the sketch below).
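
Restoring and verifying the label is a one-liner each; this assumes the label key quoted above.

    # Re-apply the label that tells OPA to ignore kube-system, then confirm it.
    kubectl label namespace kube-system openpolicyagent.org/webhook=ignore --overwrite
    kubectl get namespace kube-system --show-labels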

Incident on 2020-02-18 14:13 UTC

  • Key events

    • Fault occurs 2020-02-18 14:13
    • Incident declared 2020-02-18 14:23
    • Resolved 2020-02-18 14:59
  • Time to repair: 0h 36m

  • Time to resolve: 0h 46m

  • Identified: Pingdom reported that Prometheus was down (prometheus.cloud-platform.service.justice.gov.uk).

  • Impact:

    • The prometheus dashboard was unavailable for everyone, for the whole duration of the incident.
    • Between 2020-02-18 14:22 and 2020-02-18 14:26, prometheus could not receive metrics.
  • Context:

    • Although the Prometheus URL was unreachable, Grafana and Alertmanager were resolving.
    • There seemed to be an issue preventing requests from reaching the prometheus pods (a diagnostic sketch follows this entry).
    • Disk space and other resources, the usual suspects, were ruled out as the cause.
    • The domain name and ingress were both valid.
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1582035803248800
  • Resolution: We suspect an intermittent & external networking issue to be the cause of this outage.
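
The kind of outside-in check used here can be sketched as follows; the namespace and label selector are assumptions.

    # Check the public endpoint first...
    curl -sv https://prometheus.cloud-platform.service.justice.gov.uk/-/healthy

    # ...then work inwards: ingress, service endpoints, pods.
    kubectl -n monitoring get ingress
    kubectl -n monitoring get endpoints
    kubectl -n monitoring get pods -l app=prometheus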

Incident on 2020-02-12 11:45 UTC

  • Key events

    • Fault occurs 2020-02-12 11:45
    • Incident declared 2020-02-12 11:51
    • Resolved 2020-02-12 12:07
  • Time to repair: 0h 16m

  • Time to resolve: 0h 22m

  • Identified: Pingdom reported Concourse (concourse.cloud-platform.service.justice.gov.uk) down.

  • Context:

    • One of the engineers was deleting old clusters (running terraform destroy) and wasn’t fully aware of which terraform workspace he was working in. As a result, the terraform destroy deleted the EKS nodes/workers from the manager cluster.
    • Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1581508273080900
  • Resolution: Using terraform (specifically terraform apply -var-file vars/manager.tfvars), the cluster nodes were recreated and the infrastructure was aligned with the desired terraform state (see the sketch below)
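
Two commands that would have surfaced the mistake, plus the restore quoted in the resolution; a sketch only.

    # Always confirm the workspace before a destroy.
    terraform workspace list
    terraform workspace show

    # Recreate the manager cluster nodes from the known-good variables.
    terraform apply -var-file vars/manager.tfvars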

About this incident log

The purpose of publishing this incident log:

  • for the Cloud Platform team to learn from incidents
  • for the Cloud Platform team and its stakeholders to track incident trends and performance
  • because we operate in the open

Definitions:

  • The words used in the timeline of an incident: fault occurs, team becomes aware (of something bad), incident declared (the team acknowledges it and has an idea of the impact), repaired (the system is fully functional again), resolved (fully functional, and future failures are prevented)
  • Incident time - The start of the failure (Before March 2020 it was the time the incident was declared)
  • Time to Repair - The time between the incident being declared (or when the team became aware of the fault) and when service is fully restored. Only includes Hours of Support.
  • Time to Resolve - The time between when the fault occurs and when the system is fully functional (including any immediate work done to prevent future failures). Only includes Hours of Support. This is a broader metric of incident response performance than Time to Repair. (A sketch of the raw duration calculation follows these definitions.)
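
A minimal sketch of the raw elapsed-time calculation behind these metrics, using GNU date; the published figures additionally clip to Hours of Support, which this sketch does not attempt. The timestamps are taken from the 2021-09-30 incident above.

    declared="2021-09-30 17:26"
    repaired="2021-10-01 10:29"

    # Raw minutes between the two timestamps.
    minutes=$(( ( $(date -d "${repaired}" +%s) - $(date -d "${declared}" +%s) ) / 60 ))
    printf 'Raw time to repair: %dh %dm\n' $(( minutes / 60 )) $(( minutes % 60 ))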

Source: Atlassian

Datestamps: please use YYYY-MM-DD HH:MM (almost ISO 8601, but more readable), for the London timezone

Template

Incident on YYYY-MM-DD HH:MM - [Brief description]

  • Key events

    • First detected YYYY-MM-DD HH:MM
    • Incident declared YYYY-MM-DD HH:MM
    • Repaired YYYY-MM-DD HH:MM
    • Resolved YYYY-MM-DD HH:MM
  • Time to repair: Xh Xm

  • Time to resolve: Xh Xm

  • Identified:

  • Impact:

  • Context:

  • Resolution: