Incident on 2021-07-12 - All ingress resources using *.apps.live-1 domain names stopped working

  • Key events

    • First detected 2021-07-12 15:44
    • Repaired 2021-07-12 15:51
    • Incident declared 2021-07-12 16:09
    • Resolved 2021-07-13 11:49
  • Time to repair: 0h 07m

  • Time to resolve: 20h 03m

  • Identified: A user reported in #ask-cloud-platform an error from the APM monitoring platform Sentry: "Hostname/IP does not match certificate's altnames"

  • Impact: All ingress resources using the *.apps.live-1.cloud-platform.service.justice.gov.uk domain served a mismatched certificate.

  • Context:

    • Occurred immediately following an upgrade to the default certificate of “live” clusters (PR here: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller/pull/20)
    • The change amended the default certificate in the live-1 cluster to *.apps.manager.cloud-platform.service.justice.gov.uk.
    • Timeline: timeline
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution:

    • The immediate repair was simple: an inline edit of the default certificate in live-1, replacing the word manager with live-1, i.e. reverting the faulty change.
    • Further investigation found that the cause of the incident was an underlying bug in the infrastructure apply pipeline used to perform a terraform apply against manager.
    • This bug had existed since the pipeline was created but had never surfaced.
    • The pipeline uses an environment variable named KUBE_CTX to switch context between clusters. This works for resources that use the terraform provider, but not for null_resources, so the change in the above PR was applied to the wrong cluster.
  • Review actions:

    • Provide guidance on namespace-to-namespace traffic: use network policy, not ingress (and advertise it to users). Ticket #3082
    • Monitor the certificate: use Kuberhealthy to monitor key things, including the cert. This could replace several of the integration tests that take longer. Ticket #3044
    • The canary app should alert in #high-priority-alerts if it has been down for 2 minutes. DONE in PR #5126
    • Fix the pipeline: in the cloud-platform-cli, create an assertion to ensure the cluster name is equal to the terraform workspace name, to prevent null_resources acting on the wrong cluster. PR exists
    • Created a ticket to migrate all terraform null_resources within our modules to the terraform kubectl provider
    • Created a ticket to set terraform kubernetes credentials dynamically (at execution time)
    • Fix the pipeline: before any Terraform resources are created, add a function in the cli to perform a kubectl context switch to the correct cluster. PR exists
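
The assertion described in the review actions (cluster name must equal the terraform workspace name) could look roughly like the sketch below. It is an illustration only, not the check implemented in the cloud-platform-cli: the function and exception names are hypothetical, and it assumes the kubectl context name and the terraform workspace name are both the cluster's short name (for example live-1 or manager). If the context name takes another form, the comparison would need adapting.

```python
import subprocess


class ClusterMismatchError(RuntimeError):
    """Raised when kubectl and terraform point at different clusters."""


def assert_cluster_matches_workspace(kube_context: str, workspace: str) -> None:
    """Refuse to continue unless the kubectl context matches the workspace.

    Assumption: both values are the cluster's short name, e.g. "manager".
    """
    if kube_context.strip() != workspace.strip():
        raise ClusterMismatchError(
            f"kubectl context {kube_context!r} does not match "
            f"terraform workspace {workspace!r}; refusing to apply"
        )


def current_kube_context() -> str:
    # `kubectl config current-context` prints the name of the active context.
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def current_terraform_workspace() -> str:
    # `terraform workspace show` prints the name of the selected workspace.
    result = subprocess.run(
        ["terraform", "workspace", "show"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def main() -> None:
    # Run this before `terraform apply`, so that anything shelling out to
    # kubectl (such as a null_resource) cannot act on another cluster.
    assert_cluster_matches_workspace(
        current_kube_context(), current_terraform_workspace()
    )
    print("kubectl context and terraform workspace agree")
```

Calling main() as a pipeline step before the apply would stop the run with a ClusterMismatchError in the situation described above, where the workspace was manager but kubectl was still pointed at live-1.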