Incident on 2021-07-12 - All ingress resources using *.apps.live-1 domain names stopped working
Key events
- First detected 2021-07-12 15:44
- Repaired 2021-07-12 15:51
- Incident declared 2021-07-12 16:09
- Resolved 2021-07-13 11:49
Time to repair: 0h 07m
Time to resolve: 20h 03m
Identified: User reported in #ask-cloud-platform an error from the APM monitoring platform Sentry:
Hostname/IP does not match certificate's altnames
Impact: All ingress resources using the *.apps.live-1.cloud-platform.service.justice.gov.uk domain had mismatched certificates.
Context:
- Occurred immediately following an upgrade to the default certificate of “live” clusters (PR here: https://github.com/ministryofjustice/cloud-platform-terraform-ingress-controller/pull/20)
- The change amended the default certificate in the live-1 cluster to *.apps.manager.cloud-platform.service.justice.gov.uk.
- Timeline: timeline
- Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
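The certificate error reported by Sentry follows from how wildcard certificate names are matched. The sketch below is a simplified illustration, not a full implementation of TLS hostname verification: it assumes the wildcard may only appear as the left-most label and covers exactly one label. The application name "myapp" is hypothetical.

```python
def matches_san(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches a certificate name pattern.

    Simplified rule: a leading "*" label matches exactly one DNS label;
    all remaining labels must match case-insensitively.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        # The wildcard covers only the left-most label.
        return bool(host_labels[0]) and host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


LIVE_1_CERT = "*.apps.live-1.cloud-platform.service.justice.gov.uk"
MANAGER_CERT = "*.apps.manager.cloud-platform.service.justice.gov.uk"
# "myapp" is a hypothetical application name used only for illustration.
HOST = "myapp.apps.live-1.cloud-platform.service.justice.gov.uk"

print(matches_san(HOST, LIVE_1_CERT))   # True
print(matches_san(HOST, MANAGER_CERT))  # False
```

With the manager certificate installed as the default, no live-1 hostname can match it, which is consistent with every ingress on the domain failing at once.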
Resolution:
- The immediate repair was simple: perform an inline edit of the default certificate in live-1, replacing the word manager with live-1, i.e. reverting the faulty change.
- Further investigation ensued, finding the cause of the incident was actually an underlying bug in the infrastructure apply pipeline used to perform a terraform apply against manager.
- This bug had been around from the creation of the pipeline but had never surfaced.
- The pipeline uses an environment variable named KUBE_CTX to context switch between clusters. This works for resources using the terraform provider, but not for null_resources, causing the change in the above PR to apply to the wrong cluster.
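The KUBE_CTX behaviour can be illustrated with a short sketch. It assumes the null_resources shell out to kubectl, which does not read KUBE_CTX and instead uses the current context of the kubeconfig unless --context is passed. The helper name kubectl_command, the context name and the file name are hypothetical; this is one possible way for a wrapper script to honour the variable, not the fix the team applied.

```python
import os
from typing import List, Mapping, Optional


def kubectl_command(args: List[str], env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build a kubectl command line pinned to the context named in KUBE_CTX.

    Raises RuntimeError rather than silently falling back to whatever the
    kubeconfig's current context happens to be.
    """
    env = os.environ if env is None else env
    context = env.get("KUBE_CTX", "").strip()
    if not context:
        raise RuntimeError("KUBE_CTX is not set; refusing to use the current context")
    # --context is a global kubectl flag that overrides the current context.
    return ["kubectl", "--context", context, *args]


# Hypothetical context and file names, for illustration only.
cmd = kubectl_command(["apply", "-f", "default-certificate.yaml"], env={"KUBE_CTX": "manager"})
print(cmd)
```

Pinning the context on every invocation means a stale current context in the kubeconfig cannot redirect the command to another cluster.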
Review actions:
- Provide guidance on namespace-to-namespace traffic: use network policy, not ingress (and advertise it to users). Ticket #3082
- Monitoring the cert: use Kuberhealthy to monitor key things, including the cert. This could replace several of the integration tests that take longer. Ticket #3044
- Canary app should raise an alert in #high-priority-alerts after 2 minutes if it goes down. DONE in PR #5126
- Fix the pipeline: in the cloud-platform-cli, create an assertion to ensure the cluster name is equal to the terraform workspace name, to prevent null_resources from acting on the wrong cluster. PR exists
- Created a ticket to migrate all terraform null_resources within our modules to terraform kubectl provider
- Created a ticket to set terraform kubernetes credentials dynamically (at execution time)
- Fix the pipeline: Before the creation of Terraform resources, add a function in the cli to perform a kubectl context switch to the correct cluster. PR exists
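The two pipeline fixes above (asserting that the cluster matches the workspace, and switching context before creating resources) can be sketched as a pre-flight check. This is a minimal illustration in Python, not the code in the cloud-platform-cli PRs. The naming rule (a context is the workspace name, optionally followed by a dot-separated domain suffix) and all function names are assumptions.

```python
import subprocess
from typing import Tuple


def context_matches_workspace(context: str, workspace: str) -> bool:
    """Assumed naming rule: the context equals the workspace name,
    or starts with the workspace name followed by a dot."""
    context, workspace = context.strip(), workspace.strip()
    if not workspace:
        return False
    return context == workspace or context.startswith(workspace + ".")


def assert_cluster_matches_workspace(context: str, workspace: str) -> None:
    """Stop the run before any resource is created on the wrong cluster."""
    if not context_matches_workspace(context, workspace):
        raise RuntimeError(
            f"kubectl context {context!r} does not match Terraform workspace {workspace!r}"
        )


def read_current_state() -> Tuple[str, str]:
    """Ask terraform and kubectl for the active workspace and context."""
    workspace = subprocess.run(
        ["terraform", "workspace", "show"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    context = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return context, workspace


def switch_context(context: str) -> None:
    """Switch the kubeconfig's current context to the given name."""
    subprocess.run(["kubectl", "config", "use-context", context], check=True)
```

A pipeline step might call read_current_state() and pass the result to assert_cluster_matches_workspace() before running terraform apply. Under the assumed naming rule, a context such as live-1.cloud-platform.service.justice.gov.uk (hypothetical) paired with the workspace manager would raise an error instead of applying the change.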