Incident on 2022-03-10 - All ingress resources using *.apps.live.cloud-platform urls showing certificate issue
Key events
- First detected 2022-03-10 11:48
- Incident declared 2022-03-10 11:50
- Repaired 2022-03-10 11:56
- Resolved 2022-03-10 11:56
Time to repair: 8m
Time to resolve: 8m
Identified: Users reported in #ask-cloud-platform that they were seeing the following error for CP domain URLs:
Hostname/IP does not match certificate's altnames
Impact: All ingress resources using the *.apps.live.cloud-platform.service.justice.gov.uk domain had mismatched certificates.
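To illustrate why clients rejected the certificate, here is a minimal Python sketch of matching a hostname against a certificate's DNS names. It assumes a simplified rule (a leading "*." covers exactly one DNS label) and is not any particular TLS library's implementation. The hostname "myapp" is a hypothetical example; the two wildcard names are taken from this report.

```python
def matches(hostname: str, pattern: str) -> bool:
    """Return True if hostname matches one certificate DNS name.

    Simplified rule (an assumption for this sketch): a leading "*" label
    matches exactly one label; all other labels must be equal.
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    if pattern_labels[0] == "*":
        return host_labels[1:] == pattern_labels[1:]
    return host_labels == pattern_labels


def covered(hostname: str, dns_names: list[str]) -> bool:
    """Return True if any DNS name on the certificate covers the hostname."""
    return any(matches(hostname, name) for name in dns_names)


# Hypothetical application hostname on the live cluster.
HOST = "myapp.apps.live.cloud-platform.service.justice.gov.uk"

# DNS name applied by the faulty change, and the expected live wildcard.
FAULTY = ["*.apps.yy-1003-0100.cloud-platform.service.justice.gov.uk"]
EXPECTED = ["*.apps.live.cloud-platform.service.justice.gov.uk"]

if __name__ == "__main__":
    print("faulty certificate covers host:", covered(HOST, FAULTY))
    print("expected certificate covers host:", covered(HOST, EXPECTED))
```

With the test cluster's wildcard as the only DNS name, no hostname under apps.live can match, which is consistent with every ingress on the default certificate being affected at once.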
Context:
- Occurred immediately following a terraform apply to a test cluster
- The change amended the default certificate of the live cluster to *.apps.yy-1003-0100.cloud-platform.service.justice.gov.uk.
- Timeline: timeline for the incident
- Slack thread: #ask-cloud-platform for the incident
Resolution:
- The immediate repair was to perform an inline edit of the default certificate in live, adding the wildcard dnsNames *.apps.live, *.live, *.apps.live-1 and *.live-1 to the default certificate, i.e. reverting the faulty change.
- Further investigation found that the cause of the incident was the environment variable KUBE_CONFIG, which was set to a config path that had the live context set.
- The terraform kubectl provider used to apply kubectl_manifest resources uses the environment variables KUBECONFIG and KUBE_CONFIG_PATH, but it was found that it can also use the variable KUBE_CONFIG, which caused the certificate to be applied to the wrong cluster.
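A pre-flight check along the following lines could flag this situation before a terraform apply. This is a minimal Python sketch, not the team's tooling: the list of variables comes from the findings above, while the function names and example paths are hypothetical.

```python
import os

# Environment variables that, per the findings above, the terraform kubectl
# provider may read a kubeconfig path from.
KUBECONFIG_VARS = ("KUBECONFIG", "KUBE_CONFIG_PATH", "KUBE_CONFIG")


def kubeconfig_sources(environ) -> dict:
    """Return {variable: path} for every kubeconfig variable that is set."""
    return {name: environ[name] for name in KUBECONFIG_VARS if environ.get(name)}


def unexpected_kubeconfigs(environ, expected_path: str) -> list:
    """List every set variable whose path differs from the intended kubeconfig."""
    problems = []
    for name, path in kubeconfig_sources(environ).items():
        if path != expected_path:
            problems.append(f"{name} points at {path}, expected {expected_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical path to the test cluster's kubeconfig.
    intended = "/tmp/test-cluster/kubeconfig"
    for problem in unexpected_kubeconfigs(os.environ, intended):
        print("WARNING:", problem)
```

The check compares every variable rather than only the documented ones, since the incident was caused by a variable that was not expected to be read.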
Review actions:
- Ticket raised to configure kubectl provider to use data source #3589