Incident on 2021-05-10 - Apply Pipeline downtime due to accidental destroy of Manager cluster

  • Key events

    • First detected 2021-05-10 12:15
    • Incident not declared at the time, but later agreed to have been one
    • Repaired 2021-05-10 16:48
    • Resolved 2021-05-11 10:00
  • Time to repair: 4h 33m

  • Time to resolve: 21h 45m

  • Identified: A CP team member ran ‘terraform destroy components’, intending to destroy a test cluster, but ran it against the Manager cluster by mistake. They were immediately aware of the error.

  • Impact:

    • Users couldn’t create or change their namespace definitions or AWS resources, due to Concourse being down
  • Context:

  • Resolution:

    • Manager cluster was recreated.
    • During the rebuild we hit a certificate issue with Concourse, so Concourse was restored manually. The Terraform for the Manager cluster had become out of date.
    • Route53 zones were hard-coded and had to be changed manually.
  • Actions following review:

    • Spike ways to avoid applying to the wrong cluster - see 3 options above. Ticket #3016
    • Try the ‘prevent_destroy’ lifecycle setting on the Route53 zone - Ticket #2899
    • Disband the cloud-platform-concourse repository, which includes service accounts and pipelines. We should split this repository up and move its contents to the infra/terraform-concourse repos. Ticket #3017
    • Manager needs to use our PSPs instead of eks-privilege - this has already been done.
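
As an illustration of the first action (avoiding a destroy against the wrong cluster), here is a minimal sketch of a pre-destroy guard. It is not one of the options the team considered, and every name in it is hypothetical: the `destroy_allowed` function, the `PROTECTED_WORKSPACES` set and the idea that the cluster is identified by a Terraform workspace name are all assumptions made for the example.

```python
# Hypothetical set of workspace names that must never be destroyed casually.
# "manager" is an assumed name for the Manager cluster's workspace.
PROTECTED_WORKSPACES = {"manager"}


def destroy_allowed(workspace: str, typed_name: str,
                    protected: set[str] = PROTECTED_WORKSPACES) -> bool:
    """Return True only if a destroy may proceed against `workspace`.

    Unprotected workspaces (for example short-lived test clusters) pass
    straight through. Protected ones require the operator to have typed
    the workspace name exactly, so a destroy aimed at the wrong cluster
    is stopped before Terraform runs.
    """
    if workspace not in protected:
        return True
    return typed_name == workspace
```

A wrapper script could obtain the current workspace from `terraform workspace show`, prompt the operator for `typed_name` when the workspace is protected, and only then call `terraform destroy`. With this sketch, `destroy_allowed("manager", "")` returns `False`, while a test cluster name passes without a prompt.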