Incident on 2023-06-06 - User services down

  • Key events

    • First detected: 2023-06-06 10:26
    • Incident declared: 2023-06-06 11:00
    • Repaired: 2023-06-06 11:21
    • Resolved: 2023-06-06 11:21
  • Time to repair: 0h 55m

  • Time to resolve: 0h 55m

  • Identified: Several users reported that their production pods had been deleted all at once and that they were receiving Pingdom alerts saying their applications were down for a few minutes

  • Impact: User services were down for a few minutes

  • Context:

    • 2023-06-06 10:23 - A user reported that their production pods had been deleted all at once
    • 2023-06-06 10:30 - Users reported that their services were back up and running
    • 2023-06-06 10:30 - The team found that all nodes were being recycled at the same time during the node instance type change
    • 2023-06-06 10:50 - A user reported that the DPS service was down because they could not authenticate to the service
    • 2023-06-06 11:00 - Incident declared
    • 2023-06-06 11:21 - A user reported that the DPS service was back up and running
    • 2023-06-06 11:21 - Incident repaired
    • 2023-06-06 13:11 - Incident resolved
  • Resolution:

    • When the node instance type is changed, all nodes are recycled at the same time. This caused the pods to be deleted all at once.
    • Raised a ticket with AWS asking for the steps to update the node instance type without causing an outage to the services.
    • The instance type update is performed through Terraform, so the team will have to come up with a plan and update the runbook to perform these changes without downtime.
  • Review actions:

    • Add a runbook for the steps to perform when changing the node instance type
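
As a starting point for that runbook, the sketch below contrasts with the all-at-once recycling described above by draining old nodes in small batches. It is a minimal illustration under stated assumptions, not the team's actual procedure: it assumes a Kubernetes cluster managed with `kubectl`, and that replacement nodes of the new instance type are already available before draining starts. The function names, the `max_unavailable` parameter, and the node names are hypothetical. The script only prints commands; it does not run them.

```python
from typing import Iterator, List


def rolling_batches(nodes: List[str], max_unavailable: int = 1) -> Iterator[List[str]]:
    """Yield nodes in batches of at most `max_unavailable` nodes each."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    for start in range(0, len(nodes), max_unavailable):
        yield nodes[start:start + max_unavailable]


def drain_commands(batch: List[str]) -> List[str]:
    """Build the kubectl commands that cordon and then drain one batch."""
    commands = []
    for node in batch:
        # Cordon first so no new pods are scheduled onto the old node.
        commands.append(f"kubectl cordon {node}")
        # Drain evicts the pods so they are rescheduled elsewhere.
        commands.append(
            f"kubectl drain {node} --ignore-daemonsets --delete-emptydir-data"
        )
    return commands


if __name__ == "__main__":
    old_nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
    for batch in rolling_batches(old_nodes, max_unavailable=1):
        for command in drain_commands(batch):
            print(command)
        print("# wait until the evicted pods are running again before continuing")
```

Keeping `max_unavailable` small limits how many pods are evicted at any one time. The pause between batches is what prevents every replica of a service from being removed together.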