Skip to main content

Incident on 2020-08-25 - Connectivity issues with eu-west-2a

  • Key events

    • First detected 2020-08-25 11:01
    • Incident declared 2020-08-25 11:26
    • Resolved 2020-08-25 12:11
  • Time to repair: 0h 45m

  • Time to resolve: 1h 10m

  • Identified: The AWS Availability Zones eu-west-2a, which contain some of our kubernetes nodes had an outage. API latency was elevated, some EC2 became unreachable and overall connectivity was unstable.

  • Impact:

    • Two kubernetes nodes became unreachable
    • No new node could be launched in eu-west-2a
    • Kubernetes had issues talking to some of these nodes, preventing some API calls to succeed (Pods were not terminating)
    • New pods were not able to pull their Docker images.
  • Context:

    • Pods and Nodes sitting in other Availability Zones (b & c) were not impacted
    • Slack threads: Issue detected, Incident Declared,
    • We now have 25 pods in the cluster, instead of 21
  • Resolution: The incident was mitigated by deploying more 2-4 nodes in healthy Availability Zones, manually deleting the non-responding pods, and terminating the impacted nodes