Incident on 2020-08-07 - Master node provisioning failure

  • Key events

    • First detected 2020-08-07 15:51
    • Repaired 2020-08-07 16:29
    • Incident declared 2020-08-07 16:39
    • Resolved 2020-08-14 10:06
  • Time to repair: 0h 38m

  • Time to resolve: 33h 15m (during support hours 10:00-17:00 M-F)

  • Identified: Routine replacement of a master node failed because AWS did not have any c4.4xlarge instances available in the relevant availability zone.

  • Impact:

    • Increased risk: for a brief period (until the replacement node launched), the cluster was running on only 2 of its 3 master nodes
  • Context:

  • Resolution:

    • A new c4.4xlarge node was successfully (and automatically) launched approx. 40 minutes after we saw the problem
    • We replaced all our master nodes with c5.4xlarge instances, which (currently) have better availability
    • We and AWS are still investigating longer-term, more reliable fixes
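
The "Time to repair" figure in the key events is the wall-clock difference between the "First detected" and "Repaired" timestamps. As a minimal sketch of that arithmetic (the helper name `elapsed` and the timestamp format string are illustrative assumptions, not part of any incident tooling):

```python
from datetime import datetime

# Assumed timestamp layout, matching the key events listed in this report.
FMT = "%Y-%m-%d %H:%M"


def elapsed(start: str, end: str) -> str:
    """Return the wall-clock time between two timestamps as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"


# Timestamps taken from the key events above.
first_detected = "2020-08-07 15:51"
repaired = "2020-08-07 16:29"

time_to_repair = elapsed(first_detected, repaired)
print(time_to_repair)  # 0h 38m
```

This helper measures plain elapsed time only. It does not apply the support-hours window used for the "Time to resolve" figure.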