Incident on 2020-08-07 - Master node provisioning failure
Key events
- First detected 2020-08-07 15:51
- Repaired 2020-08-07 16:29
- Incident declared 2020-08-07 16:39
- Resolved 2020-08-14 10:06
Time to repair: 0h 38m
Time to resolve: 33h 15m (during support hours 10:00-17:00 M-F)
Identified: Routine replacement of a master node failed because AWS did not have any c4.4xlarge instances available in the relevant availability zone.
Impact:
- Increased risk: for a brief period (approx. 40 minutes) the cluster was running on only 2 of its 3 master nodes
Context:
- Lack of availability of a given instance type is not a failure mode we had planned for
- In theory, if a problem kills each master node in turn, and instances of the required type are not available in at least 2 availability zones, this could bring down the whole cluster (see the sketch after this list)
- Timeline: https://docs.google.com/document/d/1SOAOeL-89cuK-_fJbtgYcArInWQY7UXiIDY7wN5gjuA/edit# and https://docs.google.com/document/d/1kxKwC1B_pnlPbysS0zotbXMKyZcUDmDtnGbEyIHGvgQ/edit#heading=h.z3py6eydx4qu
- Slack thread: https://mojdt.slack.com/archives/C514ETYJX/p1596814746202600
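As a rough illustration of the risk above, the sketch below (Python with boto3; not part of our existing tooling, and the region and instance types are assumptions for illustration) lists which availability zones in a region offer a given instance type. Note that this API only reports whether a type is offered in an AZ at all, not whether capacity is free at a given moment, so a capacity shortfall like the one in this incident would still only surface when a launch fails with an InsufficientInstanceCapacity error.

```python
# Minimal sketch: list the availability zones that offer a given instance type.
# Assumptions for illustration: region eu-west-2, types c4.4xlarge / c5.4xlarge.
import boto3


def azs_offering(instance_type, region="eu-west-2"):
    """Return the set of AZs in `region` that currently offer `instance_type`."""
    ec2 = boto3.client("ec2", region_name=region)
    resp = ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )
    return {o["Location"] for o in resp["InstanceTypeOfferings"]}


if __name__ == "__main__":
    for itype in ("c4.4xlarge", "c5.4xlarge"):
        azs = sorted(azs_offering(itype))
        # Fewer than 2 AZs offering the type matches the risk described above.
        status = "OK" if len(azs) >= 2 else "AT RISK"
        print(f"{itype}: offered in {azs} ({status})")
```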
Resolution:
- A new c4.4xlarge node was successfully (and automatically) launched approx. 40 minutes after we saw the problem
- We replaced all our master nodes with c5.4xlarge instances, which (currently) have better availability
- We and AWS are still investigating longer-term and more reliable fixes