Incident on 2020-09-28 - Termination of nodes updating kops Instance Group
Key events
- First detected 2020-09-28 13:14
- Incident declared 2020-09-28 14:05
- Repaired 2020-09-28 14:20
- Resolved 2020-09-28 14:44
Time to repair: 0h 15m
Time to resolve: 1h 30m
Identified: Periods of downtime while the cloud-platform team was applying the per-Availability-Zone instance group change for worker nodes in live-1. Failures were caused mainly by terminating a group of 9 nodes at once and leaving kops to handle the cycling of pods, which meant it took a very long time for new containers to be created in the new node group.
Impact:
- Some users noticed their pods being cycled, with containers taking a long time to be created.
- Prometheus/Alertmanager/Kibana health-check failures.
- Users noticed short-lived Pingdom alerts & health-check failures.
Context:
- The kops node group (nodes-1.16.13) had its minSize updated from 25 to 18 nodes and `kops update cluster --yes` was run; this terminated 9 nodes from the existing worker node group (nodes-1.16.13) at once.
- Pods sat in Pending status for a long time, waiting to be scheduled onto the new nodes.
- Teams running their own ingress-controller had only 1 replica in non-prod namespaces, causing some Pingdom alerts & health-check failures.
- Timeline: Timeline for the incident.
- Slack thread: Slack thread for the incident.
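The change described above can be sketched as follows. This is a minimal illustration, not the exact commands run during the incident: the instance group name and the 25 → 18 size change come from the report, the flags are standard kops CLI usage, and the commands are echoed rather than executed because no cluster is available here.

```shell
#!/bin/sh
# Sketch of the change that triggered the incident (commands echoed, not run).
IG="nodes-1.16.13"

# Edit the instance group spec, reducing minSize from 25 to 18 nodes.
echo "kops edit ig $IG"

# Apply the change. With --yes, kops terminated the 9 surplus nodes in one
# go, so all of their pods had to be rescheduled in a single burst.
echo "kops update cluster --yes"
```

Because the terminations happened together, every evicted pod competed for scheduling at the same time, which is what made the container-creation delays so visible.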
Resolution:
- This was resolved by cordoning and draining the nodes one by one before deleting the instance group, so pods were rescheduled gradually rather than all at once.
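The one-by-one procedure can be sketched as below. Node names are hypothetical, the kubectl/kops commands are echoed rather than executed, and `--delete-local-data` reflects the kubectl flag of the 1.16 era; this is an illustration of the approach, not a transcript of the actual remediation.

```shell
#!/bin/sh
# Sketch of the remediation: cordon and drain each node one at a time
# before deleting the instance group. Node names are made up.
NODES="ip-10-0-1-10 ip-10-0-2-11 ip-10-0-3-12"

for node in $NODES; do
  # Mark the node unschedulable so no new pods land on it.
  echo "kubectl cordon $node"
  # Evict its pods gracefully (respects PodDisruptionBudgets); daemonset
  # pods are skipped since they are recreated on every node anyway.
  echo "kubectl drain $node --ignore-daemonsets --delete-local-data"
  # In a real run, wait here until the evicted pods are Running on other
  # nodes before moving on to the next node.
done

# Only once every node is drained is the old instance group deleted.
echo "kops delete ig nodes-1.16.13 --yes"
```

Draining serially keeps only one node's worth of pods in flight at any moment, which avoids the scheduling burst that caused the health-check failures.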