Incident on 2023-09-18 - Lack of Disk space on nodes

  • Key events

    • First detected: 2023-09-18 13:42
    • Incident declared: 2023-09-18 15:12
    • Repaired: 2023-09-18 17:54
    • Resolved: 2023-09-20 19:18
  • Time to repair: 4h 12m

  • Time to resolve: 53h 36m
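The two durations can be recomputed from the key-event timestamps. The following is a minimal sketch, assuming both are measured from the "First detected" time; the helper name `elapsed` is illustrative and not part of any team tooling.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # timestamp format used in the key events list


def elapsed(start: str, end: str) -> str:
    """Return the time between two timestamps formatted as 'Hh MMm'."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


detected = "2023-09-18 13:42"
print("Time to repair: ", elapsed(detected, "2023-09-18 17:54"))
print("Time to resolve:", elapsed(detected, "2023-09-20 19:18"))
```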

  • Identified: A user reported ImagePull errors with a "no space left on device" message

  • Impact: Several nodes in the cluster ran short of disk space. Deployments might not be scheduled consistently and could fail.

  • Context:

    • 2023-09-18 13:42 Team noticed a RootVolUtilisation-Critical alert in the High-priority-alert channel
    • 2023-09-18 14:03 A user reported ImagePull errors with a "no space left on device" message
    • 2023-09-18 14:27 Team was upgrading the EKS module to version 18 and draining the nodes, and saw numerous pods in the Evicted and ContainerStatusUnknown states
    • 2023-09-18 15:12 Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1695046332665969
    • 2023-09-18 15:26 Compared the disk size allocated on an old node and a new node and found that the new node was allocated only 20GB of disk space
    • 2023-09-18 15:34 Old default node group uncordoned
    • 2023-09-18 15:35 Started draining the new nodes to shift workloads back to the old node group
    • 2023-09-18 17:54 Incident repaired
    • 2023-09-19 10:30 Team started validating the fix and understanding the launch_template changes
    • 2023-09-20 10:00 Team updated the fix on manager and later on live cluster
    • 2023-09-20 12:30 Started draining the old node group
    • 2023-09-20 15:04 Increase in the number of pods stuck in the “ContainerCreating” state
    • 2023-09-20 15:25 Increase in the number of "failed to assign an IP address to container" ENI errors. The CNI logs showed "Unable to get IP address from CIDR: no free IP available in the prefix". The team concluded this was likely caused by IP prefix starvation, with some addresses being freed as the old nodes drained.
    • 2023-09-20 19:18 All nodes drained and no pods in an errored state. The initial disk space issue is resolved
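The two error messages quoted in the timeline can be counted in collected log lines to gauge how widespread IP prefix starvation is. This is a minimal sketch: the function name, the category names, and the sample lines are hypothetical, and only the matched substrings come from the incident itself. It assumes the logs are already available as plain text lines.

```python
from collections import Counter

# Substrings copied from the errors quoted in the timeline.
PATTERNS = {
    "ip_assignment_failed": "failed to assign an IP address to container",
    "prefix_exhausted": "no free IP available in the prefix",
}


def count_ip_errors(lines):
    """Count log lines that contain each known IP-starvation message."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "Unable to get IP address from CIDR: no free IP available in the prefix",
    "failed to assign an IP address to container",
    "unrelated log line",
]
print(dict(count_ip_errors(sample)))
```

A non-zero count for either category could feed an alert threshold; what threshold is appropriate is left open here.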
  • Resolution:

    • Team identified that the node disk space was reduced from 100GB to 20GB as part of the EKS module version 18 change
    • Identified the code changes to the launch template and applied the fix
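A comparison of this kind can be automated. The sketch below assumes each launch template version is available as a dictionary shaped like the `LaunchTemplateData` returned by the EC2 `DescribeLaunchTemplateVersions` API (a `BlockDeviceMappings` list with `DeviceName` and `Ebs.VolumeSize`). The function names and the `/dev/xvda` device name are illustrative assumptions, not taken from the team's code. A mapping that disappears entirely is also reported, since the node would then fall back to whatever default size applies.

```python
def volume_sizes(launch_template_data):
    """Map device name -> EBS volume size in GiB for one template version."""
    sizes = {}
    for mapping in launch_template_data.get("BlockDeviceMappings", []):
        ebs = mapping.get("Ebs", {})
        if "VolumeSize" in ebs:
            sizes[mapping["DeviceName"]] = ebs["VolumeSize"]
    return sizes


def volume_regressions(old, new):
    """Return devices whose volume shrank or lost its explicit size."""
    new_sizes = volume_sizes(new)
    regressions = {}
    for device, old_size in volume_sizes(old).items():
        new_size = new_sizes.get(device)  # None if the mapping was dropped
        if new_size is None or new_size < old_size:
            regressions[device] = (old_size, new_size)
    return regressions


# Hypothetical data mirroring the 100GB -> 20GB change in this incident.
old = {"BlockDeviceMappings": [{"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 100}}]}
new = {"BlockDeviceMappings": [{"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 20}}]}
print(volume_regressions(old, new))
```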
  • Review actions:

    • Update runbook to compare launch template changes during EKS module upgrade
    • Create a test setup that pulls images of different sizes, similar to live
    • Update RootVolUtilisation alert runbook to check disk space config
    • Scale CoreDNS dynamically based on the number of nodes
    • Investigate whether we can use IPv6 to solve the IP prefix starvation problem
    • Add drift testing to identify when a Terraform plan shows a change to the launch template
    • Set up logging to view CNI and ipamd logs, and set up alerts for errors related to IP prefix starvation
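The drift-testing action could start from the JSON form of a Terraform plan (as produced by `terraform show -json`), which lists each planned change under `resource_changes`. The sketch below flags any `aws_launch_template` resource whose planned actions are not a no-op. The function name and the sample resource address are hypothetical, and the sample plan is trimmed to the fields the check reads.

```python
import json


def launch_template_changes(plan):
    """Return (address, actions) for each launch template the plan would change."""
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") == "aws_launch_template" and actions != ["no-op"]:
            changes.append((rc["address"], actions))
    return changes


# Hypothetical, trimmed plan output for illustration only.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "module.eks.aws_launch_template.example",
     "type": "aws_launch_template",
     "change": {"actions": ["update"]}},
    {"address": "aws_iam_role.example",
     "type": "aws_iam_role",
     "change": {"actions": ["no-op"]}}
  ]
}
""")
print(launch_template_changes(sample_plan))
```

A check like this could fail a pipeline step whenever the returned list is non-empty, prompting a manual review of the launch template diff before the upgrade proceeds.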