
Recycle-all-nodes

Updating a launch template causes all of the nodes to recycle. The most likely reason to update the launch template is to edit the User data script, which is run when a node boots. Reasons to edit User data include:

  • Changes to docker auth credentials
  • Changes to eks-bootstrap-env.sh
  • Changes to kubelet environment variables

Recycling process

Avoid letting Terraform apply EKS-level changes directly, because Terraform can start by deleting all the current nodes and only then recreate them, causing an outage for users.

High level method

  1. In code, add the new node group with a low number of nodes alongside the existing node groups.
  2. Drain the old node group using the pipeline and allow the autoscaler to bring new nodes into the new node group.
  3. Once workloads have moved over, remove the old node groups from code.

Detailed instructions can be found here.
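
While the old node group drains, it can help to see how many nodes sit in each node group. The following is a minimal sketch, not part of the documented pipeline. It assumes EKS managed node groups, which label each node with eks.amazonaws.com/nodegroup, and the helper name nodes_per_group is hypothetical.

```shell
# Count nodes per node group.
# Reads `kubectl get nodes --no-headers -L eks.amazonaws.com/nodegroup` output
# on stdin; the -L flag appends the label value as the last column.
# Assumption: every node carries the label (true for EKS managed node groups).
nodes_per_group() {
  awk '{ count[$NF]++ } END { for (g in count) printf "%s %d\n", g, count[g] }' | sort
}

# Against a live cluster (assumes kubectl points at the right context):
#   kubectl get nodes --no-headers -L eks.amazonaws.com/nodegroup | nodes_per_group
```

As the drain progresses, the count for the old node group should fall while the count for the new one rises.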

Useful commands

k9s is a useful CLI tool for getting a good overview of the state of the cluster.

  • watch kubectl get nodes --sort-by=.metadata.creationTimestamp

The above command will output all of the nodes like this:

NAME                                           STATUS                     ROLES    AGE   VERSION
ip-172-20-124-118.eu-west-2.compute.internal   Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-101-81.eu-west-2.compute.internal    Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-119-182.eu-west-2.compute.internal   Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-106-20.eu-west-2.compute.internal    Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-127-1.eu-west-2.compute.internal     Ready                      <none>   47h   v1.22.15-eks-fb459a0

Nodes with the status “Ready,SchedulingDisabled” are the old nodes, built from the previous launch template, and have been cordoned. Nodes with the status “Ready” are the new nodes, built from the updated launch template.

When all nodes have been recycled they will all have a status of “Ready”.
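
To track progress without reading the whole node list, the two statuses can be counted. This is a minimal sketch that only parses the STATUS column of kubectl get nodes output; the helper name recycle_progress is hypothetical.

```shell
# Summarise recycling progress.
# Reads `kubectl get nodes --no-headers` output on stdin; column 2 is STATUS.
recycle_progress() {
  awk '
    $2 ~ /SchedulingDisabled/ { old++ }   # cordoned nodes still to be recycled
    $2 == "Ready"             { new++ }   # schedulable nodes
    END { printf "cordoned=%d ready=%d\n", old, new }
  '
}

# Against a live cluster (assumes kubectl points at the right context):
#   kubectl get nodes --no-headers | recycle_progress
```

For the sample output above this prints cordoned=2 ready=3. Recycling is complete when cordoned reaches 0.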

This process can take several hours on a cluster of ~60 nodes, depending on how quickly you resolve the gotchas below.

This page was last reviewed on 16 August 2024. It needs to be reviewed again on 16 February 2025 by the page owner #cloud-platform.