
Making changes to EKS node groups, instance types, or launch templates

You may need to make a change to an EKS cluster's node groups, instance type configuration, or launch template. Any of these changes forces a recycle of all nodes in a node group.

⚠️ Warning ⚠️ Take care during this process: bringing up too many new nodes at once can cause node-level problems allocating IPs to pods.

We also can't let Terraform apply these changes, because Terraform doesn't gracefully roll out the replacement of old nodes with new ones. It brings down all of the old nodes immediately, which will cause outages for users.

Process for recycling all nodes in a cluster

Briefly:

  1. Add the new node group - configured with low starting minimum and desired node counts - alongside the existing node groups in code (typically suffixed with the date of the changes)
    • Make sure to amend both default and monitoring if recycling all nodes
  2. Drain the old node group using the cordon-and-drain pipeline and allow the autoscaler to add new nodes to the new node group
  3. Once workloads have moved over, remove the old node groups, and update the minimum and desired node counts for the new node group in code.

In more detail:

  1. Add a new node group with your updated changes.
  2. Re-run the infrastructure-account/terraform-apply pipeline to update the Modsecurity Audit logs cluster, so that roles are mapped to both the old and new node group IAM roles.

    • This is to avoid losing modsec audit logs from the new node group.

    Note:

    If recycling multiple clusters, the order is: manager first (drain default-ng, which ⚠️ must be done from a local terminal ⚠️, then monitoring), followed by live-2, then live. Within live-2 and live, recycle monitoring before default.

  3. Look up the old node group name (you can find this in the AWS console).
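
    Alternatively, you can list a cluster's node groups from the command line (the cluster name here is illustrative):

    aws eks list-nodegroups --cluster-name manager --region eu-west-2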

  4. Cordon and drain the old node group following the instructions below:

    • for the manager cluster's default-ng node group (these commands will cause a brief Concourse outage, as the Concourse workers move from the old node group to the new one):

      • disable auto scaling for the node group:
        • This prevents new nodes spinning up in response to nodes being removed
      ASG_NAME=$(aws eks --region eu-west-2 describe-nodegroup --cluster-name $KUBECONFIG_CLUSTER_NAME --nodegroup-name $NODE_GROUP_TO_DRAIN  | jq -r ".nodegroup.resources.autoScalingGroups[0].name")
      aws autoscaling suspend-processes --auto-scaling-group-name $ASG_NAME
      
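      • Tag the ASG so that the cluster autoscaler treats it as disabled: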
      aws autoscaling create-or-update-tags --tags ResourceId=$ASG_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=false,PropagateAtLaunch=true
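
      • Optionally, confirm that the scaling processes are suspended:
      aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names $ASG_NAME \
        | jq '.AutoScalingGroups[0].SuspendedProcesses'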
      
      • Kick off the process of draining the nodes. First, delete any pods already in a Failed state, since these can interfere with the drain:
      kubectl get pods --field-selector="status.phase=Failed" -A --no-headers \
        | awk '{print $2 " -n " $1}' \
        | parallel -j1 --will-cite kubectl delete pod "{= uq =}"
      
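      • Then drain each node in the old node group, oldest first, with a 5 minute delay between nodes: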
      kubectl get nodes -l eks.amazonaws.com/nodegroup=$NODE_GROUP_TO_DRAIN \
        --sort-by=metadata.creationTimestamp --no-headers \
        | awk '{print $1}' \
        | parallel -j1 --keep-order --delay 300 --will-cite \
        cloud-platform cluster recycle-node --name {} --skip-version-check --kubecfg $KUBECONFIG --drain-only --ignore-label
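
      • From another terminal, you can watch the old node group shrink as its nodes drain:
      watch kubectl get nodes -l eks.amazonaws.com/nodegroup=$NODE_GROUP_TO_DRAIN --sort-by=.metadata.creationTimestamp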
      
      • Once this command has run and all of the manager cluster node group’s nodes have drained, run the command to scale the node group down to 1
        • This will delete all of the nodes except the most recently drained node, which will be removed in a later step when the node group is deleted in code.
        aws autoscaling resume-processes --auto-scaling-group-name $ASG_NAME
      
        aws eks --region eu-west-2 update-nodegroup-config \
          --cluster-name manager \
          --nodegroup-name $NODE_GROUP_TO_DRAIN \
          --scaling-config maxSize=1,desiredSize=1,minSize=1
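
      • Optionally, check that the new scaling configuration has been applied:

        aws eks --region eu-west-2 describe-nodegroup \
          --cluster-name manager \
          --nodegroup-name $NODE_GROUP_TO_DRAIN \
          | jq '.nodegroup.scalingConfig'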
      
    • for all other node groups:

    Note: when making changes to the default node group in live, it's worth pausing the pipelines for each of our environments for the duration of the change, as shown below.
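
    Assuming you have the fly CLI set up with a target for Concourse, pipelines can be paused and resumed like this (the target and pipeline names are illustrative):

    fly -t manager pause-pipeline -p environments-live
    # ...make the change, then...
    fly -t manager unpause-pipeline -p environments-live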

    cloud-platform pipeline cordon-and-drain --cluster-name <cluster_name> --node-group <old_node_group_name>
    

    ⚠️ Warning ⚠️ Because this command runs remotely in Concourse, it can't be used to drain default-ng on the manager cluster, where the Concourse workers themselves run. That node group must be drained locally, with your kubectl context set to the manager cluster.

    Note: The above cloud-platform CLI command runs this script.
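
    For example (the cluster and node group names here are illustrative):

    cloud-platform pipeline cordon-and-drain --cluster-name live-2 --node-group ng-default-2024-09-13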

  5. Raise a new PR deleting the old node group.

  6. Re-run the infrastructure-account/terraform-apply pipeline again to update the Modsecurity Audit logs cluster, mapping roles to only the new node group IAM role.

  7. Run the integration tests to ensure the cluster is healthy.

Notes:

  • The cloud-platform pipeline cordon-and-drain command waits 5 minutes between each node it drains in a given node group.
  • Try not to fiddle with the target node group in the AWS console while the pipeline is running (for example by reducing the desired node count): AWS deletes nodes in an unpredictable order, which might cause the pipeline command to fail. It is possible if you really need to, though.

Useful commands:

k9s

A useful CLI tool that gives a good overview of the state of the cluster. Useful commands for monitoring a cluster are listed here.
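
Some views that are handy while recycling nodes (standard k9s resource views):

  :nodes     shows nodes, including cordoned (SchedulingDisabled) ones
  :pods      shows pods as they reschedule onto the new node group
  :events    shows cluster events, e.g. scheduling or IP allocation errors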

kubectl

  • watch kubectl get nodes --sort-by=.metadata.creationTimestamp

The above command will output all of the nodes like this:

NAME                                           STATUS                     ROLES    AGE   VERSION
ip-172-20-124-118.eu-west-2.compute.internal   Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-101-81.eu-west-2.compute.internal    Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-119-182.eu-west-2.compute.internal   Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-106-20.eu-west-2.compute.internal    Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-127-1.eu-west-2.compute.internal     Ready                      <none>   47h   v1.22.15-eks-fb459a0

Monitoring nodes

A status of Ready,SchedulingDisabled indicates that a node has been cordoned and will no longer schedule pods. Only outdated nodes (those using the old template) should adopt this status. Nodes in a Ready state will still schedule pods; these should be 'old template' nodes that haven't yet been cordoned, or 'new template' nodes.
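
To quickly count how many nodes are still cordoned, something like the following works:

  kubectl get nodes --no-headers | awk '$2 == "Ready,SchedulingDisabled"' | wc -l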

When all nodes have been recycled, every node will have a status of Ready.

The cordon-and-drain pipeline waits 5 minutes per node, so allow approximately 1 hour per 12 nodes. Expect a change that touches multiple clusters, including live, to take a whole day.

This page was last reviewed on 13 September 2024. It needs to be reviewed again on 13 March 2025 by the page owner #cloud-platform.