Making changes to EKS node groups, instance types, or launch templates
You may need to make a change to an EKS cluster node group, instance type config, or launch template. Any of these changes forces the recycling of all nodes in a node group.
⚠️ Warning ⚠️ We need to be careful during this process: bringing up too many new nodes at once can cause node-level issues allocating IPs to pods.
We also can't let terraform apply these changes, because terraform doesn't gracefully roll the old nodes over to the new ones: it will bring down all of the old nodes immediately, which will cause outages for users.
Process for recycling all nodes in a cluster
Briefly:
- Add the new node group - configured with low starting `minimum` and `desired` node counts - alongside the existing node groups in code (typically suffixed with the date of the changes). Make sure to amend both `default` and `monitoring` if recycling all nodes.
- Drain the old node group using the `cordon-and-drain` pipeline and allow the autoscaler to add new nodes to the new node group.
- Once workloads have moved over, remove the old node groups, and update the `minimum` and `desired` node counts for the new node group in code.
In more detail:
- Add a new node group with your updated changes.
- Re-run the infrastructure-account/terraform-apply pipeline to update the Modsecurity Audit logs cluster. This maps roles to both the old and new node group IAM roles.
  - This is to avoid losing modsec audit logs from the new node group.
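If you prefer to trigger the pipeline from the command line rather than the Concourse UI, something like the following fly invocation should work. This is a hedged sketch: the Concourse target name (`cp`) is an assumption, and the pipeline/job split of `infrastructure-account/terraform-apply` may differ in your Concourse setup.

```bash
# Trigger the terraform-apply job and watch it run (target name "cp" is an assumption)
fly -t cp trigger-job -j infrastructure-account/terraform-apply --watch
```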
Note: If recycling multiple clusters, the order is to drain `manager` `default-ng` (⚠️ must be done from local terminal ⚠️), then `monitoring`. After that, `live-2`, then `live`. Recycle `monitoring` before `default`.
- Lookup the old node group name (you can find this in the AWS GUI).
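As a quicker alternative to the AWS GUI, you can list the node groups from the CLI (assuming your AWS credentials are configured; `manager` is used here as an example cluster name):

```bash
# List the node groups on a cluster so you can identify the old one
aws eks list-nodegroups --region eu-west-2 --cluster-name manager
```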
- Cordon and drain the old node group following the instructions below:
  - for the `manager` cluster, `default-ng` node group (these commands will cause Concourse to experience a brief outage, as Concourse workers move from the old node group to the new node group):
    - disable auto scaling for the node group:
      - This prevents new nodes spinning up in response to nodes being removed.
```bash
ASG_NAME=$(aws eks --region eu-west-2 describe-nodegroup --cluster-name $KUBECONFIG_CLUSTER_NAME --nodegroup-name $NODE_GROUP_TO_DRAIN | jq -r ".nodegroup.resources.autoScalingGroups[0].name")
aws autoscaling suspend-processes --auto-scaling-group-name $ASG_NAME
aws autoscaling create-or-update-tags --tags ResourceId=$ASG_NAME,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=false,PropagateAtLaunch=true
```
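As an optional sanity check (not part of the documented process), you can confirm the ASG's scaling processes really are suspended before starting the drain:

```bash
# Confirm the ASG's scaling processes are suspended before draining
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  | jq '.AutoScalingGroups[0].SuspendedProcesses'
```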
    - Kick off the process of draining the nodes:
```bash
# Clean up any pods in a Failed state across all namespaces before draining
kubectl get pods --field-selector="status.phase=Failed" -A --no-headers \
  | awk '{print $2 " -n " $1}' \
  | parallel -j1 --will-cite kubectl delete pod "{= uq =}"

# Drain each node in the old node group, oldest first, with a 5 minute delay between nodes
kubectl get nodes -l eks.amazonaws.com/nodegroup=$NODE_GROUP_TO_DRAIN \
  --sort-by=metadata.creationTimestamp --no-headers \
  | awk '{print $1}' \
  | parallel -j1 --keep-order --delay 300 --will-cite \
    cloud-platform cluster recycle-node --name {} --skip-version-check --kubecfg $KUBECONFIG --drain-only --ignore-label
```
    - Once this command has run and all of the `manager` cluster node group's nodes have drained, run the command to scale the node group down to 1:
      - This will delete all of the nodes except the most recently drained node, which will be removed in a later step when the node group is deleted in code.
```bash
aws autoscaling resume-processes --auto-scaling-group-name $ASG_NAME
aws eks --region eu-west-2 update-nodegroup-config \
  --cluster-name manager \
  --nodegroup-name $NODE_GROUP_TO_DRAIN \
  --scaling-config maxSize=1,desiredSize=1,minSize=1
```
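If you want to verify the scale-down took effect (an optional check, not part of the documented process), the node group's scaling config should now report 1 across the board:

```bash
# The old node group should now report min/desired/max of 1
aws eks describe-nodegroup --region eu-west-2 \
  --cluster-name manager \
  --nodegroup-name $NODE_GROUP_TO_DRAIN \
  | jq '.nodegroup.scalingConfig'
```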
  - for all other node groups:
Note: When making changes to the default node group in `live`, it's handy to pause the pipelines for each of our environments for the duration of the change.

```bash
cloud-platform pipeline cordon-and-drain --cluster-name <cluster_name> --node-group <old_node_group_name>
```
⚠️ Warning ⚠️ Because this command runs remotely in Concourse, it can't be used to drain `default-ng` on the `manager` cluster. It must be run locally while your context is set to the correct cluster.
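One way to check and set your local context is sketched below; your team's kubeconfig setup may differ, and the cluster name is an example.

```bash
# Check which cluster your local kubeconfig currently points at
kubectl config current-context

# If needed, point your local context at the manager cluster
aws eks update-kubeconfig --region eu-west-2 --name manager
```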
Note: The above `cloud-platform` CLI command runs this script.
- Raise a new PR deleting the old node group.
- Re-run the infrastructure-account/terraform-apply pipeline again to update the Modsecurity Audit logs cluster to map roles with only the new node group IAM role.
- Run the integration tests to ensure the cluster is healthy.
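Alongside the integration tests, a quick manual sanity check from the command line can help confirm the cluster has settled (a sketch, not part of the documented process):

```bash
# All nodes should report Ready, with none SchedulingDisabled, once the recycle is complete
kubectl get nodes

# Any pods not Running or Completed may need a closer look
kubectl get pods -A --no-headers | grep -vE 'Running|Completed'
```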
Notes:
- The `cloud-platform pipeline` command [cordons-and-drains-nodes] in a given node group waits 5 minutes between each drained node.
- If you can avoid it, try not to fiddle around with the target node group in the AWS console (for example, reducing the desired node count): AWS deletes nodes in an unpredictable way, which might cause the pipeline command to fail. It is possible if you need to, though.
Useful commands:
k9s
A useful CLI tool for getting a good overview of the state of the cluster. Useful commands for monitoring a cluster are listed here.
kubectl
```bash
watch kubectl get nodes --sort-by=.metadata.creationTimestamp
```
The above command will output all of the nodes like this:
```
NAME                                           STATUS                     ROLES    AGE   VERSION
ip-172-20-124-118.eu-west-2.compute.internal   Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-101-81.eu-west-2.compute.internal    Ready,SchedulingDisabled   <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-119-182.eu-west-2.compute.internal   Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-106-20.eu-west-2.compute.internal    Ready                      <none>   47h   v1.22.15-eks-fb459a0
ip-172-20-127-1.eu-west-2.compute.internal     Ready                      <none>   47h   v1.22.15-eks-fb459a0
```
Monitoring nodes
Where nodes have a status of `Ready,SchedulingDisabled`, this indicates that the nodes are cordoned off and will no longer schedule pods. Only outdated nodes (those with old templates) should adopt this status. Nodes in a `Ready` state will schedule pods. These should be any 'old template' nodes that haven't yet been cordoned, or any 'new template' nodes.
When all nodes have been recycled, they will all have a status of `Ready`.
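A quick way to keep an eye on how many nodes are still cordoned is to count nodes by status (a minimal sketch using standard kubectl output):

```bash
# Count nodes by status to see how many are still Ready,SchedulingDisabled
kubectl get nodes --no-headers | awk '{print $2}' | sort | uniq -c
```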
The `cordon-and-drain` pipeline takes 5 minutes per node, so it takes approximately 1 hour per 12 nodes. Expect a process that involves making changes to multiple clusters, including `live`, to take a whole day.
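To get a rough estimate for a specific node group, you can count its nodes and multiply by the 5-minute delay. This sketch assumes the old node group name is in `$NODE_GROUP_TO_DRAIN`, as in the commands earlier in this page:

```bash
# Rough drain-time estimate for the old node group at ~5 minutes per node
NODE_COUNT=$(kubectl get nodes -l eks.amazonaws.com/nodegroup=$NODE_GROUP_TO_DRAIN --no-headers | wc -l)
echo "Approximately $((NODE_COUNT * 5)) minutes to drain $NODE_COUNT nodes"
```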