Upgrade EKS cluster

The Cloud Platform EKS cluster upgrade involves upgrading any of the below:

  • Upgrade EKS Terraform Module
  • Upgrade EKS version (Control Plane and Node Groups)
  • Upgrade addon(s)
  • Upgrade AMI version

The Cloud Platform EKS clusters are created using the official terraform-aws-eks module. The EKS version and addons are currently independent of the version of the terraform-aws-eks module, so an upgrade of the EKS version does not always require an upgrade of the terraform-aws-eks module or the addons. Check the changelogs for the terraform-aws-eks module, the EKS version and the addons when planning an upgrade.

Pre-requisites

Before you begin, there are a few pre-requisites:

  • Your GPG key must be added to the infrastructure repo so that you can run git-crypt unlock.

  • You have the AWS CLI profile moj-cp with suitable credentials.

  • You have terraform and docker installed.

  • Review the changelog of the Kubernetes release and the EKS release you are planning to upgrade to.

  • Review the official EKS upgrading a cluster document for any extra steps that are a part of a specific EKS release.

  • Run kubent against the cluster to find deprecated APIs.

Upgrade Steps

Compatibility Check

The following areas need to be looked into to determine if there’s any additional preparation work to do:

  • Kubernetes API Deprecations/Removals
  • EKS module
  • EKS addons
  • Components

Tools:

For Kubernetes API deprecations or removals you can use kubent and pluto to scan the cluster and find if there are any resources impacted in upcoming releases.

From the AWS console you can also see “Upgrade Insights”, which has a breakdown of API deprecations and removals. You can drill down into specific versions and see the resources affected. In particular, the User Agent field here can be useful for tracking down the services making the API calls.

Sometimes the User Agent ID isn’t clear enough to immediately identify which resource is affected. If this is the case, it’s worth cross-checking component or helm chart versions. Additionally, you can head over to CloudWatch > Log groups > /aws/eks/[cluster-name], view the kube-apiserver-audit logs and filter by the userAgent field, which can help determine the source of the API calls.
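As a rough illustration of filtering audit logs by the userAgent field, the sketch below counts which user agents touched a given API group and version. It assumes the audit events have been exported from CloudWatch as one JSON object per line; the function name count_user_agents is hypothetical, while userAgent, objectRef.apiGroup and objectRef.apiVersion are fields of the standard Kubernetes audit event.

```python
import json
from collections import Counter


def count_user_agents(audit_lines, api_group, api_version):
    """Count user agents that called the given API group/version.

    audit_lines: iterable of strings, each one exported audit event (JSON).
    Returns (userAgent, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in audit_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON (assumed export noise)
        if not isinstance(event, dict):
            continue
        ref = event.get("objectRef") or {}
        if ref.get("apiGroup", "") == api_group and ref.get("apiVersion") == api_version:
            counts[event.get("userAgent", "<unknown>")] += 1
    return counts.most_common()
```

For example, count_user_agents(lines, "policy", "v1beta1") would list the clients still calling policy/v1beta1, which can then be matched against component or helm chart versions.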

Users will need to be notified if their resources are affected by API deprecations or removals.
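To work out who needs notifying, the findings can be grouped by namespace. The sketch below is one way to do that from the output of kubent -o json. The field names (Name, Namespace, Kind, ApiVersion) and the "<undefined>" placeholder for cluster-scoped resources are assumptions about kubent's JSON output and should be checked against the version you run; group_findings_by_namespace is a hypothetical helper name.

```python
import json
from collections import defaultdict


def group_findings_by_namespace(kubent_json):
    """Group deprecated-API findings by namespace.

    kubent_json: JSON text, assumed to be a list of finding objects.
    Returns {namespace: ["Kind/Name (ApiVersion)", ...]} sorted by namespace.
    """
    grouped = defaultdict(list)
    for item in json.loads(kubent_json):
        namespace = item.get("Namespace") or "<undefined>"
        if namespace == "<undefined>":
            # Assumed placeholder for resources that have no namespace.
            namespace = "<cluster-scoped>"
        grouped[namespace].append(
            f"{item.get('Kind', '?')}/{item.get('Name', '?')} "
            f"({item.get('ApiVersion', '?')})"
        )
    return {ns: sorted(found) for ns, found in sorted(grouped.items())}
```

Each key in the result is a namespace whose owners may need to be contacted before the upgrade.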

Preparing for upgrade

Communication is an important part of the upgrade procedure. Make sure to update #ask-cloud-platform and #cloud-platform-update when commencing the upgrade. Create a thread in #cloud-platform to keep the team updated on the current status of the upgrade.

Pause the following pipelines:

  • bootstrap
  • infrastructure-live
  • infrastructure-live-2
  • infrastructure-manager

Update cluster.tf in cloud-platform-infrastructure with the version of Kubernetes you are upgrading to.

Run a tf plan against the cluster you're upgrading and check that everything is as expected. The only changes should be to resources relating to the version upgrade.

IMPORTANT: Do not run tf apply, as it will most likely time out and fail. Upgrades are carried out manually through the AWS Console.
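A plan can be easier to review in machine-readable form. The sketch below is a minimal example, assuming the plan was saved with terraform plan -out=plan.tfplan and converted with terraform show -json plan.tfplan. It only reads the plan and never applies anything. The file name and the changed_resources helper are hypothetical; resource_changes, address and change.actions are part of Terraform's JSON plan representation.

```python
import json


def changed_resources(plan_json):
    """Return (address, actions) for every resource the plan would change.

    plan_json: text produced by `terraform show -json <planfile>`.
    Resources whose only action is "no-op" or "read" are skipped.
    """
    plan = json.loads(plan_json)
    changes = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if actions and actions not in (["no-op"], ["read"]):
            changes.append((resource["address"], actions))
    return changes


if __name__ == "__main__":
    import sys

    # Usage: terraform show -json plan.tfplan | python this_script.py
    for address, actions in changed_resources(sys.stdin.read()):
        print(",".join(actions), address)
```

Anything in the output that does not relate to the version upgrade is worth investigating before continuing.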

Monitoring the upgrade

Before you start the upgrade it is useful to have a few monitoring resources up and running so you can catch any issues quickly.

k9s is a useful tool to have open in a few terminal windows. The following views are helpful:

  • nodes - see nodes recycling and coming up with new version
  • events - check to see if there are any errors
  • pods - you can use vim style searching to see pods in Error state.
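Alongside the k9s nodes view, progress can be summarised by counting nodes per kubelet version. This is a small sketch that assumes the input is the parsed output of kubectl get nodes -o json; kubelet_versions is a hypothetical helper name, and status.nodeInfo.kubeletVersion is the standard field on a Node object.

```python
from collections import Counter


def kubelet_versions(node_list):
    """Count nodes per kubelet version.

    node_list: dict parsed from `kubectl get nodes -o json`.
    Returns {kubeletVersion: number_of_nodes}.
    """
    versions = Counter(
        node["status"]["nodeInfo"]["kubeletVersion"]
        for node in node_list.get("items", [])
    )
    return dict(versions)
```

Once recycling has finished, the result should contain only the new version.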

When a node group version changes, all of the nodes in that group are recycled. While recycling nodes, AWS will not evict a pod if doing so would violate its PodDisruptionBudget (PDB). When that happens, the update stalls and the remaining nodes are not recycled.

To rectify this, run the script mentioned in the Recycle-all-nodes Gotchas section.
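To see in advance which PDBs could stall the update, it can help to list those that currently allow no disruptions. The sketch below is not the script referred to above; it is a separate, hypothetical helper that assumes its input is the parsed output of kubectl get pdb --all-namespaces -o json and relies on the standard status.disruptionsAllowed field.

```python
def blocking_pdbs(pdb_list):
    """List PodDisruptionBudgets that currently allow zero disruptions.

    pdb_list: dict parsed from `kubectl get pdb --all-namespaces -o json`.
    Returns sorted "namespace/name" strings.
    """
    blocked = []
    for pdb in pdb_list.get("items", []):
        # A missing status is treated as "no disruptions allowed".
        allowed = pdb.get("status", {}).get("disruptionsAllowed", 0)
        if allowed == 0:
            meta = pdb["metadata"]
            blocked.append(f"{meta['namespace']}/{meta['name']}")
    return sorted(blocked)
```

Pods covered by a PDB in this list cannot be evicted until the budget allows it, so these are the likely places for a node to get stuck.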

This Kibana dashboard is used to monitor IP address assignment for pods when they are rescheduled. A spike in errors could mean that IP addresses are being exhausted while pods are scheduled.

Starting the upgrade

As with preparing for the upgrade, communication is really important. Keep the thread in #cloud-platform as up to date as possible.

Increasing coredns pods

To ensure that coredns stays up and running during the cluster upgrade, scale it up to 10 replicas.
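One way to do the scaling is with kubectl scale. The sketch below only builds and prints the command so it can be reviewed before being run. It assumes coredns runs as a Deployment named coredns in the kube-system namespace, which is the usual EKS layout but should be confirmed on your cluster; coredns_scale_command is a hypothetical helper name.

```python
import shlex


def coredns_scale_command(replicas):
    """Build the kubectl command that scales the coredns Deployment."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return [
        "kubectl",
        "--namespace", "kube-system",  # assumed namespace for coredns
        "scale", "deployment", "coredns",  # assumed Deployment name
        f"--replicas={replicas}",
    ]


if __name__ == "__main__":
    # Print rather than execute, so the command can be checked first.
    print(shlex.join(coredns_scale_command(10)))
```

Note the replica count in use before scaling up, so that the same helper can restore it once the upgrade is finished.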

Upgrading the control plane

Log in to the AWS console and select the EKS cluster we’re going to upgrade.

In the top right corner there should be a button called Upgrade now. Click it, ensure the correct Kubernetes version is selected, then press Update.

Control plane updates usually take 10 minutes to run.
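Besides watching the console, the control plane state can be checked from the CLI. The sketch below assumes its input is the parsed JSON from aws eks describe-cluster --name [cluster-name] --profile moj-cp, and treats the upgrade as complete when the cluster status is ACTIVE and the reported version matches the target. The helper name control_plane_upgraded is hypothetical.

```python
def control_plane_upgraded(describe_cluster_response, target_version):
    """Return True once the cluster is ACTIVE on the target version.

    describe_cluster_response: dict parsed from `aws eks describe-cluster`.
    target_version: version string such as "1.28".
    """
    cluster = describe_cluster_response.get("cluster", {})
    return (
        cluster.get("status") == "ACTIVE"
        and cluster.get("version") == target_version
    )
```

While the update is in progress the status is reported as UPDATING, so the function returns False until it has finished.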

Upgrading the monitoring node group

From the cluster control panel select the Compute tab.

Select Upgrade now next to the monitoring node group.

For update strategy select “Force update”

Click Update

Upgrading the default node group

From the cluster control panel select the Compute tab.

Select Upgrade now next to the default node group.

For update strategy select “Force update”

Click Update

Once the upgrade has completed, notify the Slack channels.

Finishing the upgrade

Create a new pull request in the cloud-platform-infrastructure repo with the updated version strings.

Unpause the following pipelines in this order and check to make sure no changes are present:

  1. infrastructure-live-2
  2. infrastructure-manager
  3. infrastructure-live

If none of the pipelines show any terraform changes, the PR can be merged.

Unpause the bootstrap pipeline.

Scale down the coredns pods.

Finishing touches

The kubectl version in the cloud-platform-cli and cloud-platform-tools-image needs updating to match the current Kubernetes version.
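A quick check can confirm whether the kubectl shipped in those tools is still acceptable for the upgraded cluster. The sketch below assumes the usual Kubernetes version skew policy, under which kubectl is supported within one minor version of the API server, and takes version strings such as those printed by kubectl version. Both helper names are hypothetical.

```python
import re


def parse_minor(version):
    """Extract (major, minor) from strings like 'v1.28.5-eks-abc' or '1.28'."""
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def kubectl_within_skew(client_version, cluster_version):
    """True if kubectl is within one minor version of the cluster."""
    client_major, client_minor = parse_minor(client_version)
    cluster_major, cluster_minor = parse_minor(cluster_version)
    return client_major == cluster_major and abs(client_minor - cluster_minor) <= 1
```

A False result means the kubectl version needs updating; matching the cluster version exactly, as described above, remains the aim.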

Documentation used as part of the upgrade should be reviewed and refined if needed.

This page was last reviewed on 24 January 2024. It needs to be reviewed again on 24 April 2024 by the page owner #cloud-platform.