
Upgrade a cluster


When you start working on an upgrade ticket:

  • Check which versions are available to upgrade to
  • Check whether the current cluster is upgradeable to that version (skipping a major version is not recommended)
  • Check the release notes of the kops and kubernetes version for any breaking changes
  • Check and add notes for every kops version between your cluster's current kops version and the version you want to upgrade to
  • Check and add notes for every kubernetes version between your cluster's current kubernetes version and the version you want to upgrade to (there might be changes between minor versions which are not mentioned in the major version CHANGELOG)
  • Review the CHANGELOG notes with another member of the team (check for breaking changes, deprecations and suggested alternatives for any addons/components)


Pre-requisites

  • Your GPG key must be added to the infrastructure repo so that you are able to run git-crypt unlock

  • You have the AWS CLI profile moj-cp with suitable credentials

  • You have docker installed

Setup environment

Setup the environment variables listed in example.env.create-cluster in the infrastructure repo.
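The variables can be exported in one go by sourcing the file with `set -a`; a minimal sketch, using a stand-in file and an example variable name (the real file is example.env.create-cluster in the infrastructure repo):

```shell
# `set -a` exports every variable assigned while it is active, so sourcing
# the env file exports its contents to child processes (kops, aws, etc.).
printf 'KOPS_STATE_STORE=s3://example-state-store\n' > example.env.sample  # stand-in file
set -a
. ./example.env.sample
set +a
echo "$KOPS_STATE_STORE"
```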

Run the integration tests

Run the integration test pipeline and wait for the tests to pass. This will ensure the live cluster is functioning as expected.

Sanity checks

Ensure Concourse pipelines are green, and pause all of them.

Take a snapshot of:

  • all non-running pods; you can do this by running kubectl get pods -A | grep -v "Running".
  • triggered Prometheus alarms; take a screenshot of the current triggered alarms.
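A hedged sketch of capturing the pod snapshot to a timestamped file so it can be compared after the upgrade (the filename is an assumption; the alert screenshot still has to be taken manually):

```shell
ts=$(date +%Y-%m-%d-%H%M)
# Non-running pods; `|| true` keeps the script going when the list is empty.
kubectl get pods -A | grep -v "Running" > "non-running-pods-${ts}.txt" || true
echo "snapshot written to non-running-pods-${ts}.txt"
```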

Run a shell in the tools image

The cloud platform tools image has all the software required to update a cluster.

The image will need updates for new client tool versions, see this PR for an example. Make the required changes and create a new release (which will use a github action to push the new image to DockerHub).

From a local copy of the cloud-platform-infrastructure repository, run the following command (after editing the makefile to ensure that you’re using the new version of the tools image):

make tools-shell

Run the upgrade

Authenticate to the cluster

Create the file ~/.kube/config in your tools-image container by running:

kops export kubecfg

Update the live cluster manifest

Open the cloud-platform-infrastructure repository and make the following changes to the kops/live-1.yaml manifest:

  • The Kubernetes release version, ensuring it is supported by Kops - verify on the kops releases page.
  • The AMI in each instance group. This can be found here

Make the same changes to the template file here, and create a new release of the module.

The new release must also be referenced in our definition/instance file.

Push changes to kops state

Run the following to push the above changes to the kops state store in S3, and preview the resulting cluster changes:

kops replace -f kops/live-1.yaml
kops update cluster

Review the preview, then apply the changes with:

kops update cluster --yes

Perform a rolling-update

This is a delicate procedure, so this article will proceed with extreme caution.

First, get the instance groups:

kops get ig

Update each master individually, previewing the change first and then re-running with --yes to apply it:

kops rolling-update cluster --instance-group <master-instance-group>
kops rolling-update cluster --instance-group <master-instance-group> --yes

Kops will safely eject the instance from the cluster and bring back a new one.

Once the first new master is in a ready state, run the fast integration tests with:

cd cloud-platform-infrastructure/smoke-tests
make test-fast

This will take around 30 seconds, and all the tests should pass.

During a previous upgrade, the kube-system namespace label was removed, which resulted in the OPA webhook blocking us from applying daemonsets to the master node. If make test-fast fails for that reason, apply the label back from here

Perform the same process on the other two masters.

Take stock of current running worker nodes

Create a nodes file listing all the current worker nodes:

kubectl get nodes | grep node | sed 's/ .*//' > nodes
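To illustrate what that pipeline extracts, here is the same grep/sed applied to hypothetical kubectl get nodes output (node names and versions are made up):

```shell
# grep keeps rows whose ROLES column is "node" (masters are dropped);
# sed strips everything after the first space, leaving just the node name.
printf '%s\n' \
  'NAME                                        STATUS  ROLES   AGE  VERSION' \
  'ip-172-20-1-1.eu-west-2.compute.internal    Ready   node    10d  v1.18.12' \
  'ip-172-20-2-2.eu-west-2.compute.internal    Ready   master  10d  v1.18.12' |
  grep node | sed 's/ .*//' > nodes.sample
cat nodes.sample   # only the worker node name remains
```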

Create new worker instance groups

To ensure minimal application downtime, we’re going to create new worker instance groups with the changes we applied to the live manifest above:

kops create instancegroup nodes-<kubernetes-version>-eu-west-2a
kops create instancegroup nodes-<kubernetes-version>-eu-west-2b
kops create instancegroup nodes-<kubernetes-version>-eu-west-2c
kops create instancegroup 2xlarge-nodes-<kubernetes-version>     # <--- Monitoring/Prometheus nodes
kops update cluster
kops update cluster --yes

We use the kubernetes-version to identify the new node group.
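kops create instancegroup opens an editor with a manifest to fill in; a hedged sketch of what a new worker group might look like (the cluster name, AMI, machine type and sizes below are placeholders, not values from this cluster):

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: live-1.example   # placeholder cluster name
  name: nodes-1.19-eu-west-2a
spec:
  image: <new-ami-id>        # the AMI chosen above
  machineType: r5.xlarge     # placeholder
  maxSize: 10                # placeholder
  minSize: 3                 # placeholder
  role: Node
  subnets:
  - eu-west-2a
```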

New instances will be created, which you’ll see with:

watch kubectl get nodes

Cordon old workers

Create a script to cordon the old nodes, using the file we created earlier:

cat nodes | sed 's/^/kubectl cordon /' > cordon_nodes

Execute the new file:

bash cordon_nodes
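Shown on a stand-in nodes file, the generated script is just one cordon command per node, which is worth reviewing before execution (the node name here is hypothetical):

```shell
# Stand-in for the real `nodes` file created earlier.
printf 'ip-172-20-1-1.eu-west-2.compute.internal\n' > nodes.demo
sed 's/^/kubectl cordon /' nodes.demo > cordon_nodes.demo
cat cordon_nodes.demo   # review the commands before executing
```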

Scale the number of ingress controllers

To ensure ingress controllers are running on the new nodes run:

kubectl -n ingress-controllers edit deployment nginx-ingress-acme-ingress-nginx-controller

…and double the number of replicas.
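A hedged, non-interactive alternative to editing the deployment: compute the doubled count and use kubectl scale. The current replica count below is an assumed example; read the real one from the deployment first.

```shell
current=3   # assumed; in practice read it with: kubectl -n ingress-controllers get deploy nginx-ingress-acme-ingress-nginx-controller -o jsonpath='{.spec.replicas}'
doubled=$((current * 2))
# Print the scale command for review; drop the echo to run it for real.
echo kubectl -n ingress-controllers scale deployment nginx-ingress-acme-ingress-nginx-controller --replicas="${doubled}"
```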

Drain the old worker nodes

Create a script to drain the old nodes, using the file we created earlier:

cat nodes | sed 's/^/kubectl drain /' | sed 's/$/ --ignore-daemonsets --delete-local-data/' > drain_nodes

Execute the new file:

bash drain_nodes

Each node will drain all pods, moving them onto the new workers.
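If you prefer to drain gradually, a hedged sketch that generates a script draining one node at a time with a pause in between (the 120s pause is an arbitrary choice, and the nodes file here is a stand-in):

```shell
# Stand-in nodes file; the real one comes from the earlier kubectl step.
printf 'node-a\nnode-b\n' > nodes.demo
while read -r node; do
  echo "kubectl drain ${node} --ignore-daemonsets --delete-local-data"
  echo "sleep 120   # let pods reschedule before draining the next node"
done < nodes.demo > drain_nodes_staged.demo
```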

Validate the cluster

Ensure all nodes have joined the cluster:

kops validate cluster
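kops validate cluster fails until every node is Ready, so it is convenient to retry it; a small hedged wrapper (the attempt limit and interval are arbitrary):

```shell
# Usage: retry kops validate cluster
retry() {
  n=0
  until "$@"; do
    n=$((n + 1))
    [ "$n" -ge 20 ] && return 1   # give up after 20 attempts
    sleep 30
  done
}
```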

Run the integration tests again

Run the integration test pipeline and wait for the tests to pass. This will ensure the live cluster is functioning as expected.

Delete the old instance groups

Once all pods have migrated onto the new workers, delete the old instance groups:

kops delete instancegroup nodes-<old-kubernetes-version>-eu-west-2a
kops delete instancegroup nodes-<old-kubernetes-version>-eu-west-2b
kops delete instancegroup nodes-<old-kubernetes-version>-eu-west-2c
kops delete instancegroup 2xlarge-nodes-<old-kubernetes-version>
kops update cluster
kops update cluster --yes

Scale the ingress-controllers down

We can now scale the ingress-controllers back down.

kubectl -n ingress-controllers edit deployment nginx-ingress-acme-ingress-nginx-controller

…and reduce the number of replicas.

Commit changes back to main branch

Any changes made to the cloud-platform-infrastructure repository should be raised as a pull request for review. This includes manifest changes and image tag changes.

This page was last reviewed on 2 November 2021. It needs to be reviewed again on 2 February 2022 by the page owner #cloud-platform.