Running Kops Update and Rolling-update

We periodically run kops update so that the components and configuration managed by kops stay up to date.

Note: This is for a kops update, not an upgrade. Upgrading to a new release is described in a separate runbook.

Using kops to change the cluster

Please do not be tempted to make changes to our production cluster(s) using kops edit...

Our cluster state is set by:

github code → generated kops yaml file → kops s3 config store → kops state → the cluster

Using kops edit skips the first two links of this chain, and causes problems the next time we try to update the cluster ‘properly’, by starting with a change to the code on github.

Instead, please use the procedure below, so that the code in github is always the source of truth for the state of the cluster(s), and so that all changes have been properly recorded and reviewed as pull requests.


Before you begin, there are a few pre-reqs:

  • You are on the right cluster to run the Kops update.
$ kubectl config current-context
  • You are using the AWS_PROFILE for the account where the k8s cluster was created.
$ export AWS_PROFILE=xxxx
  • Set the --state flag or export KOPS_STATE_STORE
$ export KOPS_STATE_STORE=s3://<bucket>
  • Launch a shell using the tools image
$ cd cloud-platform-infrastructure; make tools-shell

This will ensure you have the correct versions of all required software.

For live-1, use KOPS_STATE_STORE=s3://cloud-platform-kops-state
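
The environment pre-reqs above can be checked before running any kops command. The following is a minimal sketch, not part of the existing tooling: check_kops_env is a hypothetical helper name, and the check only confirms that the variables are set and that the state store looks like an S3 URL.

```shell
# Sketch: fail early if the environment pre-reqs are not in place.
# check_kops_env is a hypothetical helper, not a kops or tools-image command.
check_kops_env() {
  rc=0
  if [ -z "${AWS_PROFILE:-}" ]; then
    echo "AWS_PROFILE is not set" >&2
    rc=1
  fi
  case "${KOPS_STATE_STORE:-}" in
    s3://?*) ;;  # looks like an S3 URL with a bucket name
    *)
      echo "KOPS_STATE_STORE must look like s3://<bucket>" >&2
      rc=1
      ;;
  esac
  return "$rc"
}
```

A possible use is `check_kops_env && kubectl config current-context`, so that the context is only printed once the variables are in place.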

Updating the kops state

The kops state file is generated from a template by the terraform code in this directory. The key files are:

  • templates/kops.yaml.tpl

Generating an updated kops config file

After changing the files above, you need to generate a new kops/[cluster name].yaml file.

From your shell on the tools image:

kops export kubecfg [cluster name]
cd terraform/cloud-platform
terraform workspace select [cluster name]
terraform init
terraform apply

At this point, you should have an updated kops/[cluster name].yaml file in your working copy of the infrastructure repository. Commit and push your changes, including those in the terraform source and template files.

Applying local changes to the kops state store

Back up the remote kops state

If you want to preserve a backup of the current kops state on S3, you can do so like this:

aws s3 cp --recursive $KOPS_STATE_STORE/ $KOPS_STATE_STORE/live-1-backup

This takes a couple of minutes. At time of writing, this was around 10G of data (mostly etcd backup tarballs).

While the backup exists, kops get clusters will report “live-1” twice. This doesn’t do any harm, but the duplicate entry will remain until you delete the backup, either with the aws CLI tool or through the AWS web interface for S3.
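
Deleting the backup with the aws CLI could look like the sketch below. It assumes the backup lives under the live-1-backup prefix used in the copy command above. delete_kops_backup is a hypothetical helper name, and the --yes argument is this sketch's own convention (borrowed from kops), not an aws flag. Without --yes it passes --dryrun to aws s3 rm, which only lists what would be removed.

```shell
# Sketch: remove the state-store backup created by the copy command above.
# delete_kops_backup is a hypothetical helper; "live-1-backup" is the prefix
# used when the backup was taken.
delete_kops_backup() {
  if [ -z "${KOPS_STATE_STORE:-}" ]; then
    echo "KOPS_STATE_STORE is not set" >&2
    return 1
  fi
  # Strip one trailing slash from the store URL, then target only the backup prefix.
  backup_url="${KOPS_STATE_STORE%/}/live-1-backup/"
  if [ "${1:-}" = "--yes" ]; then
    aws s3 rm --recursive "$backup_url"
  else
    # --dryrun prints the objects that would be deleted without deleting them.
    aws s3 rm --recursive --dryrun "$backup_url"
  fi
}
```

Running `delete_kops_backup` first and `delete_kops_backup --yes` afterwards keeps the destructive step explicit.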

Updating the remote kops state

To apply your local changes to the kops state stored on S3:

kops replace -f kops/[cluster name].yaml

WARNING: You will not receive any confirmation of this step.

After this, you should be able to do a kops update cluster... and kops rolling-update cluster... as described below, to apply your changes to the cluster.
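
Because kops replace gives no confirmation, one option is to wrap it in a prompt. This is a sketch only: confirm_kops_replace is a hypothetical helper name, and the prompt wording is illustrative.

```shell
# Sketch: ask before overwriting the remote kops state, since kops replace
# itself gives no confirmation.
# confirm_kops_replace is a hypothetical helper, not a kops command.
confirm_kops_replace() {
  config_file="$1"
  if [ ! -f "$config_file" ]; then
    echo "No such file: $config_file" >&2
    return 1
  fi
  # The prompt goes to stderr so stdout carries only the kops output.
  printf 'Replace state in %s with %s? [y/N] ' \
    "${KOPS_STATE_STORE:-<unset>}" "$config_file" >&2
  read -r answer || answer=""
  case "$answer" in
    y|Y)
      kops replace -f "$config_file"
      ;;
    *)
      echo "Aborted; remote state unchanged." >&2
      return 1
      ;;
  esac
}
```

For example: `confirm_kops_replace "kops/[cluster name].yaml"`, with the placeholder replaced by the real cluster name.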

Running Kops update

Run kops update cluster without --yes first. This shows a preview of the changes kops will make, so you can check them before anything is applied. Some changes also need a kops rolling-update cluster to roll out the configuration immediately.

kops update cluster ${CLUSTER_NAME}

Once you are happy with the preview, run kops update cluster with --yes to apply the changes.

kops update cluster ${CLUSTER_NAME} --yes

It should report

Cluster changes have been applied to the cloud.

Running Kops rolling-update

This command updates a kubernetes cluster to match the cloud and kops specifications.

To perform a rolling update, you need to update the cloud resources first with the command kops update cluster.

If rolling-update does not report that the cluster needs to be rolled, you can force the cluster to be rolled with the --force flag.

Rolling update drains and validates the cluster by default. A cluster is deemed validated when all required nodes are running and all pods in the kube-system namespace are operational. When a node is deleted, rolling-update sleeps the interval for the node type, and then tries for the same period of time for the cluster to be validated.

For instance, setting --master-interval=4m causes rolling-update to wait for 4 minutes after a master is rolled, and another 4 minutes for the cluster to stabilize and pass validation.
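
As a worked example of the interval arithmetic described above, each rolled node can cost up to twice its interval in waiting: one interval of sleep, then up to one more for validation. The sketch below turns that into a rough upper bound. max_roll_wait_minutes is a hypothetical helper name. It ignores drain and instance termination time, and it assumes nodes are rolled one at a time.

```shell
# Sketch: rough upper bound on interval-related waiting in a rolling update.
# max_roll_wait_minutes is a hypothetical helper. It ignores drain time and
# instance termination, and assumes nodes are rolled one at a time.
max_roll_wait_minutes() {
  node_count="$1"
  interval_minutes="$2"
  # Per node: sleep for the interval, then up to the same again for validation.
  echo $(( node_count * interval_minutes * 2 ))
}
```

With three masters and --master-interval=4m, `max_roll_wait_minutes 3 4` prints 24, so the masters alone can account for up to 24 minutes of waiting.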

kops rolling-update cluster ${CLUSTER_NAME} --yes

It should report the following steps for each node in the cluster.

Draining the node: "${node}".
node/${node} cordoned
WARNING: Deleting pods with local storage: ${pod names}
pod/${pod1 name} evicted
pod/${pod2 name} evicted
pod/${podn name} evicted
Waiting for 1m30s for pods to stabilize after draining.
deleting node "${node}" from kubernetes
Stopping instance "${aws-instance}", node "${node}", in group "nodes.${CLUSTER_NAME}" (this may take a while).
waiting for 4m0s after terminating instance
Validating the cluster.
Cluster did not pass validation, will try again in "30s" until duration "5m0s" expires: machine "${aws-instance}" has not yet joined cluster.
Cluster validated.

Open a new terminal tab and watch the status of the rolling-update:

watch -n10 kubectl get nodes --sort-by .metadata.creationTimestamp

Once the rolling-update is completed it should report

Rolling update completed for cluster ${CLUSTER_NAME}!

Check the cluster status with:

$ kops validate cluster

It should report

Your cluster ${CLUSTER_NAME} is ready
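
If the cluster is still settling, the validation check can be repeated until it passes. The sketch below assumes that kops validate cluster exits non-zero while validation is failing. wait_for_valid_cluster is a hypothetical helper name, and the default attempt count and sleep time are arbitrary.

```shell
# Sketch: poll kops validate cluster until it succeeds or attempts run out.
# wait_for_valid_cluster is a hypothetical helper; the defaults are arbitrary.
wait_for_valid_cluster() {
  max_attempts="${1:-10}"
  sleep_seconds="${2:-30}"
  attempt=1
  while :; do
    # Assumption: kops validate cluster exits non-zero while validation fails.
    if kops validate cluster; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Cluster did not validate after ${max_attempts} attempts" >&2
      return 1
    fi
    echo "Attempt ${attempt}/${max_attempts} failed; retrying in ${sleep_seconds}s" >&2
    attempt=$(( attempt + 1 ))
    sleep "$sleep_seconds"
  done
}
```

For example, `wait_for_valid_cluster 20 30` tries up to 20 times, 30 seconds apart.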

Possible Errors

While draining a node, the rolling update may get stuck evicting a pod. If that happens, find the node it is stuck on, identify the pods running there, and delete the stuck pod manually.

$ kubectl get pods -owide --all-namespaces | grep ${node_name}
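
As an alternative to grep, the pods on a node can be selected with a field selector, and the stuck pod can then be deleted by name. This is a sketch; pods_on_node and delete_stuck_pod are hypothetical helper names.

```shell
# Sketch: list the pods scheduled on a node, then delete one by hand.
# pods_on_node and delete_stuck_pod are hypothetical helper names.
pods_on_node() {
  node_name="$1"
  # spec.nodeName is a supported field selector for pods, so no grep is needed.
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=${node_name}"
}

delete_stuck_pod() {
  namespace="$1"
  pod_name="$2"
  kubectl delete pod "$pod_name" --namespace "$namespace"
}
```

For example, `pods_on_node "${node_name}"` followed by `delete_stuck_pod <namespace> <pod name>` for the pod that is blocking the drain.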
This page was last reviewed on 28 September 2021. It needs to be reviewed again on 28 December 2021 by the page owner #cloud-platform .