# Creating a live-like cluster
When testing cluster upgrades, it is useful to run the procedure on a cluster that is as close to the live cluster as possible. The following steps bring an existing test cluster to a configuration similar to the live cluster.
## Pre-requisites
- a test cluster created using the cluster build pipeline or manually
## Setting cluster size to match Live
- Set the node group desired size to 60 (check the live cluster for the up-to-date number) in the AWS console under Compute
- Set `node_groups_count` to the same as the live cluster (60) and `default_ng_min_count` to 60 in `terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf`
- Copy the `node_size` values from live to default, currently `["r6i.2xlarge", "r6i.xlarge", "r5.2xlarge"]`
- Copy the `monitoring_node_size` values from live to default, currently `["r6i.8xlarge", "r5a.2xlarge"]`
- Ensure that your Terraform workspace matches your cluster name
- Run `terraform plan` and confirm that your changes are correct
- Run `terraform apply` to apply the changes to your test cluster
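
Put together, a typical sequence from the `eks` layer might look like the following sketch (paths are taken from this guide and assume you start at the repository root):

```bash
# Run after editing cluster.tf with the node counts and instance types above
cd terraform/aws-accounts/cloud-platform-aws/vpc/eks
terraform workspace show   # should print your test cluster's name
terraform plan             # expect changes to node group sizes and instance types
terraform apply
```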
## Installing live components and test applications
- In `terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components`, enable the following components:
  - `cluster_autoscaler`
  - `large_nodegroup`
  - `kibana_proxy`
  - `ecr_exporter`
  - `cloudwatch_exporter`
  - `velero`
To find components that are enabled in live but not in test, you can search for `lookup(local.live_workspace, terraform.workspace, false)` in `components.tf`.
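
For example, a quick way to list those gates (a sketch, run from the components layer):

```bash
# Show every component gated on the live workspace list
grep -n 'lookup(local.live_workspace, terraform.workspace, false)' components.tf
```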
- Add the `starter_pack_count = 40` variable to the `starter_pack` module
Sometimes Terraform will error out with an unclear error message. This is usually caused by a low default open-files `ulimit`. To fix this, run `ulimit -n 2048`.
- Run `terraform plan` and confirm that your changes are correct
- Run `terraform apply` to apply the changes to your test cluster
- You may need to run `plan` and `apply` again, as the starter pack addons don't always install cleanly in a single apply
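
Put together, the apply sequence for the components layer might look like this sketch (the `ulimit` value and the repeat apply come from the notes above):

```bash
# Run from terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components,
# after enabling the live components and setting starter_pack_count = 40
ulimit -n 2048     # raise the open-files limit to avoid unclear terraform errors
terraform plan
terraform apply
terraform apply    # a second pass may be needed for the starter pack addons
```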
## Upgrading a live-like test cluster
See the documentation for upgrading a cluster.
## Monitoring the upgrade
- Set up Pingdom alerts for the starter-pack helloworld and multi-container apps

When nodes recycle, it's possible that the multi-container app will break, giving false positives.
## Useful command one-liners
- `watch -n 1 "kubectl get events"` - get all Kubernetes events
- `watch -n 1 "kubectl get pods -A | grep ContainerStatusUnknown"` - get all containers in “ContainerStatusUnknown” state
- `watch -n 1 "kubectl get pods -A | grep Error"` - get all containers in “Error” state
- `watch -n 1 "kubectl get nodes --sort-by=\".metadata.creationTimestamp\""` - get all nodes sorted by creation timestamp
## Useful third party tools
You may refer to the Monitor EKS Cluster section for more details.
## Final Tests
- Run `make run-tests` from the root of the cloud-platform-infrastructure repository
- Update `cluster_version` in `cluster.tf` to match the version you upgraded to
- Run `terraform plan` to ensure there are no unexpected changes
- Go to the `components` layer and scale the `starter_pack` module up and down to ensure `terraform apply` runs smoothly
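
A sketch of that final scale-down/scale-up check (assuming `starter_pack_count` is edited in place in the module block, as when it was first added):

```bash
# From the components layer:
# 1. Set starter_pack_count = 0 in the starter_pack module, then:
terraform apply
# 2. Restore starter_pack_count = 40, then:
terraform apply
```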
## Tearing down

- Run the delete cluster pipeline
- Remove the Pingdom checks