Skip to main content

Creating a live-like cluster

When testing cluster upgrades, it is useful to test the procedure which is as close to the live cluster as possible. The following steps will update an existing test cluster to the configuration similar to the live cluster.

Pre-requisites

Setting cluster size to match Live

  1. Set the node group desired size to 60 (check the live cluster for up-to-date number) in the AWS console under Compute
  2. Set the node_groups_count to same as live cluster (60) and default_ng_min_count to 60 in terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf
  3. Copy the node_size values from live to default, currently ["r6i.2xlarge", "r6i.xlarge", "r5.2xlarge"]
  4. Copy the monitoring_node_size values from live to default, currently ["r6i.8xlarge", "r5a.2xlarge"]
  5. Ensure that your Terraform workspace matches your cluster name
  6. Run terraform plan and confirm that your changes are correct
  7. Run terraform apply to apply the changes to your test cluster

Installing live components and test applications

  1. In terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components enable the following components:
    • cluster_autoscaler
    • large_nodegroup
    • kibana_proxy
    • ecr_exporter
    • cloudwatch_exporter
    • velero

To find components that are enabled in live but not in test you can search for lookup(local.live_workspace, terraform.workspace, false) in components.tf.

  1. Add the starter_pack_count = 40 variable to the starter_pack module

Sometimes terraform will error out with an unclear error message. This is usually due to a low default ulimit. To fix this, you can set ulimit -n 2048

  1. Run terraform plan and confirm that your changes are correct
  2. Run terraform apply to apply the changes to your test cluster
  3. You may need to run plan and apply again as the starter pack addons don’t like to be installed all at once

Upgrading a live-like test cluster

See documentation for upgrading a cluster.

Monitoring the upgrade

  • Setup pingdom alerts for starter-pack helloworld and multi-container app

When nodes recycle, it’s possible that the multi-container app will break giving false positives.

  • Useful command liners

    • watch -n 1 "kubectl get events" - get all Kubernetes events
    • watch -n 1 "kubectl get pods -A | grep ContainerStatusUnknown" - get all containers in “ContainerStatusUnknown” state
    • watch -n 1 "kubectl get pods -A | grep Error" - get all containers in “Error” state
    • watch -n 1 "kubectl get nodes --sort-by=\".metadata.creationTimestamp\"" - get all nodes and sort by create timestamp
  • Useful third party tools

You may refer to Monitor EKS Cluster section for more details.

Final Tests

  1. Run make run-tests from the root cloud-platform-infrastructure repository
  2. Update cluster.tf cluster_version to match version upgraded to
  3. Run terraform plan to ensure there are no unexpected changes
  4. Go to component layer and scale up and down the starter_pack module to ensure terraform apply can run smoothly

Tearing down

  1. Run the delete cluster pipeline
  2. Remove pingdom checks
This page was last reviewed on 16 October 2024. It needs to be reviewed again on 16 April 2025 by the page owner #cloud-platform .
This page was set to be reviewed before 16 April 2025 by the page owner #cloud-platform. This might mean the content is out of date.