# Creating a live-like cluster
When testing cluster upgrades, it is useful to run the procedure on a cluster that is as close to the live cluster as possible. The following steps bring an existing test cluster to a configuration similar to the live cluster.
## Pre-requisites
- a test cluster created using the cluster build pipeline or manually
## Setting cluster size to match Live
- Set the node group desired size to 60 (check the live cluster for the up-to-date number) in the AWS console under Compute
- Set `node_groups_count` to the same as the live cluster (60) and `default_ng_min_count` to 60 in `terraform/aws-accounts/cloud-platform-aws/vpc/eks/cluster.tf`
- Copy the `node_size` values from live to default, currently `["r6i.2xlarge", "r6i.xlarge", "r5.2xlarge"]`
- Copy the `monitoring_node_size` values from live to default, currently `["r6i.8xlarge", "r5a.2xlarge"]`
- Ensure that your Terraform workspace matches your cluster name
- Run `terraform plan` and confirm that your changes are correct
- Run `terraform apply` to apply the changes to your test cluster
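
Put together, a typical sequence from the `eks` layer might look like the following sketch (paths are taken from this guide and assume you start at the repository root):

```bash
# Run after editing cluster.tf with the node counts and instance types above
cd terraform/aws-accounts/cloud-platform-aws/vpc/eks
terraform workspace show   # should print your test cluster's name
terraform plan             # expect changes to node group sizes and instance types
terraform apply
```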
## Installing live components and test applications
- In `terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components`, enable the following components:
  - `cluster_autoscaler`
  - `large_nodegroup`
  - `kibana_proxy`
  - `ecr_exporter`
  - `cloudwatch_exporter`
  - `velero`
To find components that are enabled in live but not in test, you can search for `lookup(local.live_workspace, terraform.workspace, false)` in `components.tf`.
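
For example, a quick way to list those gates (a sketch, run from the components layer):

```bash
# Show every component gated on the live workspace list
grep -n 'lookup(local.live_workspace, terraform.workspace, false)' components.tf
```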
- Add the `starter_pack_count = 40` variable to the `starter_pack` module
Sometimes Terraform will error out with an unclear error message. This is usually caused by a low default open-files `ulimit`. To fix this, run `ulimit -n 2048`.
- Run `terraform plan` and confirm that your changes are correct
- Run `terraform apply` to apply the changes to your test cluster
- You may need to run `plan` and `apply` again, as the starter pack addons don't always install cleanly in a single apply
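
Put together, the apply sequence for the components layer might look like this sketch (the `ulimit` value and the repeat apply come from the notes above):

```bash
# Run from terraform/aws-accounts/cloud-platform-aws/vpc/eks/core/components,
# after enabling the live components and setting starter_pack_count = 40
ulimit -n 2048     # raise the open-files limit to avoid unclear terraform errors
terraform plan
terraform apply
terraform apply    # a second pass may be needed for the starter pack addons
```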
## Upgrading a live-like test cluster
See the documentation for upgrading a cluster.
## Monitoring the upgrade
- Set up Pingdom alerts for the starter-pack helloworld and multi-container apps

When nodes recycle, it's possible that the multi-container app will break, giving false positives.
## Useful command one-liners
- `watch -n 1 "kubectl get events"` - get all Kubernetes events
- `watch -n 1 "kubectl get pods -A | grep ContainerStatusUnknown"` - get all containers in “ContainerStatusUnknown” state
- `watch -n 1 "kubectl get pods -A | grep Error"` - get all containers in “Error” state
- `watch -n 1 "kubectl get nodes --sort-by=\".metadata.creationTimestamp\""` - get all nodes sorted by creation timestamp
## Useful third party tools
You may refer to the Monitor EKS Cluster section for more details.
## Final Tests
- Run `make run-tests` from the root of the cloud-platform-infrastructure repository
- Update `cluster_version` in `cluster.tf` to match the version you upgraded to
- Run `terraform plan` to ensure there are no unexpected changes
- Go to the `components` layer and scale the `starter_pack` module up and down to ensure `terraform apply` runs smoothly
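
A sketch of that final scale-down/scale-up check (assuming `starter_pack_count` is edited in place in the module block, as when it was first added):

```bash
# From the components layer:
# 1. Set starter_pack_count = 0 in the starter_pack module, then:
terraform apply
# 2. Restore starter_pack_count = 40, then:
terraform apply
```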
## Tearing down

- Run the delete cluster pipeline
- Remove the Pingdom checks