Upgrade cluster components
Cluster components are application-layer components installed in a cluster, such as prometheus, external-dns, opa and certmanager.
Components are configured as terraform modules and are called from the cloud-platform-infrastructure repo with a release tag.
Planning
When you start working on a cloud-platform-components upgrade ticket:
- Check which chart versions are available to upgrade to
- Check whether the component is upgradeable to that chart version from the current one (some major versions cannot be skipped)
- Check the release notes of the component for any breaking changes
- Check and add notes from the upgrade process described in the component's original GitHub repository (if any)
- Check and add notes from every CHANGELOG.md entry of the component between the chart version currently installed in your cluster and the chart version you want to upgrade to
- Review the CHANGELOG notes with another member of the team (check for breaking changes, deprecations, changes to the values file and the suggested plan for upgrading the production clusters live-1, eks-manager and live)
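To work out which chart versions sit between the installed version and the target (and therefore which CHANGELOG entries need reading), a small filter like the one below may help. This is a minimal sketch, not part of the cloud-platform tooling: versions_between is a hypothetical helper, it assumes a sort that supports -V (version sort, as in GNU coreutils), and it assumes the installed version appears in the input list.

```shell
# versions_between CURRENT TARGET
# Reads one version per line on stdin and prints every version after CURRENT,
# up to and including TARGET, in ascending version order.
# Hypothetical helper; assumes `sort -V` is available and CURRENT is in the list.
versions_between() {
  current="$1"
  target="$2"
  sort -u -V | awk -v cur="$current" -v tgt="$target" '
    ($0 "") == (cur "") { seen = 1; next }  # start after the installed version
    seen { print }                          # print everything newer than it
    ($0 "") == (tgt "") { exit }            # stop once the target is printed
  '
}
```

The version list can come from any source, for example the chart version column of helm search repo <chart> --versions, saved one version per line.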
Testing the upgrade in a test cluster
- Your GPG key must be added to the cloud-platform-infrastructure repo so that you are able to run git-crypt unlock
- You have the AWS CLI profile moj-cp with suitable credentials
- You have docker installed
Setup environment
Set up the environment variables listed in example.env.create-cluster in the cloud-platform-infrastructure repo.
Create test cluster
Run the create-cluster concourse job to create a test cluster.
Run a shell in the tools image
The cloud platform tools image has all the software required to update a cluster.
From a local copy of the cloud-platform-infrastructure repo, run the following command:
make tools-shell
Authenticate to the test cluster
Create the file ~/.kube/config in your tools-image container by running:
aws eks --region eu-west-2 update-kubeconfig --name <cluster-name>
Run the integration tests
This confirms that the test cluster has no existing issues and is ready to use.
To run the Go tests:
make run-tests
Testing the upgrade
Make the changes required to the module. For example, to upgrade cert-manager, change the cert-manager terraform module. This might include:
- The helm chart version
- Changes to the values file (if needed)
Push the changes to a branch of the module (for example, upgrade).
Update the local copy of the cloud-platform-infrastructure repo with the branch reference. For the cert-manager module, the source would change to:
source = "github.com/ministryofjustice/cloud-platform-terraform-certmanager?ref=upgrade"
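The branch reference change above can also be scripted rather than edited by hand. The following is a rough sketch only: set_module_ref is a hypothetical helper, the file path is left for you to supply, and it assumes the ref value contains no | or & characters. It is not the cloud-platform CLI.

```shell
# set_module_ref FILE MODULE REF
# Rewrites the ?ref=... suffix on every line of FILE that mentions MODULE.
# Hypothetical helper; assumes REF contains no "|" or "&" characters.
set_module_ref() {
  file="$1"
  module="$2"
  ref="$3"
  tmp="$(mktemp)"
  # Only touch lines naming the module; replace everything between "?ref=" and the closing quote.
  sed -E "/${module}/ s|(\?ref=)[^\"]*|\1${ref}|" "$file" > "$tmp" &&
    cat "$tmp" > "$file"   # write back in place, keeping the file's permissions
  rm -f "$tmp"
}
```

For example, set_module_ref <path-to-terraform-file> certmanager upgrade would point the cert-manager module at the upgrade branch. Review the result with git diff before running terraform plan.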
Run terraform plan for the changes, verify that the planned changes are correct, then run terraform apply to apply them.
Check the things to observe section for specific components.
Run the integration tests again
Once testing is complete and the integration tests pass, create a PR for the team to review and make sure the module unit tests pass. After the PR is approved, merge the changes to the main branch of the module and make a release.
Change the module release tag in the eks/core/components folder of the cloud-platform-infrastructure repo and raise a PR. Verify the terraform plan from the cloud-platform-infrastructure plan pipeline and get it reviewed by the team.
Once approved, merge the PR and monitor the cloud-platform-infrastructure apply pipeline while it applies the changes.
Run the reporting tests in concourse to ensure live/manager/live-2 are working as expected.
Things to observe when testing the upgrade
Below are some general things to check during the upgrade. This is not a complete list.
Performing the upgrade
There is a CLI command to perform the upgrade. From the cloud-platform-infrastructure repo, run the following command:
cloud-platform environment bump-module --module <module-name> --module-version <version>
The <module-name> value must be a word that appears in the module source. For example, to upgrade the cert-manager module to 0.5.0, run the following command:
cloud-platform environment bump-module --module certmanager --module-version 0.5.0
Cert manager
- The CRDs for cert-manager do not get deleted
- The existing certificates do not change in any way. This can be verified by checking the creation timestamps of the certificates
- New certificates can be created and validated
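One way to check that existing certificates are untouched is to snapshot their creation timestamps before and after the upgrade and compare the two files. The sketch below assumes kubectl is pointed at the test cluster; snapshot_diff is a hypothetical helper and before.txt / after.txt are illustrative file names.

```shell
# snapshot_diff BEFORE AFTER
# Prints lines that appear in only one of the two snapshot files
# (lines found only in AFTER are indented with a tab). No output means no change.
snapshot_diff() {
  before_sorted="$(mktemp)"
  after_sorted="$(mktemp)"
  sort "$1" > "$before_sorted"
  sort "$2" > "$after_sorted"
  comm -3 "$before_sorted" "$after_sorted"
  rm -f "$before_sorted" "$after_sorted"
}

# Taking a snapshot (run once before and once after the upgrade):
#   kubectl get certificates --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,CREATED:.metadata.creationTimestamp \
#     > before.txt
# Then compare:
#   snapshot_diff before.txt after.txt
```

Any line printed by snapshot_diff points at a certificate that was added, removed or recreated during the upgrade.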
Prometheus
- Ensure the prometheus-operator CRDs are updated to the correct versions before upgrading prometheus-operator
- The existing PrometheusRules and ServiceMonitors do not change in any way. This can be verified by checking the creation timestamps of those resources
- Prometheus can be queried for metrics
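A quick way to confirm that Prometheus answers queries is to call its HTTP API and check the response. The sketch below is an assumption-laden example: prometheus_query_ok is a hypothetical helper that does a simple string match on the compact JSON Prometheus returns (it is not a JSON parser), and the namespace, service name and port in the usage comment are placeholders, not values taken from this runbook.

```shell
# prometheus_query_ok
# Reads a Prometheus HTTP API response on stdin. Succeeds only if the response
# reports "status":"success" and the result list is not empty.
# Hypothetical helper; relies on string matching, not JSON parsing.
prometheus_query_ok() {
  response="$(cat)"
  case "$response" in
    *'"status":"success"'*) ;;      # query was accepted and evaluated
    *) return 1 ;;
  esac
  case "$response" in
    *'"result":[]'*) return 1 ;;    # query ran but returned no series
  esac
  return 0
}

# Usage (placeholders in <>):
#   kubectl -n <namespace> port-forward svc/<prometheus-service> 9090:9090 &
#   curl -s 'http://localhost:9090/api/v1/query?query=up' | prometheus_query_ok && echo "query OK"
```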
Logs
Compare the logs for the updated component/addon between the test cluster and live to ensure there are no additional warnings or errors.
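As a rough first pass, the number of warning and error lines in each cluster's logs can be compared before reading the logs in detail. count_problem_lines below is a hypothetical helper, and the context, namespace and workload names in the usage comment are placeholders. Matching on the words warn and error is an assumption about the log format of the component.

```shell
# count_problem_lines
# Counts the lines on stdin that mention "warn" or "error" (case-insensitive).
count_problem_lines() {
  # grep -c prints 0 but exits non-zero when nothing matches, so mask the status.
  grep -ciE 'warn|error' || true
}

# Usage (placeholders in <>):
#   kubectl --context <test-context> -n <namespace> logs <workload> | count_problem_lines
#   kubectl --context <live-context> -n <namespace> logs <workload> | count_problem_lines
```

A higher count on the test cluster is only a prompt to look closer; the log lines themselves still need to be read.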
Compare environment variables
Check the component/addon daemonset to ensure that the environment variables match.
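The environment variables can be dumped from each cluster and compared as plain NAME=VALUE lines. This is a sketch under assumptions: env_matches is a hypothetical helper, the context, namespace and daemonset names are placeholders, and variables populated through valueFrom will appear with an empty value in the dump.

```shell
# env_matches LEFT RIGHT
# Compares two files of NAME=VALUE lines regardless of line order.
# Prints a diff and returns non-zero when they differ.
env_matches() {
  left="$(mktemp)"
  right="$(mktemp)"
  sort "$1" > "$left"
  sort "$2" > "$right"
  diff "$left" "$right"
  status=$?
  rm -f "$left" "$right"
  return "$status"
}

# Dumping the variables (placeholders in <>; run once per cluster):
#   kubectl --context <context> -n <namespace> get daemonset <name> \
#     -o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}={.value}{"\n"}{end}' \
#     > test-env.txt
# Then compare:
#   env_matches test-env.txt live-env.txt && echo "environment variables match"
```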