
Create a new cluster

You can create a new kops test cluster using the cluster build pipeline.

Alternatively, if you need more control over the test cluster parameters, or you just prefer to do it manually, the rest of this document describes the process.

Pre-requisites

Before you begin, there are a few pre-requisites:

  • Your GPG key must be added to the infrastructure repo so that you are able to run git-crypt unlock (the script will run this for you, but you must be able to do it)

  • You have an AWS CLI profile moj-cp, with suitable credentials

  • You have docker installed
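
If you want to sanity-check these pre-requisites before running the script, something like the following should work (a rough sketch, run from the root of the infrastructure repo):

git-crypt unlock                      # should decrypt the repo secrets without errors
aws configure list --profile moj-cp   # should show credentials for the moj-cp profile
docker --version                      # confirms docker is installed and on your PATH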

Environment Variables

See the file example.env.create-cluster in the infrastructure repo. This shows examples of the environment variables which must be set in order to run the create-cluster.rb script to create a new cluster.

You can get the auth0 values from the terraform-provider-auth0 application on auth0.
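
For example, one way to set these variables is to copy the example file, fill in your values, and source it before launching the tools shell (a sketch only; this assumes the file uses export statements, otherwise export the variables yourself):

cp example.env.create-cluster my-cluster.env
# edit my-cluster.env, filling in the auth0 and AWS values
source my-cluster.env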

Run the build script, via the tools image

The cloud platform tools image has all the software required to run the build script and create a cluster.

Start from the root directory of a working copy of the infrastructure repo.

With your environment variables set, launch a bash shell on the tools image:

make tools-shell

Execute the script, supplying the desired name of your new cluster, e.g.:

./create-cluster.rb --name david-test1

NB: Your cluster name must be no more than 12 characters. Any longer, and some of the computed strings which include the cluster name will exceed their maximum allowed lengths. The error messages you get if this happens are unhelpful, so the build script will fail immediately if you supply a name which is too long.

See our cluster naming policy for information on how to choose a suitable name for your cluster.

By default, the script will create a ‘small’ cluster. This means the master and worker EC2 instances will use less powerful machine types than our production cluster, and you’ll get 3 worker nodes. To create a ‘production’-sized cluster, add your cluster name and the values you want to these terraform map variables:

variable "cluster_node_count_a" {
  description = "The number of worker node in the cluster in Availability Zone eu-west-2a"
  default = {
    live-1  = "9"
    default = "1"
  }
}

variable "cluster_node_count_b" {
  description = "The number of worker node in the cluster in Availability Zone eu-west-2b"
  default = {
    live-1  = "9"
    default = "1"
  }
}

variable "cluster_node_count_c" {
  description = "The number of worker node in the cluster in Availability Zone eu-west-2c"
  default = {
    live-1  = "9"
    default = "1"
  }
}

variable "master_node_machine_type" {
  description = "The AWS EC2 instance types to use for master nodes"
  default = {
    live-1  = "c5.4xlarge"
    default = "c5.large"
  }
}

variable "worker_node_machine_type" {
  description = "The AWS EC2 instance types to use for worker nodes"
  default = {
    live-1  = "r5.xlarge"
    default = "r5.large"
  }
}

In these terraform maps you can specify different machine types (for master and worker nodes) and the number of worker nodes per availability zone (cluster_node_count_a|b|c) for your cluster.
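
For example, to give a test cluster production-sized worker nodes, you could add an entry keyed on the cluster name to the relevant maps (a sketch only, using the david-test1 example name; the values shown simply mirror the live-1 entries):

variable "worker_node_machine_type" {
  description = "The AWS EC2 instance types to use for worker nodes"
  default = {
    live-1      = "r5.xlarge"
    david-test1 = "r5.xlarge" # entry for your test cluster
    default     = "r5.large"
  }
}

Repeat the same pattern in master_node_machine_type and the cluster_node_count_a|b|c maps as needed.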

You do not need to commit/PR this change if you are just specifying values for a single test cluster.

You can see more options to use when creating the cluster by running:

./create-cluster.rb --help

The script takes around 30 minutes to execute. At the end, you should see output like this:

Run options: exclude {:cluster=>"live-1"}
..........................

Finished in 4 minutes 53.4 seconds (files took 4.54 seconds to load)
26 examples, 0 failures

2019-10-07 09:07:49 kubectl cluster-info
Kubernetes master is running at https://api.david-test1.cloud-platform.service.justice.gov.uk
KubeDNS is running at https://api.david-test1.cloud-platform.service.justice.gov.uk/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

The output before the kubectl cluster-info line comes from the cluster integration tests, which run at the end of the build process. The number of tests will change over time, so the output will vary from what is shown here.

Each dot in the line represents a passing test. If any tests fail, you will see more output explaining the nature of the failure.

Any test failures could indicate a problem with your test cluster, which will need to be resolved before you can confidently use it for anything.

Alerts

The build script replaces the webhook values in the source code with dummy values, to prevent alerts from test clusters from showing up in the team slack channels.

If you want to reverse this, you need to revert the changes to these values and then run a terraform apply:

  • high priority alerts: terraform/cloud-platform-components/terraform.tfvars (look for pagerduty_config)
  • lower priority alerts: terraform/cloud-platform-components/terraform.tfvars (look for alertmanager_slack_receivers)
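
One way to revert the dummy values is to discard the local changes to that file, for example (assuming you have no other local changes to it that you want to keep):

git checkout -- terraform/cloud-platform-components/terraform.tfvars

Then follow the steps below to apply the real values to your cluster.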

Apply the changes

Use terraform to apply the changes to the prometheus operator in your test cluster.

Ensure your terraform workspace and kubernetes context are pointing at your test cluster

cd terraform/cloud-platform-components
terraform workspace list

If you are not targeting your test cluster workspace, run

terraform workspace select [your cluster]

Check your kubernetes context by running

kubectl cluster-info

The output should be

Kubernetes master is running at https://api.XXXXXXX.cloud-platform.service.justice.gov.uk
KubeDNS is running at https://api.XXXXXXX.cloud-platform.service.justice.gov.uk/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

…where XXXXXXX is the name of your test cluster.

If your context is pointing at live-1, switch to your test cluster by running

kops export kubecfg XXXXXXX.cloud-platform.service.justice.gov.uk

When you are confident that both your terraform workspace and kubernetes context are set to your test cluster, apply the change to set the pagerduty config value:

terraform apply -target=helm_release.prometheus_operator

Answer yes if the planned changes look correct.

Authenticating

If you want to authenticate to your new cluster in the same way as a regular user would (i.e. logging in via auth0, rather than using kops export kubecfg...), the auth0 login URL is:

https://login.apps.[cluster name].cloud-platform.service.justice.gov.uk

e.g. for a cluster called “david-test1”, the URL to authenticate and download a ~/.kube/config file is:

https://login.apps.david-test1.cloud-platform.service.justice.gov.uk

Docker authentication

Due to Docker Hub rate limiting, it may be necessary to add authentication to the Docker clients running in a kops cluster.

The following error message will appear if your cluster has been rate limited by Docker Hub:

Failed to pull image \"docker.io/prometheus:1.0"

ERROR: toomanyrequests: Too Many Requests.

Adding a Docker config file to your cluster

If you wish to authenticate to Docker Hub, you can use the -d argument to specify the path to the Docker config file to pass to the cluster, e.g.

./create-cluster.rb --name test-01 -d /path/to/local/docker/config.json

To generate a local docker/config.json file you need to use docker login from the command line.
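
For example (a sketch; the exact file location can vary, but ~/.docker/config.json is the default on Linux and macOS):

docker login --username <your-docker-hub-username>
# after entering your password or access token, the credentials (or a reference
# to your credential helper) are written to ~/.docker/config.json

You can then pass that path to the -d argument as shown above.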
