Cloud Platform Runbooks
Table of contents
How We Work
Getting Help
Sprints and ceremonies
Firebreak
Story
Story points
The Board / Tickets
Adding tickets
Making changes to code
Reviewing/Merging PRs
The 🔨 Hammer of Justice
Backlog Tickets
Documentation
How to update gem files in technical documentation
When to update
How to update
Incident Process
1. Confirm that the event constitutes an incident
2. Declare the incident
Examples
3. Assign roles
3.1 Incident Lead
3.2 Scribe
3.3 Communications Lead
Transferring roles
4. Fix the problem
5. End the incident
6. Post-incident procedure
Incident Log
Q1 2023 (January-March)
Incident on 2023-02-02 10:21 - CJS Dashboard Performance
Incident on 2023-01-11 14:22 - Cluster image pull failure due to DockerHub password rotation
Incident on 2023-01-05 08:56 - CircleCI Security Incident
Q4 2022 (October-December)
Incident on 2022-11-15 16:03 - Prometheus eks-live DOWN.
Q3 2022 (July-September)
Incident on 2022-07-11 09:33 - Slow performance for 25% of ingress traffic
Q1 2022 (January to March)
Incident on 2022-03-10 11:48 - All ingress resources using *.apps.live.cloud-platform urls showing certificate issue
Incident on 2022-01-22 11:57 - some DNS records got deleted at the weekend
Q4 2021 (October to December)
Incident on 2021-11-05 - ModSec ingress controller is erroring
Q3 2021 (July-September)
Incident on 2021-09-30 - SSL Certificate Issue in browsers
Incident on 2021-09-04 22:05 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN
Incident on 2021-07-12 15:24 - All ingress resources using *apps.live-1 domain names stop working
Q2 2021 (April-June)
Incident on 2021-06-09 12:47 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade
Incident on 2021-05-10 12:15 - Apply Pipeline downtime due to accidental destroy of Manager cluster
Q1 2021 (January - March)
No incidents declared
Q4 2020 (October - December)
Incident on 2020-10-06 09:07 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers
Q3 2020 (July - September)
Incident on 2020-09-28 13:10 - Termination of nodes updating kops Instance Group.
Incident on 2020-09-21 18:27 - Some cloud-platform components destroyed.
Incident on 2020-09-07 12:54 - All users are unable to create new ingress rules
Incident on 2020-08-25 11:26 - Connectivity issues with eu-west-2a
Incident on 2020-08-14 11:01 - Ingress-controllers crashlooping
Incident on 2020-08-07 16:39 - Master node provisioning failure
Q2 2020 (April - June)
Incident on 2020-08-04 17:13
Incident on 2020-04-15 10:58 Nginx/TLS
Q1 2020 (January - March)
Incident on 2020-02-25 10:58
Incident on 2020-02-18 14:13 UTC
Incident on 2020-02-12 11:45 UTC
About this incident log
Template
Incident on YYYY-MM-DD HH:MM - [Brief description]
Impact:
Context:
Resolution:
Review actions:
Change Process in Cloud Platform
Making Changes to cloud-platform-infrastructure
Making Changes to Terraform modules
Making Changes to Helm Charts
Making changes to environments (Service Teams)
Communications
Add Concourse to a test cluster
Pre-requisites
Process
Upgrade EKS cluster
Run the upgrade, via the tools image
Pre-requisites
Upgrade Steps
Upgrade EKS Terraform Module
Upgrade Control Plane
Upgrade Node Group(s)
Upgrade addon(s)
Troubleshooting
Upgrade cluster components
Planning
Testing the upgrade in a test cluster
Setup environment
Run a shell in the tools image
Authenticate to the test cluster
Run the integration tests
Testing the upgrade
Things to observe when testing the upgrade
Performing the upgrade
Upgrade Terraform Version
Introduction
Recommendations
Caveats
How to perform the upgrade - divide and conquer
Before the upgrade
Environments state files
Infrastructure state files
Upgrade user components
Making changes to your module
Semantic Versioning and release
Upgrading a user component in environments
Delete a cluster
Delete the cluster using the script
First, run make tools-shell
Delete the cluster using concourse fly commands
Delete an EKS cluster manually
Add nodes/change the instance type of the AWS EKS cluster
Add nodes to the EKS cluster
Requirements
Cluster configuration:
Issue
Change the AWS EKS instance type (worker_node_machine_type)
Credentials rotation for auth0 apps
Preparation
1) Taint resources (terraform)
2) Apply changes within components (terraform)
3) Verifying changes
4) Update Manager cluster within components (terraform)
Git-crypt
Adding new user to the keyring
Rotating the git-crypt key
Add a custom domain
Add a new Alertmanager receiver and a slack webhook
Pre-requisites
Creating a new receiver set
Information Alerts
Cloud Platform Disaster Recovery Scenarios
Losing a Namespace
Impact
Possible Cause
Restore process
Losing a Kubernetes Component or Object
Impact
Possible Cause
Restore process
Losing the whole cluster
Impact
Possible Cause
How this plan is tested:
Assumptions
Restore process
Deleted terraform state
Impact
Possible Cause
Restore process
Recovering more complex scenarios
Create and access bastion node.
Create bastion node
Access bastion node
Calico checklist before an upgrade
Current status
Options available
Making sure it works
Rotate User AWS Credentials
Set Pingdom environment variables (Optional)
Target the live cluster
Set cluster related environment variables
Set the namespace name
Terraform Init
Terraform Plan/Apply
2. Identify the compromised terraform object
3. Destroy the compromised key
4. Let terraform create a new key
Rotate RDS Credentials
Let terraform create a new password
AWS Compromised Credentials
Steps for leaked credentials
Getting new credentials
Audit the compromised credentials
AWS Console Access
Steps to create/delete users
Activating MFA for new users
Make an S3 bucket public
Delete Prometheus Metrics
1. Identify the metrics you want to delete
2. Enable the admin interface in Prometheus
3. Launch a port-forward pod
4. Forward local traffic to Prometheus
5. Use curl (in another terminal) to hit the API endpoint
6. Use this script to delete multiple metrics
7. Clean up
Manually Plan/Apply Namespace Resources in live cluster
Start in the appropriate branch of the environments repo
Set Pingdom environment variables (Optional)
Target the live cluster
Set some environment variables
Set the namespace name
Terraform Init
Terraform Plan/Apply
Manually Delete Namespace Resources
Remove data from Elasticsearch
Stop the breach first
Things to know
Get yourself access
Build your query
Delete by query
Removing specific logs filtered by phrase
Deleting documents stored in warm storage
Export data from Elasticsearch into a CSV file
Workaround
Install es2csv
Usage
Manage Published Grafana Dashboard Snapshots
Steps to remove Snapshot
Delete terraform state lock
Command-line method
AWS Console method
Terraform command
Terraform state lock - Error refreshing state
How to Investigate Divergence Errors
Reproduce the plan
How to Investigate External-Dns Errors
Troubleshooting
Recycle-node
Recycle-all-nodes
Recycling process
Gotchas
Revoke auth0 kubeconfig access token
1. Revoke existing tokens generated from github
2. Recreate auth0_client.kubernetes
Provisioning EKS clusters
Pre-requisites
Environment Variables
Provisioning
1. VPC
2. Creating EKS cluster
3. Deploy components
4. Delete the EKS cluster
Change load balancer alias to the interface IPs in Route53
Request AWS to restart the health check
Change load balancer alias
Expanding Persistent Volumes created using StatefulSets
Velero - Cluster backups and disaster recovery
Backups
Why?
What?
How?
Cloud Platform helm chart repository
Performing CRUD on Cloud Platform Helm repository
Access EKS cluster
Pre-requisites
Create kubeconfig using aws command
Create kubeconfig manually using a template
Get an audit log from modsec
How do I check the audit log
Perform a search for the unique-id (obtained from the Kibana entry)
Terminal access to EKS managed nodes
Overview
Pre-requisites
Steps to get terminal access via Console
Steps to get terminal access via AWS cli
Access via SSH/SCP
Adding a route to connect to a TGW
Quick introduction
Transit Gateway
Making the change
Adding live-2 VPC to PTTP TGW
Moving away from Cloud Platform Transit Gateway account
Custom default-backend
Background
Use platform-level error page
Do not use platform-level error page
Onboarding into the Cloud Platform Team
People Team
Service Desk
Sit down with DM
Cloud Platform team
Access
Open Policy Agent policies
Adding a policy
Writing tests
References
Destroy Concourse Build Data
Overview
Steps to remove the build data
Leavers Guide
Revoking Access
Digital Services
AWS Accounts
Line manager actions
Scheduled PR Reminders
Steps required for new repositories
Grafana Dashboards
Kubernetes Number of Pods per Node
Dashboard Layout
Troubleshooting
Going on call
What’s expected?
Expected:
Not expected:
Where do I start?
What do I get for being on call?
Civil servants
Contractors
Cloud Platform Communications Plan
The Plan
Tips on format of communications
Examples
Things to include in incident communications
Example
Things to include in upgrade communications
Example
Sharing information with the wider Ministry of Justice and the Public
Tips and Tricks
Delete an RDS database snapshot
Check the expiration date of the SSL certificate for a live domain
Delete a “stuck” resource
Filter namespaces by specific label or annotation
Find all pods running on a specific worker node
Count pods running on all nodes
Output all records from Route53 as a CSV file
Add more RSS feeds to #cloud-platform-rss channel
Find files which don’t contain a particular string
Get AWS EC2 instance information for a node
Create a one-off job to repeat a cronjob
Hijack a concourse job container
Help when manually deleting AWS resources
Find out why your namespace isn’t deleting
Modsec false positives
Google Search Console and Indexing
Add a new runbook
Cloud Platform Runbooks
These runbooks are used and maintained by the MoJ Digital Cloud Platform team.