Cloud Platform Runbooks
Table of contents
Cloud Platform Runbooks
Cloud Platform Team Alliance
How We Work
Getting Help
Sprints and ceremonies
Firebreak
Story
Story points
The Board / Tickets
Adding tickets
Making changes to code
Reviewing/Merging PRs
Support Squad
The 🔨 Hammer of Justice
Support Tickets
Documentation
Incident Process
1. Confirm that the event constitutes an incident
2. Declare the incident
Examples
3. Assign roles
3.1 Incident Lead
3.2 Scribe
3.3 Communications Lead
Transferring roles
4. Fix the problem
5. End the incident
6. Post-incident procedure
Incident Log
Q3 2024 (July-September)
Incident on 2024-09-20 - EKS Subnet Route Table Associations destroyed
Incident on 2024-07-25
Q1 2024 (January-April)
Incident on 2024-04-15 - Prometheus restarted during WAL reload several times which resulted in missing metrics
Q4 2023 (October-December)
Incident on 2023-11-01 10:41 - Prometheus restarted several times which resulted in missing metrics
Q3 2023 (July-September)
Incident on 2023-09-18 15:12 - Lack of Disk space on nodes
Incident on 2023-08-04 10:09 - Dropped logging in kibana
Incident on 2023-07-25 15:21 - Prometheus on live cluster DOWN
Incident on 2023-07-21 09:31 - VPC CNI not allocating IP addresses
Q2 2023 (April-June)
Incident on 2023-06-06 11:00 - User services down
Q1 2023 (January-March)
Incident on 2023-02-02 10:21 - CJS Dashboard Performance
Incident on 2023-01-11 14:22 - Cluster image pull failure due to DockerHub password rotation
Incident on 2023-01-05 08:56 - CircleCI Security Incident
Q4 2022 (October-December)
Incident on 2022-11-15 16:03 - Prometheus eks-live DOWN.
Q3 2022 (July-September)
Incident on 2022-07-11 09:33 - Slow performance for 25% of ingress traffic
Q1 2022 (January to March)
Incident on 2022-03-10 11:48 - All ingress resources using *.apps.live.cloud-platform urls showing certificate issue
Incident on 2022-01-22 11:57 - some DNS records got deleted at the weekend
Q4 2021 (October to December)
Incident on 2021-11-05 - ModSec ingress controller is erroring
Q3 2021 (July-September)
Incident on 2021-09-30 - SSL Certificate Issue in browsers
Incident on 2021-09-04 22:05 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN
Incident on 2021-07-12 15:24 - All ingress resources using *apps.live-1 domain names stop working
Q2 2021 (April-June)
Incident on 2021-06-09 12:47 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade
Incident on 2021-05-10 12:15 - Apply Pipeline downtime due to accidental destroy of Manager cluster
Q1 2021 (January - March)
No incidents declared
Q4 2020 (October - December)
Incident on 2020-10-06 09:07 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers
Q3 2020 (July - September)
Incident on 2020-09-28 13:10 - Termination of nodes updating kops Instance Group.
Incident on 2020-09-21 18:27 - Some cloud-platform components destroyed.
Incident on 2020-09-07 12:54 - All users are unable to create new ingress rules
Incident on 2020-08-25 11:26 - Connectivity issues with eu-west-2a
Incident on 2020-08-14 11:01 - Ingress-controllers crashlooping
Incident on 2020-08-07 16:39 - Master node provisioning failure
Q2 2020 (April - June)
Incident on 2020-08-04 17:13
Incident on 2020-04-15 10:58 Nginx/TLS
Q1 2020 (January - March)
Incident on 2020-02-25 10:58
Incident on 2020-02-18 14:13 UTC
Incident on 2020-02-12 11:45 UTC
About this incident log
Template
Incident on YYYY-MM-DD HH:MM - [Brief description]
Impact:
Context:
Resolution:
Review actions:
Change Process in Cloud Platform
Making Changes to cloud-platform-infrastructure
Making Changes to Terraform modules
Making Changes to Helm Charts
Making changes to environments (Service Teams)
Communications
Add Concourse to a test cluster
Pre-requisites
Process
Upgrade EKS Terraform Module
Pre-requisites
Upgrade Steps
Upgrade with no breaking changes
Upgrade with breaking changes
Upgrade EKS cluster
Pre-requisites
Creating Cluster Upgrade GitHub Issues
Upgrade Steps
Compatibility Check
Preparing for upgrade
Monitoring the upgrade
Starting the upgrade
Finishing the upgrade
Finishing touches
Upgrade EKS addons
Listing available EKS upgrades
eksctl Install
Preparing for upgrade
Starting the upgrade
Finish the upgrade
Upgrade AMI Version
Recycle all nodes
Upgrade cluster components
Planning
Testing the upgrade in a test cluster
Setup environment
Create test cluster
Run a shell in the tools image
Authenticate to the test cluster
Run the integration tests
Testing the upgrade
Things to observe when testing the upgrade
Performing the upgrade
Making changes to EKS node groups, instance types, or launch templates
Process for recycling all nodes in a cluster
Notes:
Useful commands:
Monitoring nodes
Upgrade Terraform Version
Introduction
Recommendations
Caveats
How to perform the upgrade - divide and conquer
Before the upgrade
Environments state files
Infrastructure state files
Container Images used by Cluster Components
How to update this runbook
Check current component image versions
Latest version for k8s 1.28
Latest version available
Urgency
calico-apiserver
calico-system
cert-manager
concourse
external-secrets-operator
gatekeeper-system
ingress-controllers
kube-system
kuberhealthy
kuberos
logging
monitoring
overprovision
tigera-operator
trivy-system
velero
Delete a cluster
Delete the cluster with Concourse delete-cluster pipeline
Delete an EKS cluster manually
Upgrade user components
Making changes to your module
Semantic Versioning and release
Upgrading a user component in environments
Moving components modules into core
Notes on the pipeline
Process
Disaster Recovery
Updating Prisoner Content Hub WAF
What is .terraform.lock.hcl?
Working with .terraform.lock.hcl files
Rules
Committing changes to the lock file
Add nodes to the AWS EKS cluster
Requirements
Cluster configuration:
Credentials rotation for auth0 apps
Preparation
1) Taint resources (terraform)
2) Apply changes within components (terraform)
3) Verifying changes
4) Update Manager cluster within components (terraform)
Monitor EKS Cluster
Monitoring with K9s
Installation
Launching K9s
Monitoring Nodes
Monitoring Pods
Monitoring Events
Further reading
Monitoring with Stern
Basic Usage
Further reading
Git-crypt
Adding new user to the keyring
Rotating the git-crypt key
Add a custom domain
Add a new Alertmanager receiver and a slack webhook
Pre-requisites
Creating a new receiver set
Information Alerts
Create a Pingdom integration id
How to create integration id (webhook)
Cloud Platform Disaster Recovery Scenarios
Losing a Namespace
Impact
Possible Cause
Restore process
Losing a Kubernetes Component or Object
Impact
Possible Cause
Restore process
Losing the whole cluster
Impact
Possible Cause
How this plan is tested:
Assumptions
Restore process
Deleted terraform state
Impact
Possible Cause
Restore process
Recovering more complex scenarios
Resolving a PartiallyFailed backup alert
Create and access bastion node.
Create bastion node
Access bastion node
Rotate User AWS Credentials
Set pingdom environment variables (Optional)
Target the live cluster
Set cluster-related environment variables
Set the namespace name
Terraform Init
Terraform Plan/Apply
2. Identify the compromised terraform object
3. Destroy the compromised key
4. Let terraform create a new key
AWS Compromised Credentials
Steps for leaked credentials
Getting new credentials
Audit the compromised credentials
AWS Console Access
Steps to create/delete Cloud Platform team users
Activating MFA for new users
Modifying Cloud Platform users permissions
Troubleshooting for modifying Cloud Platform users permissions
Make an S3 bucket public
Delete Prometheus Metrics
1. Identify the metrics you want to delete
2. Enable the admin interface in Prometheus
3. Launch a port-forward pod
4. Forward local traffic to Prometheus
5. Use curl (in another terminal) to hit the API endpoint
6. Use this script to delete multiple metrics
7. Clean up
Manually Plan/Apply Namespace Resources in live cluster
Start in the appropriate branch of the environments repo
Set pingdom environment variables (Optional)
Target the live cluster
Set some environment variables
Set the namespace name
Terraform Init
Terraform Plan/Apply
Manually Delete Namespace Resources
Prerequisites
Environment Variables
Deleting a namespace
Locating PR number
Manage Published Grafana Dashboard Snapshots
Steps to remove Snapshot
Modsec logging architecture
Debugging
Export data from Elasticsearch into a CSV file
Workaround
Install es2csv
Usage
OpenSearch best practices
Some general info about shards
FreeStorageSpaceTooLow alerts
Connecting to the Elasticsearch API
Connecting to the OpenSearch API
Verify you’re connected to the API
Remove data from Elasticsearch
Stop the breach first
Things to know
Get yourself access
Build your query
Delete by query
Removing specific logs filtered by phrase
Deleting documents stored in warm storage
Kibana PodSecurity Violations Alert
Kibana Alert/Monitor
Checking logs for PSA violations in Kibana
Fixing PSA Violations
Slack Alert
Delete terraform state lock
Command-line method
AWS Console method
Terraform command
Terraform state lock - Error refreshing state
How to Investigate Divergence Errors
Reproduce the plan
How to Investigate External-Dns Errors
Troubleshooting
Invalid Change Batch
Rate Limited / Throttled
How to Investigate PrometheusOperatorReconcile Errors
Troubleshooting
Analyze VPC Flow Logs
Recycle-node
Recycle-all-nodes
Recycling process
High level method
Revoke auth0 kubeconfig access token
1. Revoke existing tokens generated from github
2. Recreate auth0_client.kubernetes
Provisioning EKS clusters
Pre-requisites
Environment Variables
Provisioning an EKS cluster with the cloud-platform CLI
Manually provisioning a cluster
1. VPC
2. Creating EKS cluster
3. Deploy core
4. Deploy components
Deleting your test cluster
Provisioning a custom cluster
Creating a live-like cluster
Pre-requisites
Setting cluster size to match Live
Installing live components and test applications
Upgrading a live-like test cluster
Monitoring the upgrade
Final Tests
Tearing down
Change load balancer alias to the interface IPs in Route53
Request AWS to restart the health check
Change load balancer alias
Expanding Persistent Volumes created using StatefulSets
Velero - Cluster backups and disaster recovery
Backups
Why?
What?
How?
Cloud Platform helm chart repository
Performing CRUD on Cloud Platform Helm repository
Access EKS cluster
Pre-requisites
Create kubeconfig using aws command
Create kubeconfig manually using a template
OpenSearch modsec setup
Get an audit log from modsec (when fluent-bit is not pushing to OpenSearch)
How do I check the audit log
Perform a search for the unique-id (obtained from the Kibana entry)
Debugging AWS Console read-only access issues
GitHub teams Principal Tag character limit exceeded
Terminal access to EKS managed nodes
Overview
Pre-requisites
Steps to get terminal access via Console
Steps to get terminal access via AWS cli
Access via SSH/SCP
Adding a route to connect to the MOJ Transit Gateway
Quick introduction
Transit Gateway
Making the change
Adding routes from live-2 VPC to MoJ Transit Gateway
Destroy Concourse Build Data
Overview
Steps to remove the build data
Onboarding into the Cloud Platform Team
People Team
Service Desk
Sit down with DM
Cloud Platform team
Access
Custom default-backend
Background
Creating your own custom error page
1. Create your docker image
2. Creating a service and deployment
3. Define annotations in your ingress file.
Use the platform-level error page
Serve all errors from your custom default backend
Open Policy Agent policies
Leavers Guide
Revoking Access
Digital Services
AWS Accounts
Other 3rd Party Accounts access removal
Line manager actions
Pushing logs to the SOC team
1. Cloudtrail logs
Architecture
2. live-1 VPC FlowLogs
Architecture
3. Route53 logs
Architecture
4. EKS logs
Todo
IAM User access keys rotation
Scheduled PR Reminders
Steps required for new repositories
Grafana Dashboards
Kubernetes Number of Pods per Node
Dashboard Layout
Troubleshooting
Fixing “failed to load dashboard” errors
Fixing “duplicate dashboard uid” errors
Going on call
What’s expected?
Expected:
Not expected:
Where do I start?
What do I get for being on call?
Civil servants
Cloud Platform Communications Plan
The Plan
Tips on format of communications
Examples
Things to include in incident communications
Example
Things to include in upgrade communications
Example
Sharing information with the wider Ministry of Justice and the Public
Tips and Tricks
Delete an RDS database snapshot
Check the expiration date of the SSL certificate for a live domain
Delete a “stuck” resource
Filter namespaces by specific label or annotation
Find all pods running on a specific worker node
Count pods running on all nodes
Add more RSS feeds to #cloud-platform-rss channel
Find files which don’t contain a particular string
Get AWS EC2 instance information for a node
Create a one-off job to repeat a cronjob
Hijack a concourse job container
Help when manually deleting AWS resources
Find out why your namespace isn’t deleting
Modsec false positives
Google Search Console and Indexing
Investigating blocked ingress spikes
Communication
Other Links
Debugging 101
Grafana Dashboards
Kubernetes / View / Nodes
Kubernetes / Compute Resources / Node (Pods)
Kubernetes / Views / Pods
Kubernetes Nginx Ingress Controller NextGen - DevOps Nirvana2
Add a new runbook
Elasticsearch Storage Issues
Hot/Warm Node Parity
Manually migrating failed indexes
Further Reading
Cloud Platform Runbooks
These runbooks are used and maintained by the MoJ Digital Cloud Platform team.