Cloud Platform Runbooks

Cloud Platform Team Alliance
How We Work
CP30
Incident Process
Incident Log
Change Process in Cloud Platform
Add Concourse to a test cluster
- Pre-requisites
- Process
Upgrade AMI Version
Recycle all nodes
Upgrade EKS addons
Upgrade EKS cluster
Upgrade EKS Terraform Module
- Pre-requisites
- Upgrade Steps
  - Upgrade with no breaking changes
  - Upgrade with breaking changes
Making changes to EKS node groups, instances types, or launch templates
- Process for recycling all nodes in a cluster
Upgrade Terraform Version
Upgrade cluster components
- Planning
- Testing the upgrade in a test cluster
Upgrade user components
Container Images used by Cluster Components
Delete a cluster
- Delete the cluster with Concourse delete-cluster pipeline
- Delete an EKS cluster manually
Debug and Recycle Node with error loading seccomp filter errno: 524
Updating Prisoner Content Hub WAF
Moving components modules into core
What is .terraform.lock.hcl?
- Working with .terraform.lock.hcl files
  - Rules
  - Committing changes to the lock file
Add nodes to the AWS EKS cluster
- Requirements
  - Cluster configuration:
Credentials rotation for auth0 apps
Monitor EKS Cluster
- Monitoring with K9s
- Monitoring with Stern
  - Basic Usage
  - Further reading
Git-crypt
- Adding new user to the keyring
- Rotating the git-crypt key
Add a custom domain
Add a new Alertmanager receiver and a slack webhook
Alertmanager Receivers Checker
Create a Pingdom integration id
- How to create integration id (webhook)
Cloud Platform Disaster Recovery Scenarios
Create and access bastion node
- Create bastion node
- Access bastion node
Rotate User AWS Credentials
Deleting RDS Option Group Error
- Problem Description
  - Scenario 1: Deleting Option Group Before RDS Instance
  - Scenario 2: Deleting Option Group After RDS Instance
- Resolution Steps
  - For Scenario 1:
  - For Scenario 2:
RDS Data Loss Recovery
- Restoration Process
ElastiCache create failed - insufficient AZ capacity
AWS Compromised Credentials
AWS Console Access
Make an S3 bucket public
Delete Prometheus Metrics
Manually Plan/Apply Namespace Resources in live cluster
Manually Delete Namespace Resources
Modsec logging architecture
- Debugging
Manage Published Grafana Dashboard Snapshots
- Steps to remove Snapshot
OpenSearch Log Restore Runbook
Remove data from OpenSearch
RDS Snapshots
OpenSearch best practices
OpenSearch PodSecurity Violations Alert
Delete terraform state lock
Terraform state lock - Error refreshing state
How to Investigate Divergence Errors
- Reproduce the plan
How to Investigate External-Dns Errors
- Troubleshooting
  - Invalid Change Batch
  - Rate Limited / Throttled
How to Investigate PrometheusOperatorReconcile Errors
- Troubleshooting
Analyze VPC Flow Logs
Recycle-node
Recycle-all-nodes
- Recycling process
  - High level method
Revoke auth0 kubeconfig access token
- 1. Revoke existing tokens generated from github
- 2. Recreate auth0_client.kubernetes
Creating a live-like cluster
Provisioning EKS clusters
Change load balancer alias to the interface IPs in Route53
- Request AWS to restart the health check
- Change load balancer alias
Expanding Persistent Volumes created using StatefulSets
Velero - Cluster backups and disaster recovery
- Backups
  - Why?
  - What?
  - How?
Dependabot changes
- Prerequisites
- Steps
Cloud Platform helm chart repository
- Performing CRUD on Cloud Platform Helm repository
Create custom cluster
Expanding a PersistentVolumeClaim (PVC) for a StatefulSet
Access EKS cluster
- Pre-requisites
  - Create kubeconfig using aws command
  - Create kubeconfig manually using a template
Manually Rotate Chainguard Secret
- Overview of how we use the Chainguard image
- Rotate Chainguard Secret
Incident Response Exercises
OpenSearch modsec setup
- Get an audit log from modsec (when fluent-bit is not pushing to OpenSearch)
  - How do I check the audit log
  - Perform a search for the unique-id (obtained from the OpenSearch entry)
Debugging AWS Console read-only access issues
- GitHub teams Principal Tag character limit exceeded
Terminal access to EKS managed nodes
Onboarding into the Cloud Platform Team
Adding a route to connect to the MOJ Transit Gateway
Open Policy Agent policies
Destroy Concourse Build Data
- Overview
- Steps to remove the build data
Custom default-backend
Blocking Public IP Address from EKS Cluster
- Introduction
- Adding deny rules to the public network ACL
Leavers Guide
- Revoking Access
- Line manager actions
Pushing logs to the SOC team
Scheduled PR Reminders
- Steps required for new repositories
Grafana Dashboards
- Kubernetes Number of Pods per Node
  - Dashboard Layout
- Troubleshooting
  - Fixing “failed to load dashboard” errors
  - Fixing “duplicate dashboard uid” errors
Going on call
Cloud Platform Communications Plan
Google Search Console and Indexing
Tips and Tricks
Add a new runbook
OpenSearch Storage Issues
How to Investigate cert-manager Errors
- Troubleshooting
  - Invalid Change Batch
  - Current fix
Debugging 101
- Grafana Dashboards
Identifying and Managing Untagged AWS Resources
- Overview
- Step-by-Step Process
Investigating blocked ingress spikes
- Communication
- Other Links

Cloud Platform Runbooks

These runbooks are used and maintained by the MoJ Digital Cloud Platform team.


All content is available under the Open Government Licence v3.0, except where otherwise stated

© Crown copyright