RDS Snapshots

RDS snapshots have multiple purposes: migrations, backups, etc. When RDS DB instances are created using the Cloud Platform RDS terraform module, an IAM user account is created for management purposes. This user can create, delete, copy, and restore RDS snapshots.

This team runbook provides technical guidance for testing and verifying the RDS snapshot restoration process.

Prerequisites

Before performing an RDS restore from snapshot:

Ensure you have AWS CLI access with appropriate permissions
Verify the existence of snapshots for the target database
Identify the correct DB instance identifier
Check any existing snapshots with potential naming conflicts

Considerations

The amount of manual snapshots per AWS account is limited, so it’s important to cleanup old snapshots
Daily snapshots are provided “out of the box”, and do not count towards the “Manual Snapshots” total
Managing snapshots is the teams’ responsibility (as are snapshot restores), so teams are responsible for cleaning up unneeded manual snapshots in order to avoid hitting our AWS account limits

Restoring live services from an RDS DB snapshot

If you want to restore your production RDS DB instance from a previous snapshot, either because the database is corrupted or the whole database was deleted, here is the procedure:

1. Get the Database Instance Identifier

aws rds describe-db-instances --query 'DBInstances[*].[DBInstanceIdentifier]' --output text

Or, alternatively:

cloud-platform decode-secret -n <your namespace> -s <the secret storing your RDS details>

Look for a line like this in the output:

{
  "rds_instance_address": "cloud-platform-abcdefghij.abcdefghijkl.eu-west-2.rds.amazonaws.com"
}

The db-instance-identifier is the cloud-platform-abcdefghij part.

For more detailed information about a specific instance:

aws rds describe-db-instances --db-instance-identifier <db-instance-identifier>

Example output will include critical information:

{
  "DBInstanceIdentifier": "cloud-platform-klmno12345",
  "DBInstanceClass": "db.t4g.micro",
  "Engine": "mysql",
  "DBInstanceStatus": "available",
  "MasterUsername": "cpuser123",
  "DBName": "dbexample123",
  "Endpoint": {
    "Address": "cloud-platform-klmno12345.cdwm328dlye6.eu-west-2.rds.amazonaws.com",
    "Port": 3306,
    "HostedZoneId": "Z1TTGA775OQIYO"
  },
  "AllocatedStorage": 20,
  "InstanceCreateTime": "2025-02-18T15:32:07.509000+00:00",
  "PreferredBackupWindow": "03:29-03:59",
  "BackupRetentionPeriod": 7
}

2. List Available Snapshots

aws rds describe-db-snapshots --db-instance-identifier <db-instance-identifier> --query 'DBSnapshots[*].[DBSnapshotIdentifier,SnapshotCreateTime]' --output text

This will list all available snapshots for the specified database along with their creation timestamps.

5. Add SKIP_FILE to Pause Pipeline

Important: Before making any changes to the RDS configuration, you must first create a PR to add a SKIP_FILE:

Create a PR to add an APPLY_PIPELINE_SKIP_THIS_NAMESPACE file at the root of the namespace
Merged it
Update your local repo with the latest changes from main branch

This ensures the pipeline doesn’t try to apply changes while you’re in the middle of the restoration process.

4. Create a Test Database with Sample Data (For Testing)

Before proceeding with any live restoration, it’s recommended to test the process:

3.1 Seed Test Data Process

Create a port-forward pod to establish connection with the RDS database:

   kubectl -n <namespace> run -it --rm --image=mysql:5.7 mysql-client -- bash

Connect to the RDS database:

   mysql -h127.0.0.1 -P3306 -u<username> -p<password>

Example: bash mysql -h127.0.0.1 -P3306 -ucpuser123 -pmysecretpassword

Select your database:

   USE <database_name>;

Example: sql USE dbexample123;

Create a test table:

   CREATE TABLE IF NOT EXISTS restore_test (
     id INT AUTO_INCREMENT PRIMARY KEY,
     info VARCHAR(255)
   );

Seed test data:

   INSERT INTO restore_test (info) VALUES ('Test entry 1'), ('Test entry 2');

Verify data insertion:

   SELECT * FROM restore_test;

3.2 Create a Manual Snapshot

Before making any changes, create a manual snapshot as a safety measure:

aws rds create-db-snapshot \
  --db-instance-identifier <db-instance-identifier> \
  --db-snapshot-identifier <custom-identifier-with-date>

5. Verify Storage Configuration

Check the current storage allocation settings for the database:

aws rds describe-db-instances --db-instance-identifier <db-instance-identifier>

Note the AllocatedStorage and MaxAllocatedStorage values. By default, these are set to 20GB and 500GB respectively if not explicitly defined in the module.

6. Modify the RDS Module for Restoration

Create a PR to modify the RDS module in the rds.tf file to include:

module "example_team_rds" {
  source = "github.com/ministryofjustice/cloud-platform-terraform-rds-instance?ref=5.1"
  # ... other parameters ...
  snapshot_identifier = "<your-snapshot-identifier>"
  # ... other parameters ...
}

Important: When using a snapshot for restoration, do not attempt to change storage parameters at the same time. The RDS instance will adopt the storage configuration from the snapshot.

It is crucial to retain other original settings associated with the DB instance when restoring from the snapshot. Details such as db-engine-version, db-engine, rds-family must be the same as the original instance when the snapshot was taken.

Do not merge this PR yet - there are additional steps required first.

7. Rename the Current Database Instance Before Merging

As a Cloud Platform team member handling this restoration request:

Identify the DB instance identifier provided by the user requesting the restoration
Rename the current database in the AWS console:
- Navigate to AWS Console → RDS → Databases
- Select the target database
- Choose “Modify”
- Append .old or a similar suffix to the DB instance identifier
- Apply the change immediately
Once the DB has been renamed, merge the PR with the RDS changes once approved.

This process preserves the database endpoint and credentials when the new instance is created from the snapshot.

8. Handle Final Snapshot Conflicts

The pipeline may fail due to final snapshot name conflicts. This occurs because Terraform attempts to create a final snapshot with the same name during recreation.

Why This Happens:

When you delete an RDS instance, AWS/Terraform automatically names the final snapshot as {db-instance-identifier}-finalsnapshot.

Example: If your DB instance is named mydb, the final snapshot becomes mydb-finalsnapshot.

Conflict Scenario:
- You delete mydb → Final snapshot mydb-finalsnapshot is created.
- Later, you create a new mydb instance, modify it, and try to delete it again.
- Terraform/AWS attempts to create mydb-finalsnapshot again → Conflict.

If the pipeline fails with the error DBSnapshotAlreadyExists, execute:

aws rds delete-db-snapshot --db-snapshot-identifier <db-instance-identifier>-finalsnapshot

Then rerun the failed pipeline.

Expected Timeline and Status Changes

Total downtime: ~26 minutes from Deleting to Available. This may vary depending on how much data is being restored.

Verification After Restoration

Once the new DB instance is restored and available, verify that the data has been successfully restored:

Create a port-forward pod to establish connection with the restored RDS database:

   kubectl -n <namespace> run -it --rm --image=mysql:5.7 mysql-client -- bash

Connect to the database to verify data restoration:

   mysql -h $DB_HOST -u $DB_USERNAME -p$DB_PASSWORD $DB_NAME

Or use a direct connection:

   mysql -h127.0.0.1 -P3306 -u<username> -p<password>

Run basic queries to confirm data accessibility:

   SHOW DATABASES;
   USE <database_name>;
   SHOW TABLES;
   SELECT * FROM restore_test;  # Check if test data is present

Verify the connection string remains the same:

   cloud-platform decode-secret -n <your namespace> -s <the secret storing your RDS details>

Look for the rds_instance_endpoint value and confirm it matches the original endpoint.

Version Downgrade Limitations and Process

Important: Version Downgrades Cannot Be Done In-Place

AWS does not support downgrading a database to an earlier version on the same instance. For example, if a database has been upgraded from PostgreSQL 16.3 to 17, it cannot be downgraded back to 16.3 using the standard snapshot restoration process outlined above.

Example of a Failed Downgrade Scenario

The following failed scenario was observed and documented:

User had upgraded their RDS from PostgreSQL 16.3 to 17
User wanted to restore back to 16.3 using the process:
- Found an older snapshot identifier from before the upgrade
- Raised PR with the snapshot identifier
- Manually renamed the database in the console
- Merged the PR

What actually happened: * Terraform apply reverted the manual rename * Started deleting the database, which also deleted all automatic snapshots * First apply pipeline failed with a parameter group error * Second attempt failed because the snapshot no longer existed (deleted during first apply)

Correct Process for Version Downgrades

To properly downgrade an RDS database to a previous version:

Take a snapshot of the current database to preserve it (in case restoration is needed later):

   aws rds create-db-snapshot \
     --db-instance-identifier <current-db-instance-identifier> \
     --db-snapshot-identifier <current-version-snapshot-name>

Create a new database instance with:
- A different identifier than the current database
- Engine version matching the snapshot you want to restore from
- RDS family matching the snapshot you want to restore from
- All other configuration parameters consistent with the source of the snapshot
In the terraform module:

   module "example_team_rds" {
     source               = "github.com/ministryofjustice/cloud-platform-terraform-rds-instance?ref=5.1"
     # ... other parameters ...
     db_instance_identifier = "<new-instance-name>"  # Must be different from current DB
     db_engine_version      = "16.3"  # Earlier version
     rds_family             = "postgres16"  # Family matching earlier version
     snapshot_identifier    = "<old-version-snapshot>"
     # ... other parameters ...
   }

Note: This approach requires a cutover as the new database will have a different endpoint and connection string.

After verifying the new downgraded database works properly, the old one can be deleted if needed.

Known Issues and Solutions

Final Snapshot Naming Conflicts

Issue: The pipeline fails because it cannot create a final snapshot with the same name as an existing snapshot.

Error Message: DBSnapshotAlreadyExists

Solution: - Delete the conflicting snapshot manually:

  aws rds delete-db-snapshot --db-snapshot-identifier <db-instance-identifier>-finalsnapshot

Engine Version Conflicts

Issue: Sometimes the pipeline throws an error: InvalidParameterCombination: Cannot upgrade mariadb from 10.4.13 to 10.4.8

Solution: This will happen if AWS upgraded the minor version automatically and terraform stored the exact db_engine_version in its state.

To fix this, change the db_engine_version in your rds.tf from 10.4 to the one AWS upgraded to, in this case 10.4.13, and repeat the previous steps again.

Once the terraform apply is successful, and the new DB instance has been created, revert the changes made for db_engine_version.

Read Replica Issues

Issue: If there is an RDS read replica for the DB instance that is being restored, Terraform will throw the error: Error: cannot elect new source database for replication

Solution: Set count = 0 or comment out the read replica code (and any Kubernetes secrets relating to the read replica) before restoration. After successful restoration, re-enable the DB read replica code by setting count = 1 or uncommenting the code.

Note that the read replica will have a new database identifier once it is recreated.

Metrics and Observations

Based on our testing:

Downtime: Expect ~26 minutes of database unavailability. This may vary depending on how much data is being restored.
Connection String: Remains unchanged after restoration (provided the restore is on the same database instance)
Storage Configuration: Inherits from the snapshot
Final Snapshots: May require manual cleanup to prevent pipeline failures
Status Progression: Follows predictable pattern: Deleting → Creating → Modifying → Rebooting → Backing-up → Modifying → Available

This page was last reviewed on 3 March 2025. It needs to be reviewed again on 3 September 2025 by the page owner #cloud-platform .

This page was set to be reviewed before 3 September 2025 by the page owner #cloud-platform. This might mean the content is out of date.