Incident on 2025-10-13 - RDS t4g instances unavailable
Key events
- First detected 2025-10-13 08:59
- Incident declared 2025-10-13 10:27
- Repaired 2025-10-13 17:38
- Resolved 2025-10-13 18:56
Identified:
Via a user report
- Background:
On the morning of Monday 13 October we received reports of RDS instances failing to start due to an error from AWS when starting instances that had been stopped overnight:
“Insufficient instance capacity for instance type db.t4g.micro in availability zone eu-west-2b; putting database instance into stopping”
- Impact:
RDS instances of the t4g class. The affected instances were used for pre-production environments (e.g. staging, development, test). Development teams were impacted because their instances were left in an unavailable state.
- Context:
  - Confirmed the issue via the AWS console
  - Declared an incident via an announcement in #cloud-platform update
  - Started re-starting the affected instances (a rough sketch of this retry approach follows this list)
  - Raised an issue with AWS at 10:29
  - Disabled the RDS auto start/stop at 16:42
  - Continued to restart instances, with some coming back up
  - All instances back up at 17:38
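As a rough illustration of the restart loop mentioned above, the sketch below retries starting a stopped RDS instance and gives up when AWS reports insufficient capacity. This is a minimal sketch, assuming boto3 with credentials configured for the affected account; the function name, timings and instance identifier are illustrative and were not part of the actual remediation.

```python
# Illustrative only: try to start a stopped RDS instance and wait to see
# whether it reaches "available". Assumes boto3 is installed and AWS
# credentials for the account are configured.
import time

import boto3

rds = boto3.client("rds", region_name="eu-west-2")


def try_start(identifier: str, wait_seconds: int = 600) -> bool:
    """Attempt one start of the instance and report whether it came up."""
    status = rds.describe_db_instances(DBInstanceIdentifier=identifier)[
        "DBInstances"
    ][0]["DBInstanceStatus"]
    if status == "stopped":
        try:
            rds.start_db_instance(DBInstanceIdentifier=identifier)
        except rds.exceptions.InsufficientDBInstanceCapacityFault:
            return False  # no capacity in the AZ yet; caller can retry later
    deadline = time.time() + wait_seconds
    while time.time() < deadline:
        status = rds.describe_db_instances(DBInstanceIdentifier=identifier)[
            "DBInstances"
        ][0]["DBInstanceStatus"]
        if status == "available":
            return True
        if status == "stopped":
            # AWS accepted the start but put the instance back into "stopped",
            # which is how the capacity error surfaced during this incident.
            return False
        time.sleep(30)
    return False


if __name__ == "__main__":
    # Hypothetical identifier; in practice this would loop over the affected instances.
    print(try_start("example-staging-db"))
```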
Resolution:
- All instances back up at 17:38
- AWS confirmed capacity is back at 18:56
Review actions:
- Can we detect a recurrence of this issue? Unknown; create a ticket to investigate potential ways to spot and alert on it, perhaps a Concourse job or an OpenSearch alert (a sketch of one option is included at the end of this section), though it may not be worth the effort for such a rare event.
- Could we react differently? Not significantly, although it took a bit of time between the first message in #ask and declaring an incident.
- Do we need dev databases to be multi-AZ? Look into this and consider making multi-AZ the default. Can we enable / permit single-AZ RDS instances?
- Can teams use containerised DBs?
- Write some comms describing the incident, with the reasons we can't / couldn't do anything about it before or after the fact.
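The detection action above could take the shape of a scheduled check over recent RDS events. The sketch below is one possible, unagreed approach, assuming boto3; the matching logic, time window and exit-code behaviour are assumptions, and the Concourse job or OpenSearch alert wrapping it would still need to be designed.

```python
# Illustrative detection sketch: scan recent RDS events for messages that
# mention insufficient capacity. Could be run from a scheduled (e.g. Concourse)
# job; a non-zero exit signals that something was found.
import sys

import boto3

rds = boto3.client("rds", region_name="eu-west-2")


def capacity_events(minutes: int = 60) -> list[str]:
    """Return recent db-instance event messages that look like capacity errors."""
    events = rds.describe_events(Duration=minutes, SourceType="db-instance")["Events"]
    return [
        f"{event['SourceIdentifier']}: {event['Message']}"
        for event in events
        if "insufficient" in event["Message"].lower()
        and "capacity" in event["Message"].lower()
    ]


if __name__ == "__main__":
    hits = capacity_events()
    if hits:
        print("Possible RDS capacity problem detected:")
        print("\n".join(hits))
        sys.exit(1)
```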