Replacing Live-1 Would be Hard Because…
Treating clusters as cattle, not pets, is one of our strategic goals.
This document collects all the reasons we currently treat live-1 as a pet, so that we keep them top of mind and prioritise solving them.
When we have an incident, why don’t we build a new cluster from scratch, instead of nursing live-1 back to health?
Reasons replacing live-1 is hard:
- Teams would have to adjust their deployment pipelines to target the new cluster (this doesn’t seem particularly hard)
- Teams would need to rotate the service account credentials used by their CircleCI (or other CI) deployment pipelines (this doesn’t seem like that big a deal either, TBH)
- Something something cert-manager?
- Something something external-dns?
- Restoring resources we don’t control (e.g. certificates) is solved by Velero
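To make the first two points concrete, here is a minimal sketch of how a pipeline could build its kubeconfig entirely from CI variables, so that retargeting a new cluster and rotating the service account token both come down to changing variable values. It is an illustration under assumptions, not a description of how any team's pipeline works today: the variable names (`KUBE_CLUSTER_NAME`, `KUBE_API_SERVER`, `KUBE_CA_CERT_B64`, `KUBE_TOKEN`, `KUBE_NAMESPACE`), the `deployer` user name, and the `live-2` cluster in the usage note are all hypothetical.

```python
import json

# Hypothetical CI variable names; real pipelines may use different ones.
REQUIRED_VARS = (
    "KUBE_CLUSTER_NAME",
    "KUBE_API_SERVER",
    "KUBE_CA_CERT_B64",
    "KUBE_TOKEN",
    "KUBE_NAMESPACE",
)


def kubeconfig_from_env(env):
    """Build a single-cluster kubeconfig (as a dict) from a mapping of variables.

    `env` is any mapping, e.g. os.environ inside a CI job.
    """
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing variables: " + ", ".join(missing))

    cluster = env["KUBE_CLUSTER_NAME"]
    return {
        "apiVersion": "v1",
        "kind": "Config",
        "clusters": [{
            "name": cluster,
            "cluster": {
                "server": env["KUBE_API_SERVER"],
                "certificate-authority-data": env["KUBE_CA_CERT_B64"],
            },
        }],
        # "deployer" is an illustrative user name for the service account.
        "users": [{"name": "deployer", "user": {"token": env["KUBE_TOKEN"]}}],
        "contexts": [{
            "name": cluster,
            "context": {
                "cluster": cluster,
                "user": "deployer",
                "namespace": env["KUBE_NAMESPACE"],
            },
        }],
        "current-context": cluster,
    }


def render_kubeconfig(env):
    """Serialise the kubeconfig as JSON, which is also valid YAML."""
    return json.dumps(kubeconfig_from_env(env), indent=2)
```

In a CI job this could be called as `render_kubeconfig(os.environ)` and the result written to the file that `KUBECONFIG` points at. Under this approach, moving a pipeline to a hypothetical `live-2` cluster would mean updating the five variables and nothing else, which supports the guess above that these two items are not the hard part.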