Incident on 2021-06-09 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade
Key events
- First detected 2021-06-09 13:15
- Repaired 2021-06-09 13:46
- Incident declared 2020-06-09 13:54
- Resolved 2021-06-09 13:58
Time to repair: 0h 31m
Time to resolve: 0h 43m
Identified: User reported in #ask-cloud-platform an error when deploying UAT application:
kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://modsec01-nx-modsec-admission.ingress-controllers.svc:443/networking/v1beta1/ingresses?timeout=10s: x509: certificate is valid for modsec01-nx-controller-admission, modsec01-nx-controller-admission.ingress-controllers.svc, not modsec01-nx-modsec-admission.ingress-controllers.svc
Impact: It blocked all ingress API calls, so no new ingresses could be created, nor changes to current ingresses could be deployed, which included all user application deployments.
Context:
- Occurred immediately following an upgrade to the ModSec Ingress-controller module v3.33.0, which apparently successfully deployed
- It caused any new ingress or changes to current ingresses to be blocked by the ModSec Validation webhook
- Timeline: Timeline for the incident.
- Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
Resolution: Rollback to ModSec Ingress-controller module v0.0.7
Review actions:
- Find out why this issue didn’t get flagged in the test cluster - try to reproduce the issue - maybe need another test? Ticket #2972
- Add test that checks the alerts in alertmanager in smoke tests. Ticket #2973
- Add helloworld app that uses modsec controller, for the smoke tests to check traffic works. Ticket #2974
- Modsec module, new version, needs to be working on EKS for live-1 and live (neither old or new version work on live). Ticket #2975