
Incident on 2021-06-09 - All users are unable to create new ingress rules, following bad ModSec Ingress-controller upgrade

  • Key events

    • First detected 2021-06-09 13:15
    • Repaired 2021-06-09 13:46
    • Incident declared 2021-06-09 13:54
    • Resolved 2021-06-09 13:58
  • Time to repair: 0h 31m

  • Time to resolve: 0h 43m

  • Identified: User reported in #ask-cloud-platform an error when deploying UAT application: kind Ingress: Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://modsec01-nx-modsec-admission.ingress-controllers.svc:443/networking/v1beta1/ingresses?timeout=10s: x509: certificate is valid for modsec01-nx-controller-admission, modsec01-nx-controller-admission.ingress-controllers.svc, not modsec01-nx-modsec-admission.ingress-controllers.svc

  • Impact: All ingress API calls were blocked, so no new ingresses could be created and no changes to existing ingresses could be deployed. This affected all user application deployments.

  • Context:

    • Occurred immediately following an upgrade to the ModSec Ingress-controller module v3.33.0, which appeared to deploy successfully
    • The upgrade caused any new ingress, or any change to an existing ingress, to be blocked by the ModSec validation webhook
    • Timeline: Timeline for the incident.
    • Slack thread: #ask-cloud-platform for the incident, #cloud-platform for the recovery.
  • Resolution: Rollback to ModSec Ingress-controller module v0.0.7

  • Review actions:

    • Find out why this issue wasn’t flagged in the test cluster: try to reproduce it, and consider whether another test is needed. Ticket #2972
    • Add test that checks the alerts in alertmanager in smoke tests. Ticket #2973
    • Add helloworld app that uses modsec controller, for the smoke tests to check traffic works. Ticket #2974
    • The new version of the ModSec module needs to work on EKS for live-1 and live (neither the old nor the new version works on live). Ticket #2975
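
The error in this incident came from a mismatch between the admission webhook's Service DNS name and the names listed in its certificate. As an illustration of the kind of check a smoke test could make, here is a minimal Python sketch. The function names are hypothetical and not part of any existing test suite. It assumes the error text has the shape shown under "Identified" and that the API server reaches the webhook at `<service>.<namespace>.svc`. It compares names exactly and does not handle wildcard certificate names.

```python
import re

# Matches the hostname-mismatch text seen in the incident:
#   "x509: certificate is valid for <name>, <name>, not <requested host>"
_X509_MISMATCH = re.compile(
    r"x509: certificate is valid for (?P<sans>.+), not (?P<host>\S+)"
)


def parse_x509_mismatch(message):
    """Return (certificate_names, requested_host) from the error, or None.

    Hypothetical helper: it only understands the message shape above.
    """
    match = _X509_MISMATCH.search(message)
    if match is None:
        return None
    names = [name.strip() for name in match.group("sans").split(",")]
    return names, match.group("host")


def cert_covers_service(cert_names, service, namespace):
    """True if the certificate lists the webhook Service's in-cluster name.

    Assumption: the webhook is called at <service>.<namespace>.svc.
    Exact string comparison only; wildcard names are not considered.
    """
    return f"{service}.{namespace}.svc" in cert_names


# Error text copied from the incident report.
INCIDENT_ERROR = (
    'failed calling webhook "validate.nginx.ingress.kubernetes.io": '
    "Post https://modsec01-nx-modsec-admission.ingress-controllers.svc:443"
    "/networking/v1beta1/ingresses?timeout=10s: "
    "x509: certificate is valid for modsec01-nx-controller-admission, "
    "modsec01-nx-controller-admission.ingress-controllers.svc, "
    "not modsec01-nx-modsec-admission.ingress-controllers.svc"
)

cert_names, requested_host = parse_x509_mismatch(INCIDENT_ERROR)
service_ok = cert_covers_service(
    cert_names, "modsec01-nx-modsec-admission", "ingress-controllers"
)
```

With the incident's error text, `service_ok` is false: the certificate only names the `modsec01-nx-controller-admission` Service, while the webhook was called at `modsec01-nx-modsec-admission`. A real smoke test would need to read the certificate names from the cluster rather than from an error message, which this sketch does not attempt.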