Skip to main content

Incident on 2020-09-07 - All users are unable to create new ingress rules

  • Key events

    • First detected 2020-09-07 12:39
    • Incident declared 2020-09-07 12:54
    • Resolved 2020-09-07 15:56
  • Time to repair: 3h 02m

  • Time to resolve: 3h 17m

  • Identified: The Ingress API refused 100% of POST requests.

  • Impact:

    • If a user were to provision a new service, they would be unable to create an ingress into the cluster.
  • Context:

    • Version 0.1.0 of the teams ingress controller module enabled the creation of a validationwebhookconfiguration resource.
    • By enabling this option we created a single point of failure for all ingress-controller pods in the ingress-controller namespace.
    • A new 0.1.0 ingress controller failed to create in the “live-1” cluster due to AWS resource limits.
    • Validation webhook stopped new rules from creating, with the error: Error from server (InternalError): error when creating "ingress.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post offender-categorisation-prod-nx-controller-admission.ingress-controllers.svc:443/extensions/v1beta1/ingresses?timeout=30s: x509: certificate signed by unknown authority
    • Initial investigation thread: https://mojdt.slack.com/archives/C514ETYJX/p1599478794246900
    • Incident declared: https://mojdt.slack.com/archives/C514ETYJX/p1599479640251900
  • Resolution: The team manually removed the all the additional admission controllers created by 0.1.0. They then removed the admission webhook from the module and created a new release (0.1.1). All ingress modules currently on 0.1.0 were upgraded to the new release 0.1.1.