Skip to main content

Incident on 2020-08-14 - Ingress-controllers crashlooping

  • Key events

    • First detected 2020-08-14 10:43
    • Incident declared 2020-08-14 11:01
    • Resolved 2020-08-14 11:38
  • Time to repair: 0h 37m

  • Time to resolve: 0h 55m

  • Identified: There are 6 replicas of the ingress-controller pod and 2 out of the 6 were crashlooping. A restart of the pods did not resolve the issue. As per a normal runbook process, a recycle of all pods was required. However after restarting pods 4 and 5, they also started to crashloop. The risk was when restarting pods 5 and 6 - all 6 pods could be down and all ingresses down for the cluster.

  • Impact:

    • Increased risk for all ingresses failing in the cluster if all 6 ingress-controller pods are in a crashloop state.
  • Context:

  • Resolution: A restart of the leader ingress-controller pod was required so the other pods in the replica-set could connect and get the latest nginx.config file.