Skip to main content

Incident on 2020-10-06 - Intermittent “micro-downtimes” on various services using dedicated ingress controllers

  • Key events

    • First detected 2020-10-06 08:33
    • Incident declared 2020-10-06 09:07
    • Repaired 2020-10-06 10:41
    • Resolved 2020-10-06 17:19
  • Time to repair: 2h 8m

  • Time to resolve: 8h 46m

  • Identified: User reported service problems in #ask-cloud-platform. Confirmed by checking Pingdom

  • Impact:

    • Numerous brief and intermittent outages for multiple (but not all) services (production and non-production) which were using dedicated ingress controllers
  • Context:

    • Occurred immediately after upgrading live-1 to kubernetes 1.17
    • 1.17 creates 2 additional SecurityGroupRules per ingress-controller, this took us over a hard AWS limit
    • Timeline: Timeline for the incident.
    • Slack thread: Slack thread for the incident.
  • Resolution:

    • Migrate all ingresses back to the default ingress controller