
Incident on 2023-07-25 - Prometheus on live cluster DOWN

  • Key events

    • First detected: 2023-07-25 14:05
    • Incident declared: 2023-07-25 15:21
    • Repaired: 2023-07-25 15:55
    • Resolved: 2023-07-25 15:55
  • Time to repair: 1h 50m

  • Time to resolve: 1h 50m

  • Identified: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN

  • Impact: Prometheus was not available. The Cloud Platform lost monitoring for a period of time.

  • Context:

    • 2023-07-25 14:05: PagerDuty High Priority alert from Pingdom that Prometheus - live healthcheck is DOWN. The team acknowledged the alert and checked the state of the Prometheus server. Prometheus was erroring on rule evaluation and had exited with code 137 (SIGKILL)
    • 2023-07-25 14:09: The Prometheus pod was in a Terminating state
    • 2023-07-25 14:17: The node where Prometheus was running went into a NotReady state
    • 2023-07-25 14:22: Drained the monitoring node, which moved Prometheus to another monitoring node
    • 2023-07-25 14:56: After moving to the new node, Prometheus restarted again shortly after coming back up, and the node was returned to a Ready state
    • 2023-07-25 15:11: Comms sent to cloud-platform-update that Prometheus was DOWN
    • 2023-07-25 15:20: The team found that node memory usage was spiking to 89% and decided to move to a bigger instance size
    • 2023-07-25 15:21: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1690294889724869
    • 2023-07-25 15:31: Changed the instance size to r6i.4xlarge
    • 2023-07-25 15:50: Prometheus still restarted after running. The team found that the most recent Prometheus pod had been terminated with OOMKilled, and increased the memory limit to 100Gi
    • 2023-07-25 16:18: Updated the Prometheus container limits to 12 CPU cores and 110Gi memory to accommodate Prometheus's resource needs (see the sketch after this timeline)
    • 2023-07-25 16:18: Incident repaired
    • 2023-07-05 16:18: Incident resolved
  • Resolution:

    • Due to the increased number of namespaces and Prometheus rules, the Prometheus server needed more memory. The existing instance size was not enough to keep Prometheus running.
    • Updating the node type to double the CPU and memory, and increasing the container resource limits of the Prometheus server, resolved the issue.
  • Review actions:

    • Add an alert to monitor node memory usage and to flag when a pod is using up most of a node's memory (see the sketch below) #4538
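
    A minimal sketch of what such an alert could look like as a PrometheusRule, using standard node-exporter metrics. The alert name, threshold, and duration are illustrative assumptions and are not taken from #4538:

```yaml
# Sketch only: fire when a node's memory usage stays above 85% for 10 minutes.
# Metric names come from node-exporter; the group/alert names, threshold and
# severity are illustrative and not taken from issue #4538.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-memory-usage
spec:
  groups:
    - name: node-memory
      rules:
        - alert: NodeMemoryUsageHigh
          expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Node {{ $labels.instance }} memory usage is above 85%"
```

    The second part of the action, identifying which pod is using most of a node's memory, would need container-level metrics such as container_memory_working_set_bytes, which are not shown here.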