Incident on 2021-09-04 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN

  • Key events

    • First detected 2021-09-04 22:05
    • Repaired 2021-09-05 12:16
    • Incident declared 2021-09-05 12:53
    • Resolved 2021-09-05 12:27
  • Time to repair: 5h 16m

  • Time to resolve: 5h 27m

  • Identified: The Prometheus pod restarted several times with an OOMKilled error, causing the Prometheus health check to go down

  • Impact:

    • The monitoring system of the cluster was not available
    • All application metrics were lost during that time period
  • Context:

  • Resolution:

    • Increased the memory limit for the Prometheus container from 25Gi to 50Gi
  • Review actions:

    • Created a ticket to configure the Thanos querier to query data over a longer period
    • Created a ticket to add an alert that fires when the Prometheus container reaches 90% of its configured resource limit
    • Created a ticket to create a Grafana dashboard that displays queries taking more than 1 minute to complete
    • Increase the memory limit for the Prometheus container to 60Gi (PR #105)
    • Test the PagerDuty weekend settings so that the Cloud Platform on-call person receives an alarm on their phone immediately when a high-priority alert is triggered
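
The review action about alerting at 90% of the resource limit comes down to simple arithmetic on the container's memory usage and its limit. The following is a minimal sketch of that check, not the alert that was actually implemented. The function names (`parse_quantity`, `usage_ratio`, `is_near_limit`) are hypothetical, the 46 GiB usage figure is illustrative, and only the binary suffixes used in this report (such as `Gi`) are handled.

```python
# Sketch only: helper names are hypothetical and not taken from the incident.
# Handles binary memory suffixes (Ki, Mi, Gi, Ti) as used in Kubernetes limits.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_quantity(quantity: str) -> int:
    """Convert a memory quantity such as '50Gi' into a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat as a plain byte count


def usage_ratio(used_bytes: int, limit: str) -> float:
    """Return memory usage as a fraction of the configured limit."""
    return used_bytes / parse_quantity(limit)


def is_near_limit(used_bytes: int, limit: str, threshold: float = 0.9) -> bool:
    """True when usage has reached the threshold fraction of the limit."""
    return usage_ratio(used_bytes, limit) >= threshold


# Illustrative usage figure (an assumption, not a measured value).
used = 46 * 1024**3
print(is_near_limit(used, "50Gi"))  # True: 46/50 = 0.92
print(is_near_limit(used, "60Gi"))  # False: 46/60 is about 0.77
```

In practice this comparison would more likely live in a Prometheus alerting rule than in application code; the sketch only shows the threshold logic and how raising the limit changes the headroom.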