Incident on 2021-09-04 - Pingdom check Prometheus Cloud-Platform - Healthcheck is DOWN
Key events
- First detected 2021-09-04 22:05
- Repaired 2021-09-05 12:16
- Incident declared 2021-09-05 12:53
- Resolved 2021-09-05 12:27
Time to repair: 5h 16m
Time to resolve: 5h 27m
Identified: Prometheus Pod restarted several times with error OOMKilled, causing the Prometheus Healthcheck to go down
Impact:
- The monitoring system of the cluster was not available
- All application metrics were lost during that time period
Context:
- Timeline: Timeline for the incident
- Slack thread: Slack thread for the incident.
Resolution:
- Increased the memory limit for the Prometheus container from 25Gi to 50Gi
Review actions:
- Created a ticket to configure Thanos querier to query data for a longer period
- Created a ticket to add an alert that fires when the Prometheus container reaches 90% of its configured resource limit
- Created a ticket to create a Grafana dashboard to display queries that take more than 1 minute to complete
- Increase the memory limit for the Prometheus container to 60Gi (PR #105)
- Test the weekend PagerDuty settings so that the Cloud Platform on-call person receives an alarm immediately on their phone when a high-priority alert is triggered
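
The 90% alert in the review actions comes down to comparing container memory usage against a fraction of the configured limit. The sketch below illustrates only that arithmetic, using the limits mentioned in this report (25Gi, 50Gi, 60Gi). It is not the alert rule the team created; a real alert would most likely be written as a Prometheus alerting rule. The helper names `parse_quantity` and `exceeds_threshold` and the 46Gi usage figure are hypothetical, chosen for illustration.

```python
# Hypothetical helpers for illustration only; not part of the Cloud Platform codebase.

# Binary suffixes used by Kubernetes memory quantities (subset, assumed sufficient here).
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a binary memory quantity such as '50Gi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: treat the value as a plain byte count


def exceeds_threshold(used_bytes: int, limit: str, threshold: float = 0.9) -> bool:
    """Return True when usage is at or above `threshold` of the memory limit."""
    return used_bytes >= threshold * parse_quantity(limit)


if __name__ == "__main__":
    used = parse_quantity("46Gi")  # illustrative usage figure, not taken from the incident
    for limit in ("25Gi", "50Gi", "60Gi"):
        print(limit, exceeds_threshold(used, limit))
```

With the assumed 46Gi of usage, the check would fire for the 25Gi and 50Gi limits but not for 60Gi, which shows how raising the limit also moves the point at which such an alert triggers.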