Skip to main content

Incident on 2020-02-18

  • Key events

    • Fault occurs 2020-02-18 14:13
    • Incident declared 2020-02-18 14:23
    • Resolved 2020-02-18 14:59
  • Time to repair: 0h 36m

  • Time to resolve: 0h 46m

  • Identified: Pingdom reported that Prometheus was down (prometheus.cloud-platform.service.justice.gov.uk).

  • Impact:

    • The prometheus dashboard was unavailable for everyone, for the whole duration of the incident.
    • Between 2020-02-18 14:22 and 2020-02-18 14:26, prometheus could not receive metrics.
  • Context:

    • Although the Prometheus URL was unreachable, Grafana and Alertmanager were resolving.
    • There seemed to be an issue preventing requests to reach the prometheus pods.
    • Disk space and other resources, the usual suspects, were ruled out as the cause.
    • The domain name amd ingress were both valid.
    • Slack thread:
  • Resolution: We suspect an intermittent & external networking issue to be the cause of this outage.