Skip to main content

Incident on 2024-07-25 Elasticsearch no longer receiving logs

  • Key events

    • First detected: 2024-07-25 12:10
    • Incident declared: 2024-07-25 14:54
    • Repaired declared: 2024-07-25 15:18
    • Resolved 2024-07-25 16:19
  • Time to repair: 3h 8m

  • Time to resolve: 4h 9m

  • Identified: User reported that Elasticsearch was no longer receiving logs

  • Impact: Elasticsearch and Opensearch did not recieve logs, this meant that we lost users logs for the period of the incident. These logs have not been recovered.

  • Context:

    • 2024-07-25 12:10: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 12:30: cp-live-app-logs - ClusterIndexWritesBlocked recovers
    • 2024-07-25 12:50: cp-live-app-logs - ClusterIndexWritesBlocked recovers
    • 2024-07-25 12:35: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 12:55: cp-live-app-logs - ClusterIndexWritesBlocked starts
    • 2024-07-25 13:15: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
    • 2024-07-25 13:40: cp-live-app-logs - ClusterIndexWritesBlocked recovers and starts
    • 2024-07-25 13:45: Kibana no longer receiving any logs
    • 2024-07-25 14:27: User notifies team via #ask-cloud-platform that Kibana has not been receiving logs since 13:45.
    • 2024-07-25 14:32: Initial investigation shows no problems in live monitoring namespace
    • 2024-07-25 14:42: Google meet call started to triage
    • 2024-07-25 14:54: Incident declared
    • 2024-07-25 14:55: Logs from fluent-bit containers show “could not enqueue into the ring buffer”
    • 2024-07-25 14:59: rollout restart of all fluent-bit containers, logs partially start flowing but after a few minutes show the same error message
    • 2024-07-25 15:18: It is noted that Opensearch is out of disk space, this is increased from 8000 to 12000
    • 2024-07-25 15:58: Disk space increase is complete and we start seeing fluent-bit processing logs
    • 2024-07-25 16:15: Remediation tasks are defined and started to action
    • 2024-07-25 16:19: Incident declared resolved
  • Resolution:

    • Opensearch disk space is increased from 8000 to 12000
    • Fluentbit is configured to not log to Opensearch as a temporary measure whilst follow-up investigation work into root cause is carried out.
  • Review actions: