Skip to main content

Incident on 2023-08-04 - Dropped logging in kibana

  • Key events

    • First detected: 2023-08-04 09:14
    • Incident declared: 2023-08-04 10:09
    • Repaired: 2023-08-10 12:28
    • Resolved 2023-08-10 14:47
  • Time to repair: 33h 14m

  • Time to resolve: 35h 33m

  • Identified: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.

  • Impact: The Cloud Platform lose the application logs for a period of time.

  • Context:

    • 2023-08-04 09:14: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.
    • 2023-08-04 10:03: Cloud Platform team started investigating the issue and restarted the fluebt-bit pods
    • 2023-08-04 10:09: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1691140153374179
    • 2023-08-04 12:03: Identified that the newer version fluent-bit has changes to the chunk drop strategy
    • 2023-08-04 16:00: Team bumped the fluent-bit version to see any improvements
    • 2023-08-07 10:30: Team regrouped and discuss troubleshooting steps
    • 2023-08-07 12:05: Increased the fluent-bit memory buffer
    • 2023-08-08 16:10: Implemented a fix to handle memory buffer overflow
    • 2023-08-09 09:00: Merged the fix and deployed in Live
    • 2023-08-10 11:42: Implemented to handle flush logs into smaller chunks
    • 2023-08-10 12:28: Incident repaired
    • 2023-08-10 14:47: Incident resolved
  • Resolution:

    • Team identified that the latest version of fluent-bit has changes to the chunk drop strategy
    • Implemented a fix to handle memory buffer overflow by writing to the file system and handling flush logs into smaller chunks
  • Review actions:

    • Push notifications from logging clusters to #lower-priority-alerts #4704
    • Add integration test to check that logs are being sent to the logging cluster