Incident on 2023-08-04 - Dropped logging in kibana
Key events
- First detected: 2023-08-04 09:14
- Incident declared: 2023-08-04 10:09
- Repaired: 2023-08-10 12:28
- Resolved 2023-08-10 14:47
Time to repair: 33h 14m
Time to resolve: 35h 33m
Identified: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.
Impact: The Cloud Platform lose the application logs for a period of time.
Context:
- 2023-08-04 09:14: Users reported in #ask-cloud-platform that they are seeing long periods of missing logs in Kibana.
- 2023-08-04 10:03: Cloud Platform team started investigating the issue and restarted the fluebt-bit pods
- 2023-08-04 10:09: Incident declared. https://mojdt.slack.com/archives/C514ETYJX/p1691140153374179
- 2023-08-04 12:03: Identified that the newer version fluent-bit has changes to the chunk drop strategy
- 2023-08-04 16:00: Team bumped the fluent-bit version to see any improvements
- 2023-08-07 10:30: Team regrouped and discuss troubleshooting steps
- 2023-08-07 12:05: Increased the fluent-bit memory buffer
- 2023-08-08 16:10: Implemented a fix to handle memory buffer overflow
- 2023-08-09 09:00: Merged the fix and deployed in Live
- 2023-08-10 11:42: Implemented to handle flush logs into smaller chunks
- 2023-08-10 12:28: Incident repaired
- 2023-08-10 14:47: Incident resolved
Resolution:
- Team identified that the latest version of fluent-bit has changes to the chunk drop strategy
- Implemented a fix to handle memory buffer overflow by writing to the file system and handling flush logs into smaller chunks
Review actions:
- Push notifications from logging clusters to #lower-priority-alerts #4704
- Add integration test to check that logs are being sent to the logging cluster