This document describes our incident handling process.
1. Confirm that the event constitutes an incident
If this is a security breach, also report it here
We define an incident as an event which:
- is unplanned, and
- impacts end users, or
- degrades user-facing services, or
- increases risk to production services
“Users” includes end-users of services (citizens or members of internal user groups such as prison officers), as well as users of the platform - i.e. members of service teams who maintain or depend on services we host/maintain.
An example of increased risk might be when one or more members of a high-availability set of components stops working. e.g. if one of our three master nodes in live-1 stopped working, it would not have any visible effect to end users but the cluster would be at increased risk, because we would no longer have a highly-available cluster.
If this event does not constitute an incident, the appropriate response is probably to raise a ticket to fix whatever needs fixing.
Once you are confident that you have an incident, declare it as such.
2. Declare the incident
Post a message into the
#cloud-platform slack channel, following this format:
I am declaring an incident because [reason]. The impact is that [impact].
This declaration, by any member of the cloud platform team, marks the official start of the incident.
It is important to use this exact form of words so that we can use scripts to pull useful data out of the channel logs, later on.
This message should be used as the start of a slack thread which will form a key source of information during and after the incident (the incident thread).
At your discretion, you may also wish to notify users now via
#cloud-platform-update, however you may prefer to leave this until later in the process, when more information is available.
I am declaring an incident because one of our master nodes is in a failed state. The impact is that the cluster is at increased risk of failure. I am declaring an incident because Prometheus is down. The impact is that service teams are not getting any alerts or metrics data for their services. I am declaring an incident because live-1 is down. The impact is that all production and non-production services are down.
3. Assign roles
The two roles which must be filled for every incident are the Incident Lead and the Scribe.
In rare cases, the same person might fill both roles, but this is discouraged because it generally leads to poor record keeping.
To fill these roles, ask for volunteers from the team, either verbally or via #cloud-platform. In the unlikely event that you don’t get any volunteers, appoint someone.
3.1 Incident Lead
- coordinate our response to the incident
- decide on any additional roles required (e.g. a communications lead may be required)
- ensure that all required roles are filled
- ensure that all tasks which need to be handled are being done
- make the final decision whenever we need to choose a course of action
- set the schedule for any regular team check-ins, if those are deemed necessary
- declare the incident closed, when appropriate
- ensure that the post-incident process is followed
NB: The incident lead needs to ensure that things are being done, not try to do everything themselves
Once appointed, the incident lead should post this message on the incident slack thread:
I am the incident lead
Again, if you use these words, it will make it easier to scrape data from the thread later.
The scribe is responsible for keeping a log of the incident, including:
- important events
- discussion topics
- results of actions/investigations
NB: this log is not intended to be a verbatim transcript of discussions. Rather, things like “xxx suggested the disk might be full. yyy to investigate and report back”
Once appointed, the scribe should post this message on the incident slack thread:
I am the scribe
The form of the log is at the scribe’s discretion, provided key events are timestamped, and that it can easily be handed off to another member of the team if they take over as scribe.
3.3 Communications Lead
Depending on the impact/duration/panic-level of an incident, it may be desirable to appoint a communications lead.
- communicate the incident out to those who are impacted
- decide what, how and how often to issue updates
- give updates at regular intervals
- field enquiries so the team can focus on fixing the incident without interruptions
- disseminate a post-incident report
People to update:
- Users who are affected by the problem - via #ask-cloud-platform, #cloud-platform-update and the affected team’s own slack channels.
- Team members for awareness or because they might be able to help - via #cloud-platform
- People in the team who manage communication with senior leadership in MoJ - Steve, Karen, Tony.
In the case of high-priority user-impacting incidents there is a need to keep the MoJ Incident Management Team aware. This is done by posting updates in the private #p1s slack channel (only Steve and Tony can do this), and via email to MoJdtincidentmanagement@justice.gov.uk
It may be necessary to transfer roles from one team member to another, e.g. during long-running incidents. In this case, it is the responsibility of whoever is in a role to ensure that someone else takes it over.
Whoever assumes a role should announce it in the incident slack thread, so that the team is aware.
4. Fix the problem
Please bear in mind that not every incident requires the whole team to be involved (even if they all want to join in).
5. End the incident
When work on the problem has ceased, either because the problem has been resolved or because resolution is blocked until a later date, the incident lead should post a message to the incident thread as follows:
I am declaring this incident resolved because [resolution / plan]
This marks the official end of the incident. If users were notified of the incident, an appropriate message should go out via the same channels to tell them it’s over.
6. Post-incident procedure
After the incident is resolved: