What Makes a Perfect Incident Management Checklist? We Asked The Experts!
The perfect incident management checklist doesn’t need to be a fantasy. In fact, it shouldn’t be! The perfect incident management checklist should cover a variety of topics, broken down into bite-size sections so that different team members can claim responsibility for their own sections.
We asked our internal and external experts what they think should be included in the perfect incident management checklist, and we were blown away by some of their responses.
Don’t Overlook the Basics
An incident management checklist should always start at square one, and for most users that’s an internet connection. Let’s see what other basics should be included, as highlighted by Jem Bezooyen, Senior Frontend Developer.
- A good headset
- A decent network connection
- Proper authentication allowances so all responders can sign in to major services and dashboards
- Awareness of what to check and where those assets are located (e.g., links to logs, dashboards, and playbooks)
- Accessible scripts or dashboards that can quickly show high-level status
- Playbooks that define what to do in certain circumstances
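As an illustration of the "scripts that can quickly show high-level status" item above, here is a minimal Python sketch. It assumes you already have pass/fail results from your own health checks; the function name `summarize_status` and the service names are hypothetical, not part of any particular tool.

```python
from typing import Dict


def summarize_status(checks: Dict[str, bool]) -> str:
    """Return a one-line, high-level summary of service health.

    `checks` maps a check name to True (passing) or False (failing).
    """
    failing = sorted(name for name, ok in checks.items() if not ok)
    if not failing:
        return f"OK: all {len(checks)} checks passing"
    return f"DEGRADED: {len(failing)}/{len(checks)} failing ({', '.join(failing)})"


if __name__ == "__main__":
    # Hypothetical results; in practice these would come from your monitoring.
    print(summarize_status({"api": True, "database": False, "queue": True}))
```

A responder can run something like this first to see at a glance which areas need attention before opening the detailed dashboards.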
Clear Roles and Responsibilities
In the midst of an urgent incident or crisis, you don’t want to be spending time trying to decide who should be responsible for certain tasks. An incident management checklist that outlines roles and responsibilities from the beginning can be a huge time saver, but what else should be in this section? Well, Curtis St. Pierre, Engineering Team Lead, knows.
- An on-call stakeholder who has the authority to make the decisions needed to reach resolution
- A communication lead to prepare internal and external messaging
- On-call subject matter experts, no matter the incident
Actionable To-Do Items
Once the basics are covered and the right people are in the right seats, it’s time to get to the to-dos. Specific actions may depend on the incident itself, but almost every incident management checklist requires the following actions, outlined by Tim Thompson, Team Lead and Senior UI Developer.
- Assign roles
- Bring in the needed experts and stakeholders
- Communicate the status to the customer
- Ask standard failover questions at set intervals (e.g., 15 minutes in: should we fail over?)
- Summarize the incident after the issue is mitigated
- Schedule a postmortem
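The timed failover question in the list above can be sketched as a small Python helper. Only the 15-minute failover prompt comes from the checklist; the other checkpoints, the `CHECKPOINTS` table, and the `due_prompts` function are hypothetical examples to adapt to your own process.

```python
from datetime import datetime, timedelta
from typing import List, Set, Tuple

# (minutes into the incident, question to ask the responders).
# Only the 15-minute entry is taken from the checklist; the rest are assumptions.
CHECKPOINTS: List[Tuple[int, str]] = [
    (15, "Should we fail over?"),
    (30, "Do we need more experts or stakeholders?"),
    (60, "Is it time to update the customer again?"),
]


def due_prompts(
    started_at: datetime, now: datetime, already_asked: Set[int]
) -> List[Tuple[int, str]]:
    """Return checkpoints whose time has passed and that have not been asked yet."""
    elapsed_minutes = (now - started_at) / timedelta(minutes=1)
    return [
        (minute, question)
        for minute, question in CHECKPOINTS
        if elapsed_minutes >= minute and minute not in already_asked
    ]


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0)
    for minute, question in due_prompts(start, start + timedelta(minutes=20), set()):
        print(f"[{minute} min] {question}")
```

Tracking which checkpoints have already been asked keeps the same question from being repeated every time the helper runs.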
The Specific Specifics
Whether it’s an attachment to your incident management checklist or page two, the during-incident specifics need to be considered. This includes note-taking, root cause identification, and so much more. Take it away, Gokce Yalcin!
- Date and time of the incident is logged
- First responders are logged
- Systems at fault are identified
- The severity of the incident is identified
- The blast radius of the incident is identified
- The resolution date and time are estimated
- Communication rollout is planned
- Determine whether the incident is a repeat; if so, refer to previous incidents
- Determine whether multiple reports were filed for the same incident; if so, group them
- Capture a current snapshot of health check systems and relevant monitoring and logging systems
- Identify possible playbooks that can revert systems
- For code and deployment issues, decide whether to roll back or push forward, log the decision, and reference the fix commits or CI/CD builds
- Timeline of events starting from the discovery of incident to resolution
- The postmortem is planned with a set date and time, and its meeting minutes and action items are attached to the incident so preventative steps can be taken in the future
- Update the service track record with the attached incident and reset the days-without-incident counter to zero
- Update stakeholders as per communication rollout plan
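Several of the items above (logging the date and time, first responders, systems at fault, severity, a timeline of events, and the days-without-incident counter) can be captured in one record. The following Python sketch shows one possible shape; the `IncidentRecord` class, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass
class IncidentRecord:
    detected_at: datetime
    first_responders: List[str]
    systems_at_fault: List[str]
    severity: str
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)
    resolved_at: Optional[datetime] = None

    def log(self, when: datetime, note: str) -> None:
        """Add a timeline entry and keep the timeline in chronological order."""
        self.timeline.append((when, note))
        self.timeline.sort(key=lambda entry: entry[0])

    def resolve(self, when: datetime) -> None:
        """Mark the incident resolved and record it on the timeline."""
        self.resolved_at = when
        self.log(when, "Incident resolved")

    def days_without_incident(self, today: datetime) -> int:
        """Whole days since resolution; zero while the incident is still open."""
        if self.resolved_at is None:
            return 0
        return max((today - self.resolved_at).days, 0)


if __name__ == "__main__":
    incident = IncidentRecord(datetime(2024, 1, 1, 9, 0), ["alice"], ["api"], "SEV2")
    incident.log(datetime(2024, 1, 1, 9, 10), "On-call expert paged")
    incident.resolve(datetime(2024, 1, 1, 10, 0))
    for when, note in incident.timeline:
        print(when.isoformat(), note)
```

Keeping the timeline sorted as entries arrive means notes added out of order during a busy incident still read correctly in the postmortem.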
Incident management checklists are always a work in progress. As organizations experience more incidents of varying severity, teams learn what should and shouldn’t be part of the response process. One thing that should always be part of the process, though, is a service reliability platform that can help you automate your response, integrate your tool stack, and accelerate your entire incident management process. And we think we know the tool for you!