Unify Your Incident Management Process With the Fundamentals
This is the first in a series of three articles about defining your incident management process.
In a perfect world, technology stays on and runs flawlessly. But we all know this isn’t the case. Like any organization, xMatters sometimes experiences unplanned incidents. We can't always prevent them, but we can control how we respond. To resolve incidents quickly, it’s important to coordinate an organized response.
Your incident management process needs to evolve
xMatters decided long ago that a well-developed incident management process was critical to support our customers. Over the years our architecture has evolved from on-premises to hosted data centers to the cloud. This rapid evolution meant that our incident processes and workflows needed to evolve as well.
About eight months ago, I started reviewing our incident management process to align it with our architecture. The exercise was essentially to build on what we already had, integrate new concepts, and make any necessary adjustments. Here, I’ll share the steps we took and the lessons we learned while developing the latest version of our incident management workflow.
Invest time up front
I found it challenging to know where to start – incident management isn’t something that can be done in a single meeting. Preparing an incident management plan takes thought, willing partners, collaboration, and the time to build it properly. Fortunately, senior leadership at xMatters understands the importance of a thoughtful, cross-functional plan. Each organization has unique requirements, so it’s important to revisit the fundamentals before getting started. I began by reviewing goals, stakeholders, structure, severity classification, and tools. This high-level audit gave me a clear picture of the work needed to update the plan.
Fundamental 1 – Define Your Goals
It may seem obvious, but defining what you want out of your incident management process is the first step:
- Speed – Minimize customer impact by quickly identifying potential problems, bottlenecks, and opportunities to improve response times.
- Efficiency – Consider whether there’s anything in the process that prevents the team from hitting the ground running. Does something have to be added or removed from the process?
- Ease – The last thing anyone wants at 3am is a convoluted process, so review past incidents to determine what information is commonly needed and which playbooks, documentation, or references are used, then make them as easy to reach as possible. (Our aim was to free technical teams from the toil of the incident management process so they could focus on resolving the incident itself. For example, we looked at how to automate notifying subject matter experts (SMEs) and providing stakeholder updates, how to minimize the documentation burden, and how to ensure relevant playbooks were clear and easily accessible.)
- Clarity – Define roles, playbooks, and timelines. The less a team has to organize on the fly, the more likely they are to succeed.
Fundamental 2 – Identify Stakeholders
We looked at the various groups that play a role in incident management and resolution, and polled each one on what was working well in the process and what could improve.
- Incident Initiators – Identify what knowledge and tools teams that initiate an incident need. What have been the challenges in the past?
- Incident Lead – What does the team that organizes the incident need to know? Do they feel comfortable in this role? Are there any blockers?
- Subject Matter Experts – Is there any special information that SMEs need when they become involved in an incident? What environment is best for them?
- Leaders and Third Parties – When do organizational leaders need to be informed and what type of information do they need? Who should be part of these groups? How do we keep the organization informed without impacting the resolution process?
Fundamental 3 – Determine Incident Structure
We use concepts from the Incident Command System (ICS) to organize an incident. We quickly found that we didn’t need to adopt all of ICS; in practice, we took the components that applied to our goals and used them as a guide for our incident management process.
- Roles – ICS helped us define the roles our incident management needed: who could take on the incident commander role, who was best to scribe, and whether we needed to incorporate other functions into the plan.
- Training – This review helped us identify training opportunities for our incident teams.
- Post-Mortem – We also determined which details of an incident were best to capture to complete an effective post-mortem and root cause analysis.
Fundamental 4 – Clarify Severity Terminology
- Structure – Part of building incident management was to clearly define the criteria for each incident type. We reviewed our existing severity structure and adapted it to take into account SLAs, scope of response, and our customers’ needs.
- Classification – It’s important to classify severity early in the incident management planning process, especially when incidents impact customers, because it guides the design of the broader plan.
- Teams – With the severity levels defined, we could match each severity to the required response. It doesn’t make sense to gather a full team if the incident can be managed by a smaller, targeted team with the right skill set. Large teams can be noisy and reduce efficiency.
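As an illustration only, matching severity to the required response could be sketched in Python along these lines. The severity names, the two classification criteria, the role lists, and the update intervals are all hypothetical placeholders, not the actual xMatters severity structure.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional, Tuple


class Severity(IntEnum):
    """Hypothetical severity levels; a lower number means more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePlan:
    """Who to gather and how often to update stakeholders (assumed fields)."""
    roles: Tuple[str, ...]
    notify_leadership: bool
    update_interval_minutes: Optional[int]  # None means no scheduled updates


# Assumed mapping: the more severe the incident, the larger the response.
RESPONSE_PLANS = {
    Severity.SEV1: ResponsePlan(
        ("incident commander", "scribe", "subject matter experts"), True, 30
    ),
    Severity.SEV2: ResponsePlan(
        ("incident commander", "subject matter experts"), True, 60
    ),
    Severity.SEV3: ResponsePlan(("subject matter experts",), False, None),
}


def classify(customer_impacting: bool, sla_at_risk: bool) -> Severity:
    """Classify an incident using two illustrative criteria."""
    if customer_impacting and sla_at_risk:
        return Severity.SEV1
    if customer_impacting or sla_at_risk:
        return Severity.SEV2
    return Severity.SEV3


def plan_for(customer_impacting: bool, sla_at_risk: bool) -> ResponsePlan:
    """Look up the response plan that matches the classified severity."""
    return RESPONSE_PLANS[classify(customer_impacting, sla_at_risk)]


if __name__ == "__main__":
    plan = plan_for(customer_impacting=False, sla_at_risk=False)
    print(plan.roles)
```

Keeping the criteria and the response plans in one small table makes it easy to review them with stakeholders and adjust them as SLAs or customer needs change.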
Fundamental 5 – Identify Tools
- Tools – xMatters, Slack, Jira, Zendesk, and our monitoring tools are all part of our incident process. The purpose of the review was to understand how each tool fit into an incident and to see whether there was an opportunity to improve its use. For example, xMatters is central to connecting our tools and notifying our teams. We reviewed our use of xMatters and identified areas where we could improve.
- Process – Every tool went through this review before we reworked our plan, allowing us to see how teams used each tool and to identify areas for change, improvement, or workflow automation.
Once armed with what we learned from these fundamentals, the real fun began. We’ll save that story for the next post.
Workflows from the real world
Please check out workflows we use for our customers. We’ll be adding more going forward.