Unify Your Incident Management Process With the Fundamentals
In a perfect world, technology stays on and runs flawlessly. But we all know this isn’t the case. Like any organization, xMatters sometimes experiences unplanned incidents. What we can control is how we respond to them. To resolve incidents quickly, it’s important to coordinate an organized response.
Your Incident Management Process Needs to Evolve
xMatters decided long ago that a well-developed incident management process was critical to supporting customers. Over the years, our architecture has evolved from on-premises to hosted data centers to the cloud. This rapid evolution meant our incident processes and workflows needed to evolve as well.
To streamline our business, we started reviewing our incident management process to align it with our architecture. The exercise was designed to build on what we already had, integrate new concepts, and make any necessary adjustments. Here are the steps we took and the lessons we learned in developing the latest version of our incident management workflow.
Invest Time Upfront
Initially, we found it challenging to know where to start. Incident management isn’t something that can be done in a single meeting. Preparing an incident management plan requires thought, willing partners, collaboration, and the time to build it properly. Fortunately, xMatters understands the importance of a thoughtful, cross-functional plan. Each organization has unique requirements, so it’s important to revisit the fundamentals before getting started. We began by reviewing goals, stakeholders, structure, severity classification, and tools. This high-level audit gave us a clear picture of the work needed to update our incident management plan.
Fundamental 1 – Define Your Goals
It may seem obvious, but defining what you want out of your incident management process is the first step:
- Speed – Minimize customer impact by quickly identifying potential problems, bottlenecks, and opportunities to improve response times.
- Efficiency – Consider whether there’s anything in the process preventing the team from hitting the ground running. Does something have to be added or removed from the process?
- Ease – The last thing anyone wants is a convoluted process. Review past incidents to determine what information is commonly needed and what playbooks, documentation, or references are used to help make the process as easy as possible.
- Clarity – Define roles, playbooks, and timelines. The less a team has to organize on the fly, the more likely they are to succeed.
Fundamental 2 – Identify Stakeholders
We looked at the various groups that play a role in incident management and resolution. We polled each group for their thoughts on what worked well in the process and what could improve.
- Incident Initiators – Identify what knowledge and tools teams initiating an incident need. What challenges have they faced in the past?
- Incident Lead – What does the team organizing the incident need to know? Do they feel comfortable in this role? Are there any blockers?
- Subject Matter Experts – Is there any specific information SMEs need when they become involved in an incident? What environment is best for them?
- Leaders and Third Parties – When do organizational leaders need to be informed? What type of information do they need? Who should be part of these groups? How do we keep the organization informed without impacting the resolution process?
Fundamental 3 – Determine Incident Structure
We use Incident Command System (ICS) concepts to organize an incident. We quickly found we didn’t need to adopt all of the ICS to be successful. Instead, we took the components that applied to our goals and used them as a guide for our incident management process.
- Roles – The ICS concepts helped us define the roles needed for our incident management process, determine who could take them on, and decide whether we needed to incorporate any other functions into the plan.
- Training – This review helped us identify training opportunities for our incident teams.
- Post-Mortem – We also determined which details of an incident were best to capture to complete an effective post-mortem and root cause analysis.
Fundamental 4 – Clarify Severity Terminology
- Structure – Part of building our incident management process was clearly defining the criteria for each incident type. We reviewed our existing severity structure and adapted it to take SLAs, scope of response, and our customers’ needs into account.
- Classification – It’s important to classify severity early in the incident management planning process, especially when incidents impact customers, because it guides the design of the broader plan.
- Teams – Match severity to the required response. It may make more sense to gather a small team with the needed skill set than to incorporate every team member. Sometimes large teams can be noisy and can impact efficiency.
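To make the idea of matching severity to the required response concrete, here is a minimal sketch in Python. It is illustrative only and does not reflect the actual xMatters severity structure: the severity levels, the classification criteria (customer impact and SLA risk), the role names, and the functions classify and plan_for are all hypothetical assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; a lower number means a more severe incident.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass(frozen=True)
class ResponsePlan:
    roles: tuple          # roles to engage for this severity
    notify_leadership: bool


# Hypothetical mapping from severity to the smallest team that can respond.
RESPONSE_PLANS = {
    Severity.SEV1: ResponsePlan(("incident_lead", "sme", "communications"), True),
    Severity.SEV2: ResponsePlan(("incident_lead", "sme"), True),
    Severity.SEV3: ResponsePlan(("sme",), False),
}


def classify(customer_impacting: bool, sla_at_risk: bool) -> Severity:
    """Assign a severity from two assumed criteria."""
    if customer_impacting and sla_at_risk:
        return Severity.SEV1
    if customer_impacting or sla_at_risk:
        return Severity.SEV2
    return Severity.SEV3


def plan_for(customer_impacting: bool, sla_at_risk: bool) -> ResponsePlan:
    """Look up the response plan that matches the classified severity."""
    return RESPONSE_PLANS[classify(customer_impacting, sla_at_risk)]
```

Keeping the criteria and the response plans in one place means the team has less to decide on the fly: once an incident is classified, the roles to engage follow directly from the table.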
Fundamental 5 – Identify Tools
- Tools – xMatters, Slack, Jira, Zendesk, and our monitoring tools are all part of our incident process. The intention of the review was to understand how each tool fits into an incident and see whether there was an opportunity to improve its use. For example, xMatters is central to connecting our tools and notifying our teams. We reviewed our use of xMatters and identified areas where we could improve.
- Process – All tools went through this process before we reworked our plan, allowing us to see how teams used tools and identify areas for change, improvement, or workflow automation.
Workflows From the Real World
Please check out the workflows we use for our customers.