6 Signs Your Incident Response Steps Are Working
Although IT incidents have always been a concern, the growth of customer-facing technology adds the cost of a bad customer experience to the cost of responding to and remediating an incident. In a perfect world, you’d be able to prevent incidents from happening in the first place. In reality, they do happen, and more often than most of us would like to admit.
A more realistic goal is to evaluate and improve your incident response steps, so that when an incident does happen it can be resolved quickly and effectively. This includes determining what measures allow you to assess the effectiveness of your incident response processes and pinpointing the key performance indicators (KPIs) of those measures. You need to understand what you’re doing today and make improvements for a better incident response tomorrow.
6 Ways to Evaluate Your Incident Response Steps
Incident response is a complex process involving multiple areas of an organization. Incident response steps need to address development and testing process flaws, as well as configuration, database, and network issues. Customer service and investor relations teams may also need to handle concerns that negatively affect customers.
To completely understand if your incident response steps are working, you need to consider several aspects of your plan. Some steps indicate how the overall system is working, while others enable you to thoroughly explore your incident monitoring, communications, and development processes.
1. Incidents Over Time
This might seem like a given, but tracking the number of incidents over time is the most fundamental way to evaluate your incident response processes. Are incidents occurring more frequently or less? Is the number of incidents acceptable, or can it be lower?
If this number begins to trend upward, teams should investigate why and act proactively to resolve the underlying issue. You can make this measure more impactful by classifying your incidents with well-defined severity levels. Are there more incidents with low impact, or are there fewer incidents with high impact? Understanding incident severity levels enables teams to identify and prioritize issues for faster resolution.
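As a rough illustration of tracking incidents over time, the sketch below counts incidents per month and per severity level. The incident log, the dates, and the "low"/"high" severity labels are all made up for the example; a real log would come from your ticketing or alerting system.

```python
from collections import Counter
from datetime import date

# Hypothetical incident log: (date opened, severity label).
INCIDENTS = [
    (date(2024, 1, 5), "low"),
    (date(2024, 1, 19), "high"),
    (date(2024, 2, 2), "low"),
    (date(2024, 2, 11), "low"),
    (date(2024, 2, 27), "high"),
    (date(2024, 3, 8), "low"),
]


def monthly_totals(incidents):
    """Return {"YYYY-MM": incident count}, sorted by month."""
    counts = Counter(opened.strftime("%Y-%m") for opened, _ in incidents)
    return dict(sorted(counts.items()))


def monthly_by_severity(incidents):
    """Return a Counter keyed by ("YYYY-MM", severity)."""
    return Counter(
        (opened.strftime("%Y-%m"), severity) for opened, severity in incidents
    )


for month, total in monthly_totals(INCIDENTS).items():
    print(month, total)
```

Comparing the monthly totals shows whether the trend is rising, and the per-severity counts show whether the mix is shifting toward low-impact or high-impact incidents.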
2. Calculate Your Incident Response Metrics
Several metrics focus on incident response effectiveness, including mean time to detect (MTTD) and mean time to acknowledge (MTTA). These measures enable you to gauge the effectiveness of the systems and processes that employees use to respond to an incident.
Representing how much time passes between the start of an incident and its detection, MTTD allows you to track how long it takes to identify incidents. MTTA represents the time that passes between a system alert and the actual incident response. Together, MTTD and MTTA help you evaluate how quickly and efficiently you recognize problems and begin to act.
Mean time to resolution (MTTR) is also a vital measurement of how long it takes to resolve an incident. As used here, MTTR covers the total time that the system remains compromised, including the time it took to detect and acknowledge the incident. MTTR impacts your bottom line because, in addition to the cost of resources to fix the problem, it affects customers trying to access the system.
While MTTR tells you the average time it takes to recover, the mean time between failures (MTBF) helps you understand how often these failures happen. MTBF works hand-in-hand with MTTR to help you understand the impact of incidents. For example, incidents may happen frequently but be easy to fix, or the reverse could be true.
MTTD, MTTA, MTTR, and MTBF provide you with overall measures of how well you respond to incidents and get your capabilities back on track. However, other measures provide you with additional information on how well your processes are performing and how to prevent incidents in the first place.
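The four metrics above can be sketched as simple averages over incident timestamps. This is a minimal example, not a standard implementation: the `Incident` record and its four fields are assumptions, MTTR is measured from the start of the incident (matching the "total time compromised" description), and MTBF is taken here as the average gap between one incident's resolution and the next incident's start, which is one common convention.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Incident(NamedTuple):
    """Hypothetical incident record; field names are illustrative."""
    started: datetime       # when the fault actually began
    detected: datetime      # when monitoring raised an alert
    acknowledged: datetime  # when a responder picked it up
    resolved: datetime      # when service was restored


def mean_delta(deltas):
    """Average a non-empty sequence of timedelta values."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    return mean_delta(i.detected - i.started for i in incidents)


def mtta(incidents):
    return mean_delta(i.acknowledged - i.detected for i in incidents)


def mttr(incidents):
    return mean_delta(i.resolved - i.started for i in incidents)


def mtbf(incidents):
    """Average uptime between incidents; needs at least two incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return mean_delta(
        nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])
    )


INCIDENTS = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10),
             datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 20),
             datetime(2024, 1, 3, 10, 35), datetime(2024, 1, 3, 12, 0)),
]

print("MTTD:", mttd(INCIDENTS))
print("MTTA:", mtta(INCIDENTS))
print("MTTR:", mttr(INCIDENTS))
print("MTBF:", mtbf(INCIDENTS))
```

With this sample data, a short MTTR combined with a short MTBF would point to frequent but easily fixed incidents, while the opposite pattern would point to rare but costly ones.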
3. Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
While service level agreements (SLAs) are relevant to the organization as a whole and its customers, service level objectives (SLOs) and service level indicators (SLIs) are what concern engineering teams. These are the measures that teams have to meet and track to help ensure that the SLAs are not being violated.
The importance of SLOs and SLIs carries into the postmortem process too. At that stage, teams should review their SLOs and SLIs and determine whether any need to be adjusted or whether new ones need to be added. If incidents with the same types of causes keep happening and your SLIs aren’t catching them first, then there is likely a disconnect between your SLOs and customer expectations.
Keeping a close eye on the status of your SLOs and SLIs is vital not only for legal reasons and customer satisfaction but also for understanding whether your team is monitoring the right things and can respond to issues before customer-impacting incidents occur.
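To make the relationship concrete, here is a small sketch that treats an SLI as a measured value and an SLO as the target it is compared against. The choice of request availability as the SLI, the function names, and the sample numbers are assumptions for illustration only; your own indicators and targets may look quite different.

```python
def availability_sli(good_requests, total_requests):
    """SLI: fraction of requests served successfully in a window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical measurement window: 100,000 requests, 50 of them failed.
sli = availability_sli(good_requests=99_950, total_requests=100_000)

print("SLI:", sli)
print("Meets 99.9% SLO: ", meets_slo(sli, 0.999))
print("Meets 99.99% SLO:", meets_slo(sli, 0.9999))
```

If customers report problems while a check like this keeps passing, that suggests the indicator or the target may not reflect what customers actually experience.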
4. Appraise the Cost Per Incident
The cost per incident (CPI) includes several factors, some of which are easier to calculate than others. It’s relatively easy to calculate the cost of overtime wages for employees working to recover from an incident. Calculating the cost of losing the service is more challenging, as this loss takes on a few forms.
For example, customers may lose confidence in vendors with faulty systems. This loss of confidence could result in lost sales or lost customers. And, if customers—or former customers—take their dissatisfaction to social media or incident details are publicized, this negative publicity could prevent future customers from considering the vendor.
Opportunity cost is also part of the cost per incident. Incident response takes time away from the primary mission of the personnel responding to the incident. When developers spend time fixing issues, they aren’t spending time building additional products and features. There’s an opportunity cost in releasing services at a slower rate as teams spend their time responding to incidents.
Evaluating the total cost of an incident or the average value of each incident ticket is vital for understanding how your incident response strategy affects the business’s bottom line.
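One way to sketch this calculation is to add up the cost components discussed above for each incident and then average them. The `IncidentCost` record, its field names, and every figure below are hypothetical; in practice, lost revenue and opportunity cost are estimates rather than exact values.

```python
from typing import NamedTuple


class IncidentCost(NamedTuple):
    """Hypothetical cost inputs for a single incident."""
    overtime_hours: int    # responder overtime spent on recovery
    hourly_rate: int       # overtime wage per hour
    lost_revenue: int      # estimated sales lost during the outage
    opportunity_cost: int  # estimated value of delayed feature work


def total_cost(cost):
    """Sum wages, lost revenue, and opportunity cost for one incident."""
    return (
        cost.overtime_hours * cost.hourly_rate
        + cost.lost_revenue
        + cost.opportunity_cost
    )


def average_cost_per_incident(costs):
    """Mean total cost across a non-empty list of incidents."""
    return sum(total_cost(c) for c in costs) / len(costs)


COSTS = [
    IncidentCost(overtime_hours=10, hourly_rate=80,
                 lost_revenue=5000, opportunity_cost=2000),
    IncidentCost(overtime_hours=4, hourly_rate=80,
                 lost_revenue=0, opportunity_cost=500),
]

for c in COSTS:
    print("Incident cost:", total_cost(c))
print("Average cost per incident:", average_cost_per_incident(COSTS))
```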
5. Inspect the Escalation Rate
Escalation rates track how often responders need to escalate the issue to higher-level team members. Excessive escalation slows incident response time. It also calls into question the incident response steps. Are response teams triaging incidents correctly? Or are people or systems incorrectly identifying or communicating incident characteristics, prompting the wrong person to respond?
You need to pay attention to escalation rates because each escalation slows the response process and adds extra cost. You may need to take action like assigning team members with more appropriate skills to particular incidents.
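As a minimal sketch, the escalation rate can be computed as the share of incidents that were escalated at least once. The per-incident escalation counts below are invented for the example, and other definitions (such as average escalations per incident) are equally reasonable.

```python
# Hypothetical data: number of escalations recorded for each incident.
ESCALATIONS_PER_INCIDENT = [0, 2, 0, 1]


def escalation_rate(escalation_counts):
    """Fraction of incidents that were escalated at least once."""
    if not escalation_counts:
        raise ValueError("need at least one incident")
    escalated = sum(1 for n in escalation_counts if n > 0)
    return escalated / len(escalation_counts)


def mean_escalations(escalation_counts):
    """Average number of escalations per incident."""
    if not escalation_counts:
        raise ValueError("need at least one incident")
    return sum(escalation_counts) / len(escalation_counts)


print("Escalation rate:", escalation_rate(ESCALATIONS_PER_INCIDENT))
print("Mean escalations:", mean_escalations(ESCALATIONS_PER_INCIDENT))
```

Tracking both numbers helps separate "many incidents need one handoff" from "a few incidents bounce between several responders".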
6. Examine Incident Recurrence
Finally, you should examine the recurring incidents metric to evaluate your incident response effectiveness. Effective incident response shouldn’t resemble a game of Whack-a-Mole. You want to analyze the incident’s root cause and prevent it from happening again.
An effective postmortem identifies the root causes of incidents and provides ways to avoid them in the future. The postmortem team needs incident details to conduct deep analysis. Postmortems should focus on improving processes rather than finger-pointing.
Efficient incident response minimizes the number of recurring incidents. If similar incidents continue to recur, it’s usually a strong indicator of ineffective repair and recovery or a lack of postmortem follow-up to address the incident’s root cause.
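A simple way to approximate a recurring incidents metric is to group incidents by the root cause recorded in their postmortems and count the repeats. The root-cause labels below are hypothetical, and the sketch assumes postmortems tag causes consistently enough to be compared.

```python
from collections import Counter

# Hypothetical root-cause labels taken from postmortems, in time order.
ROOT_CAUSES = [
    "expired-cert",
    "db-pool-exhausted",
    "expired-cert",
    "bad-deploy",
    "expired-cert",
]


def recurring_causes(root_causes):
    """Return {cause: count} for causes seen more than once."""
    counts = Counter(root_causes)
    return {cause: n for cause, n in counts.items() if n > 1}


def recurrence_rate(root_causes):
    """Fraction of incidents whose root cause had already occurred."""
    if not root_causes:
        raise ValueError("need at least one incident")
    repeats = len(root_causes) - len(set(root_causes))
    return repeats / len(root_causes)


print("Recurring causes:", recurring_causes(ROOT_CAUSES))
print("Recurrence rate:", recurrence_rate(ROOT_CAUSES))
```

A cause that keeps reappearing in this list is a natural candidate for a follow-up on whether the postmortem's action items were actually completed.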
How to Improve Your Incident Response Steps
Incident response metrics provide essential tools to monitor the effectiveness of your incident response processes. These six measures provide you with multiple views of these practices so you can identify areas to improve.
But to improve your incident response steps, you need tools that meet your needs. Some tools automate metrics calculation, while automated incident response tools minimize the response time. Also, you need tools that provide information, such as chat summaries, to improve your postmortems and mitigate similar incidents in the future.
Service reliability platforms like xMatters help DevOps, service reliability engineers (SREs), and operations teams automate workflows, ensure infrastructure and applications are always working, and rapidly deliver products at scale. Explore xMatters to learn how to automate your incident response steps to help achieve a superior customer experience.