Major Incident Responses Lack Consistency
DevOps support practices demand both structure and guidance, yet teams in charge of major incident responses are often given neither the structure nor the autonomy to make quick decisions on their own. The result is inconsistent responses.
According to a 2017 DevOps survey from xMatters and Atlassian, most modern companies do not have consistent responses to major incidents.
In fact, 48 percent say their tools, process, and steps vary from incident to incident, while 29 percent say duplicate tickets are created while the incident is being resolved. Finally, 23 percent say tickets are routed without proper assignments and must often be rerouted.
Each of these problems makes it harder for incident response teams to handle one incident the same way they handled the last.
The Root of the Sub-Par Consistency Problem
One of the major factors in this lack of consistency is minimal autonomy. According to the same DevOps survey, 50 percent of respondents wait for the operations center to declare a major incident before responding, while 34 percent say waiting for subject matter experts delays incident resolution.
This is the result of many factors. Many experts, however, believe it stems primarily from the fact that the “fail hard and fail fast” motto associated with DevOps is unacceptable when applied to customer downtime.
The inherent chaos of a major incident may also play a large part, since it leaves little room for experimentation.
How to Create More Consistency During Incident Response
While the inconsistency in the modern incident response model is unacceptable, making a plan to fix it is somewhat more complex.
I believe the best approach is to experiment during drills and to run internal test responses. Over time, this should lead to better, more agile responses during actual major incidents.
Collaboration software can help support this experimentation and troubleshoot internal test responses.
Automation can also play a large role in increasing the consistency of incident response. Workflow-based systems designed to offer full visibility and ongoing feeds can work to provide information needed to make consistent decisions and develop a predictable process for containment and remediation.
In this way, automation stands out as one of the most time-efficient measures for incident management, and it can help thinly stretched teams develop more consistency throughout their incident management processes.
Automation and workflow-based systems also solve one of the major issues outlined in the xMatters-Atlassian Report, which is that tools, processes and steps vary from incident to incident. When automation is adopted, processes become more streamlined and predictable, as do tools and steps.
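To make the idea concrete, here is a minimal sketch of what a workflow-based approach could look like. It is an illustration only, not the API of any real incident management product: the step names, the routing table, and the `IncidentTracker` and `Incident` classes are all hypothetical. It targets the three survey findings above by running every incident through the same ordered steps, returning the existing ticket when a duplicate report arrives, and assigning an owner the moment a ticket is created.

```python
from dataclasses import dataclass, field

# Hypothetical fixed workflow: every major incident runs the same steps in the same order.
WORKFLOW = ("declare", "assign", "contain", "remediate", "review")

# Hypothetical routing table mapping an affected service to the team that owns it.
ROUTING = {"checkout": "payments-oncall", "login": "identity-oncall"}
DEFAULT_OWNER = "operations-center"


@dataclass
class Incident:
    service: str
    summary: str
    owner: str
    completed: list = field(default_factory=list)

    def advance(self) -> str:
        """Complete the next workflow step and return its name."""
        if len(self.completed) == len(WORKFLOW):
            raise RuntimeError("workflow already finished")
        step = WORKFLOW[len(self.completed)]
        self.completed.append(step)
        return step


class IncidentTracker:
    """Creates at most one open ticket per (service, summary) pair."""

    def __init__(self):
        self._open = {}

    def report(self, service: str, summary: str) -> Incident:
        # Normalize the summary so near-identical reports share one key.
        key = (service, summary.strip().lower())
        if key in self._open:
            # Duplicate report: reuse the existing ticket instead of opening a new one.
            return self._open[key]
        # Assign an owner at creation time so the ticket never needs rerouting.
        owner = ROUTING.get(service, DEFAULT_OWNER)
        incident = Incident(service, summary, owner)
        self._open[key] = incident
        return incident
```

In use, two reports of the same problem, such as `tracker.report("checkout", "Payments failing")` followed by `tracker.report("checkout", " payments failing ")`, return the same `Incident` object, and each call to `advance()` moves that incident through the same sequence of steps. A real system would need far more than this (closing tickets, fuzzy matching, escalation), but the principle of a single predictable path is the same.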
Predictive Platforms
It’s also essential for incident response teams to develop predictive methods that declare major incidents earlier, which allows teams to respond faster and access subject matter experts more quickly.
IT alerting platforms and system integrations are both essential for a more predictive system and advanced responses.
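As a rough illustration of declaring incidents earlier, the sketch below uses a simple rule in place of a true predictive model: it flags a major incident as soon as a monitored error rate stays above a threshold for several consecutive samples, rather than waiting for someone to declare it by hand. The `MajorIncidentDetector` class, the 5 percent threshold, and the three-sample window are assumptions chosen for the example, not values taken from the survey or from any alerting platform.

```python
from collections import deque


class MajorIncidentDetector:
    """Flags a major incident after `window` consecutive samples above `threshold`."""

    def __init__(self, threshold: float = 0.05, window: int = 3):
        self.threshold = threshold
        # A bounded deque keeps only the most recent `window` samples.
        self.recent = deque(maxlen=window)

    def observe(self, error_rate: float) -> bool:
        """Record one sample and return True if a major incident should be declared."""
        self.recent.append(error_rate)
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and all(r > self.threshold for r in self.recent)
```

Feeding the detector a stream such as 0.01, 0.08, 0.09, 0.10 returns `True` only on the last sample, once three readings in a row have exceeded the threshold. Requiring a sustained breach rather than a single spike is one way to declare early without paging subject matter experts on every blip.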
More Consistent Incident Response Starts Here
While cultivating consistent incident response can be complicated, simple tactics such as implementing automation, using predictive technology, and promoting experimentation and internal test responses can go a long way. They help restore autonomy, create consistent steps and processes, and promote better incident response experiences for both teams and customers.