Incident Management Needs Its Own Transformation
We live, learn and work in a technology-focused world where the speed and quality of the digital customer experience determine business success. No “aha!” moments there.
Before COVID-19, we saw companies pursuing digital transformation at varying velocities according to market demands, business resources and internal strategy. The pandemic has been an accelerant igniting the launch of new digital services we now rely on — and vaporizing underperformers.
For some companies, digital transformation went from whiteboard to production in weeks. In the rush to deliver digital customer experiences, companies continue to invest heavily to push applications and services faster. Often the race to build, change and deploy features adds new customer-impacting issues faster than they can be fixed. Our customers tell us that this pace is unsustainable.
Do we deploy more features or do we fix the issues on the services we already have? That tension is where digital service resilience comes in: the ability to satisfy service performance expectations and delight customers with experiences that keep them coming back for more.
DevOps and SRE teams responsible for maintaining digital services are forced to juggle expectations for constant innovation with demands to quickly address and resolve service degradations. In our latest survey, we found that over 70% of technology teams spend as much time managing incidents and fixing issues as they do on innovating. More than 25% of this group spends more than 80% of their time on incidents. Woof!
And there lies the paradox: we’ve accelerated the deployment of our services, but we’re spending too much time fixing them, leaving less time for innovation. We either need to create more time in the day or we need a new approach. Hey, we might be onto something here!
A New Recipe for Incident Management
ITIL and IT service management emerged from a different set of assumptions back when waterfall was cool. Over the last 20 years, innovation in incident management has amounted to little more than moving the help desk application from on-premises to software as a service. We can do better, but it takes a different approach.
Our teams went to work on this problem years ago. We’ve envisioned a new way of approaching incident response automation and incident management. We imagined what the world would look like if we applied agile principles to traditional incident management and spiced it up with a dash of ‘SRE flavor.’ Our two main chefs were xMatters CTO Tobias Dunn-Krahn and CPO Doug Peete. You might say they came up with a new recipe for a manifesto that reads like this:
- Collaboration over process and planning
- Team autonomy over policies and standards
- Automation over documentation
- Continuous improvement over accountability
That is, while there is value in the items on the right, we value the items on the left more.
We let this new recipe guide our research and product development. We invested thousands of hours in research, focus groups, observation, and taste tests (a.k.a. mockups and shoulder surfs)! And the results? A delightful dish worthy of appreciation and ready to be served to site reliability engineers, software developers, DevOps practitioners, operations engineers, software architects and the digital transformation elite!
Today, we introduce our vision for adaptive incident management (and I pledge to stop stretching this cooking metaphor for the rest of this blog).
From Resolving Incidents to Continuous Improvement
Adaptive incident management reduces friction in the entire continuous software development cycle, enabling better customer experiences at a lower cost. It serves as the foundation for digital service resilience.
Adaptive incident management automates and guides remediation. It also enables teams to scale up or down and to collaborate across different groups, cultures, tools, processes and systems to resolve an incident in real time.
Imagine a future where we can lean on technology to automatically find issues and incidents and fix them before customers ever know, so technology teams can spend more time innovating. To get there, we need to get out of the firefight and shift left. The best next step toward digital service resilience is an adaptive approach to incident management.