Top 4 Tips for Managing On-Call Teams
The use of digital services is at an all-time high. In fact, 90% of consumers are using new digital services or applications to do their jobs as a result of the current stay-at-home period. For digital service providers, that makes reliability more valuable than ever: for 43% of consumers, the top reason they’d stop using a critical app or website is reliability problems. As a VP of engineering, former on-call engineer, and occasional incident commander (when my team needs the support), I know first-hand how IT teams struggle to manage issue resolution and incident response.
As managers, it’s our job to make sure engineers can manage incidents and drive reliability. DevOps first emerged in 2008 as a cultural shift to drive faster service delivery by removing barriers between teams. But it didn’t give managers clear guidance on what success looks like, or explain how to increase responsiveness to market demands. More recently, developers often find themselves doing the job of an ops engineer. With every engineer eligible to be on call, setting up a process for on-call teams can be tricky. As we found in a recent industry survey of IT professionals, 50% say capacity planning is among the most difficult and critical operations challenges they face.
Common Challenges for Organizations
- Increased complexity resulting in cognitive overload
The pace of digital services innovation continues to accelerate. To continually deliver a high-quality customer experience, many organizations are rapidly introducing new services, which must be supported. Seventy-seven percent of digital services organizations report their number of releases increased by at least 25% over the past three years. The role of those responsible for positive customer experiences is also evolving and expanding. Developers, who ideally should be focused on delivering innovation, now share in the burden of delivering an uninterrupted customer experience.
- A lack of contextual data for troubleshooting
To quicken resolution time, it takes more than just notifying the right people quickly. Context is critical for taking the appropriate immediate action. At least 25% of IT professionals say they spend more time resolving issues and incidents than producing new features and products. Without context, it takes longer to identify and begin the appropriate remediation process.
- Alert fatigue
A large global enterprise receives hundreds of alerts every day that look just like the ones it receives the day of a breach. The constant stream of alerts can cause engineers to check out, a syndrome commonly known as alert fatigue. Reacting to this unending influx of alerts uses your engineers’ time and resources, costs money, and can prevent your IT department from playing a more strategic role in your company’s success. The complexity of applications distributed across the enterprise prevents your engineers from getting away from the alert stream.
Why On-Call is Important for Managers
Incident response can’t even begin without an effective on-call process and system. Managers need to be able to see that incidents were resolved in a timely manner via service level indicators (SLIs) to meet service level objectives (SLOs). Then in a post-mortem they can identify areas for improvement and learn about which processes to optimize to prevent these occurrences in the future.
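To make the SLI and SLO relationship concrete, here is a minimal sketch in Python. The 60-minute limit, the 90% target, and the sample durations are illustrative assumptions, not figures from any survey or product: the SLI is the measured share of incidents resolved in time, and the SLO is the target that share has to meet.

```python
from datetime import timedelta

# Hypothetical SLO: 90% of incidents resolved within 60 minutes.
SLO_TARGET_RATIO = 0.90
SLO_RESOLUTION_LIMIT = timedelta(minutes=60)


def resolution_sli(durations, limit=SLO_RESOLUTION_LIMIT):
    """Return the fraction of incidents resolved within the limit (the SLI)."""
    if not durations:
        return 1.0  # no incidents in the period, so nothing missed the limit
    within_limit = sum(1 for duration in durations if duration <= limit)
    return within_limit / len(durations)


def meets_slo(durations, target=SLO_TARGET_RATIO):
    """Compare the measured SLI against the SLO target."""
    return resolution_sli(durations) >= target


# Illustrative resolution times, in minutes, for one reporting period.
sample = [timedelta(minutes=m) for m in (12, 45, 30, 95, 20)]
print(f"SLI: {resolution_sli(sample):.0%}, SLO met: {meets_slo(sample)}")
```

A period that misses the target, like the sample above, is a natural candidate for discussion in the post-mortem.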
4 Steps for On-Call Management Success
1. Automate Stakeholder Communications
During major incidents, stakeholders including customers, executives, partners, and others demand frequent status updates. Part of your major incident response plan needs to focus on communication with these key players. Automating the management of stakeholder identities and contact information is an obvious first step, because communicating with stakeholders manually is far too time-consuming and error-prone, especially during a major incident.
To help mitigate the damage of downtime or service degradation, you should also automate the content and frequency of communications during major incidents. Be prepared to answer any stakeholder questions to reassure them that all reasonable actions are being taken to address system weaknesses. Some executives want the nitty-gritty details, and others just want the highlights. Satisfy both needs with customized messaging sent automatically from initiation until resolution. xMatters uses Smart Notifications so that resolvers receive technical details while business stakeholders get status and impact updates.
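As a rough illustration of audience-specific messaging, the sketch below renders one incident record two ways. It is not the xMatters API: the `Incident` fields, the `render_update` function, and the audience names are hypothetical and exist only to show the idea of sending technical details to resolvers and impact summaries to business stakeholders.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields; a real incident record would carry more.
    title: str
    status: str
    impact: str
    technical_details: str


def render_update(incident: Incident, audience: str) -> str:
    """Build an update tailored to a "resolver" or "stakeholder" audience."""
    headline = f"[{incident.status.upper()}] {incident.title}"
    if audience == "resolver":
        return f"{headline}\nDetails: {incident.technical_details}"
    if audience == "stakeholder":
        return f"{headline}\nImpact: {incident.impact}"
    raise ValueError(f"unknown audience: {audience}")


incident = Incident(
    title="Checkout latency",
    status="investigating",
    impact="Some customers see slow checkouts",
    technical_details="p99 latency on the payment service is above 5s",
)
print(render_update(incident, "resolver"))
print(render_update(incident, "stakeholder"))
```

Running the same function on a schedule, from initiation until resolution, would cover the frequency side of automated updates as well.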
2. Assess Group and Individual Performance
Measuring group performance is key to identifying areas that need improving, assessing contributions, and exposing vulnerabilities. Understand if groups and individuals are processing or ignoring issues by assessing the percentage of notifications they responded to. Improve mean time to recovery (MTTR) by measuring the maximum and average time it takes users or groups to respond to an event. Monitoring team performance when fixing issues also enables managers to gauge who needs coaching. xMatters offers an Incident Timeline and performance analytics. You can measure the quality of a user’s response by assigning scores to response types based on how they impact the resolution process. For instance, you might define an escalation to someone else as a negative response when action is required.
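The measurements described above come down to simple arithmetic. The following Python sketch shows one way to compute them, assuming a hypothetical `Notification` record and made-up response scores (neither is taken from xMatters): the response rate, the average and maximum response times, and a quality score in which an escalation counts as a negative response.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Notification:
    recipient: str
    response_seconds: Optional[float]    # None means no response at all
    response_type: Optional[str] = None  # e.g. "acknowledge", "escalate"


# Hypothetical scores: an escalation counts against the responder.
RESPONSE_SCORES = {"acknowledge": 1, "resolve": 2, "escalate": -1}


def summarize(notifications):
    """Compute response rate, average/max response time, and a quality score."""
    responded = [n for n in notifications if n.response_seconds is not None]
    times = [n.response_seconds for n in responded]
    return {
        "response_rate": len(responded) / len(notifications) if notifications else 0.0,
        "avg_response_seconds": mean(times) if times else None,
        "max_response_seconds": max(times) if times else None,
        "quality_score": sum(RESPONSE_SCORES.get(n.response_type, 0) for n in responded),
    }


sample = [
    Notification("alice", 60, "acknowledge"),
    Notification("alice", 300, "escalate"),
    Notification("alice", None),
    Notification("alice", 120, "resolve"),
]
print(summarize(sample))
```

Grouping notifications by recipient or by team before calling `summarize` gives the per-user and per-group views a manager would compare.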
3. Monitor How Incidents Are Resolved
During a single high-level incident, messages are sent to the IT/DevOps teams, to executive communications, and to customers. A major incident can include several distinct events. In the middle of the chaos, the crossfire of messages can be confusing. Incident managers can have a difficult time figuring out the exact cause-and-effect dynamic until each event is resolved, and sometimes not even then. The complexity can lengthen the time to resolve the entire incident and limit the ability to improve processes going forward.
The xMatters Event Timeline feature provides a real-time chronological and visual representation of when and how recipients are notified for an event. You can use the timeline during an active event to quickly gain insight into the way it is progressing and to see who is actively working to resolve the issue. After an event, the timeline is also a post-mortem tool to help review how the event was handled.
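The underlying idea of a timeline can be sketched in a few lines of Python: merge the messages from every channel into one list ordered by time. This is only an illustration of the concept, not how the xMatters feature is implemented. The `Event` type, the channel names, and the sample messages are assumptions.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    at: datetime
    channel: str   # hypothetical channel names: "devops", "executive", "customer"
    message: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Merge events from all channels into one chronological list."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda event: event.at)


def format_timeline(events: Iterable[Event]) -> List[str]:
    """Render each event as 'HH:MM [channel] message'."""
    return [f"{e.at:%H:%M} [{e.channel}] {e.message}" for e in events]


devops = [
    Event(datetime(2020, 5, 1, 9, 2), "devops", "On-call engineer paged"),
    Event(datetime(2020, 5, 1, 9, 20), "devops", "Fix deployed"),
]
executive = [Event(datetime(2020, 5, 1, 9, 10), "executive", "Impact summary sent")]
customer = [Event(datetime(2020, 5, 1, 9, 15), "customer", "Status page updated")]

timeline = build_timeline(devops, executive, customer)
for line in format_timeline(timeline):
    print(line)
```

With every message in one ordered list, the cause-and-effect sequence across teams is easier to read during the incident and again in the review.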
4. Review the Resolution Process
A review is a fundamental piece of the incident resolution process, and all relevant parties should attend. The major incident manager and the problem manager should walk the group through the incident record, so they can assess the resolution process together. The review can also identify improvements that can prevent a similar incident from occurring again.
Ticketing systems like Jira Service Desk and Zendesk have become the gold-standard source of truth for DevOps teams. Teams rely heavily on tickets during post-mortems so they can identify repeatable processes that can be put in place to help prevent similar incidents in the future. That means tickets should ideally include all incident details, as well as the remediation steps and timeline. But the reality is that while they’re in the middle of trying to resolve an incident, few people take the time to create and update tickets. Instead, they often spend extra time post-resolution manually transferring incident details from multiple systems, which is both inefficient and prone to error.
That’s why bi-directional integration is paramount. When integrated with xMatters, Jira steps become part of an automated toolchain process, allowing xMatters to create tickets. xMatters automatically appends and updates incident management systems (from chat to service desk to monitoring) with the incident information and the steps taken to resolve it, so development teams can easily refer to it during the post-mortem meeting. Furthermore, as xMatters retains historical incident data, you can quickly cross-reference similar incidents to identify patterns and better prevent future incidents.
Simplify On-Call for Faster Response
Centralizing on-call scheduling and automating notifications are critical to reducing incident response times. The demand for digital services in the midst of the novel coronavirus is anticipated to continue well beyond containment of the virus. More specifically, in our study, Impact of COVID-19 on Digital Transformation, we found that 82% of consumers will continue to use websites or mobile applications to complete tasks in the same capacity after the current stay-at-home period is lifted.
xMatters is built for managing on-call teams successfully. Our integration platform automates the process of sharing data between tools and identifying the on-call team members to manage the resolution process. Actionable alerts enable team members to do work without toggling between systems, saving valuable time and easily engaging other team members while also preserving conversations and other information. These pieces of information keep customers, team members, and executives informed of progress, keep relevant information in one place for ease of use, and leave a repository of data for the post-mortem.
Transforming your digital services starts with leveling up your on-call practices with xMatters. Get started with xMatters for free, today.