Top 4 Tips for Managing On-Call Teams

Nick Fletcher on Jun 11, 2020

In my last blog post, I gave some tips for being on-call without losing your mind. In this article, I’ll discuss managers’ responsibilities in managing on-call teams.

The use of digital services is at an all-time high. In fact, 90% of consumers are taking advantage of new digital services or applications to do their jobs as a result of the current stay-at-home period. For digital service providers, reliability has never been more valuable: for 43% of consumers, the top reason they’d stop using a critical app or website is poor reliability. As a VP of engineering, a former on-call engineer, and an incident commander when my team needs the support, I know first-hand how IT teams struggle to manage issue resolution and incident response.

As managers, it’s our job to make sure engineers can manage incidents and drive reliability. DevOps first emerged in 2008 as a cultural shift to drive faster service delivery by removing barriers between teams. But it didn’t give managers clear guidance on what success looks like, or explain how to increase responsiveness to market demands. More recently, developers often do the job of an ops engineer as well. With every engineer eligible to be on-call, setting up a process for on-call teams can be tricky. As we found in a recent industry survey of IT professionals, 50% say capacity planning is among the most difficult and critical operations challenges they face.

Common Challenges for Organizations

  1. Increased complexity resulting in cognitive overload
    The pace of digital services innovation continues to accelerate. To continually deliver a high-quality customer experience, many organizations are rapidly introducing new services, which must be supported. Seventy-seven percent of digital services organizations report their number of releases increased by at least 25% over the past three years. The role of those responsible for positive customer experiences is also evolving and expanding. Developers, who ideally should be focused on delivering innovation, now share the burden of delivering an uninterrupted customer experience.
  2. A lack of contextual data for troubleshooting
    Shortening resolution time takes more than notifying the right people quickly. Context is critical for taking the appropriate immediate action. At least 25% of IT professionals say they spend more time resolving issues and incidents than producing new features and products. Without context, it takes longer to identify and begin the appropriate remediation process.
  3. Alert fatigue
    A large global enterprise receives hundreds of alerts every day that look just like the ones it receives the day of a breach. The constant stream of alerts can cause engineers to check out, a syndrome commonly known as alert fatigue. Reacting to this unending influx of alerts uses your engineers’ time and resources, costs money, and can prevent your IT department from playing a more strategic role in your company’s success. The complexity of applications distributed across the enterprise prevents your engineers from getting away from the alert stream.

Why On-Call is Important for Managers

Incident response can’t even begin without an effective on-call process and system. Managers need service level indicators (SLIs) that show whether incidents were resolved quickly enough to meet service level objectives (SLOs). Then, in a post-mortem, they can identify areas for improvement and learn which processes to optimize to prevent similar incidents in the future.
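
As a rough illustration of checking a resolution-time SLI against an SLO, here is a minimal Python sketch. The Incident record, the one-hour target, and the 90% objective are all assumptions made up for this example; they are not taken from xMatters or from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record with open and resolve timestamps."""
    opened_at: datetime
    resolved_at: datetime


def resolution_sli(incidents, target):
    """Return the fraction of incidents resolved within the target duration."""
    if not incidents:
        return 1.0  # no incidents, so nothing missed the target
    on_time = sum(1 for i in incidents if i.resolved_at - i.opened_at <= target)
    return on_time / len(incidents)


def meets_slo(sli, objective):
    """True when the measured SLI is at or above the objective."""
    return sli >= objective


# Example data (invented): four incidents, one of which took 90 minutes.
start = datetime(2020, 6, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=20)),
    Incident(start, start + timedelta(minutes=45)),
    Incident(start, start + timedelta(minutes=90)),
    Incident(start, start + timedelta(minutes=30)),
]

sli = resolution_sli(incidents, target=timedelta(hours=1))
print(f"SLI: {sli:.0%}, SLO met: {meets_slo(sli, objective=0.9)}")
```

In this made-up data set, three of four incidents were resolved within the hour, so the SLI falls short of a 90% objective. That gap is the kind of signal a post-mortem would dig into.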

4 Steps for On-Call Management Success

  1. Send Stakeholder Notifications
    A mobile stakeholder alert lets people know what the issue is, and where they can join the conversation

    During major incidents, stakeholders including customers, executives, partners, and others demand frequent status updates. Part of your major incident response plan needs to focus on communication with these key players. Automating the lookup of your stakeholders’ identities and contact information is the obvious first step: communicating with stakeholders manually is too time-consuming and error-prone, especially during a major incident. To help mitigate the damage of downtime or service degradation, you should also automate the content and frequency of communications during major incidents. Be prepared to answer any stakeholder questions to reassure them that all reasonable actions are being taken to address system weaknesses. Some executives want the nitty-gritty details, and others just want the highlights. Satisfy both needs with customized messaging sent automatically from initiation until resolution. xMatters uses Smart Notifications so that resolvers receive technical details while business stakeholders get status and impact updates.

  2. Assess Group and Individual Performance
    Measuring group performance is key to identifying areas that need improvement, assessing contributions, and exposing vulnerabilities. To understand whether groups and individuals are acting on issues or ignoring them, assess the percentage of notifications they responded to. Improve mean time to recovery (MTTR) by measuring the maximum and average time it takes users or groups to respond to an event. Monitoring team performance when fixing issues also enables managers to gauge who needs coaching. xMatters offers an Incident Timeline and performance analytics. You can measure the quality of a user’s response by assigning scores to response types based on how they impact the resolution process. For instance, you might define an escalation to someone else as a negative response when action is required.
View Group Performance analytics to learn how an issue was resolved to inform future process improvements

  3. Monitor how incidents are resolved
    During a single high-level incident, messages are sent to the IT/DevOps teams, to executive communications, and to customers. A major incident can include several distinct events. In the middle of the chaos, the crossfire of messages can be confusing. Incident managers can have a difficult time figuring out the exact cause-and-effect dynamic until each event is resolved, and sometimes not even then. The complexity can lengthen the time to resolve the entire incident and limit the ability to improve processes going forward. The xMatters Event Timeline feature provides a real-time chronological and visual representation of when and how recipients are notified for an event. You can use the timeline during an active event to quickly gain insight into the way it is progressing and to see who is actively working to resolve the issue. After an event, the timeline is also a post-mortem tool to help review how the event was handled.
An incident timeline shows resources, notifications, and their response times

  4. Review the resolution process
    A large global enterprise receives hundreds of alerts every day that look just like the ones it receives the day of a breach. The constant stream of alerts can cause engineers to check out, a syndrome commonly known as alert fatigue. Reacting to this unending influx of alerts uses your engineers’ time and resources, costs money, and can prevent your IT department from playing a more strategic role in your company’s success. The complexity of applications distributed across the enterprise prevents your engineers from getting away from the alert stream.
xMatters appends the Jira issue with comments per the resolution process
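
The response metrics described in step 2 (percentage of notifications responded to, average and maximum response time, and scored response types) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Notification record, the response type names, and the score values are hypothetical and do not reflect the xMatters data model or API.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional

# Hypothetical scores: an escalation counts as negative when action was required.
RESPONSE_SCORES = {"accept": 1, "resolve": 1, "escalate": -1}


@dataclass
class Notification:
    """Hypothetical record of one notification sent to one user."""
    user: str
    response: Optional[str]              # None means the user never responded
    seconds_to_respond: Optional[float]  # None when there was no response


def summarize(notifications):
    """Compute response rate, average and maximum response time, and a quality score."""
    responded = [n for n in notifications if n.response is not None]
    times = [n.seconds_to_respond for n in responded]
    return {
        "response_rate": len(responded) / len(notifications) if notifications else 0.0,
        "avg_seconds": mean(times) if times else None,
        "max_seconds": max(times) if times else None,
        # Unknown response types score zero rather than raising an error.
        "quality_score": sum(RESPONSE_SCORES.get(n.response, 0) for n in responded),
    }


# Example data (invented): four notifications sent to one on-call engineer.
log = [
    Notification("ana", "accept", 60.0),
    Notification("ana", "escalate", 300.0),
    Notification("ana", None, None),
    Notification("ana", "resolve", 120.0),
]

summary = summarize(log)
print(summary)
```

A low response rate or a negative quality score does not prove anything on its own, but it can point a manager toward the people or groups where a coaching conversation is worth having.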

Simplify On-Call for Faster Response

Centralizing on-call scheduling and automating notifications are critical to reducing incident response times. Demand for digital services, which has surged during the novel coronavirus pandemic, is expected to continue well beyond containment of the virus. More specifically, in our study, Impact of COVID-19 on Digital Transformation, we found that 82% of consumers will continue to use websites or mobile applications to complete tasks in the same capacity after the current stay-at-home period is lifted.

xMatters is built to help you manage on-call teams successfully. Our integration platform automates the process of sharing data between tools and identifying the on-call team members to manage the resolution process. Actionable alerts enable team members to do work without toggling between systems, saving valuable time and making it easy to engage other team members while preserving conversations and other information. This information keeps customers, team members, and executives informed of progress, keeps relevant details in one place for ease of use, and leaves a repository of data for the post-mortem.

Transforming your digital services starts with leveling up your on-call practices with xMatters. Get started with xMatters for free, today.
