Top 5 Tips for Being On-Call (and Keeping Your Sanity)
I spent plenty of years as an on-call engineer, so any discussion of on-call success is especially near and dear to my heart. Although my primary role today is VP of Engineering, I’m also incident commander when necessary, which makes me accountable for the services of my teams. I have a pretty unique perspective on being on-call, incident response, and remediation.
Any degradation – anything that’s not working – is a major liability for any business. Teams on call are under the gun to support fail-safe incident response, but many of them don’t have the tools or processes to do it efficiently. And now, since COVID-19 triggered a sudden mass adoption of digital services, DevOps teams are feeling more pressure than ever.
In a recent xMatters survey of 300 technology professionals, 88% have seen increased use of their digital services by customers as a result of current work-from-home mandates. Reliability has never been more important. In fact, 79% of IT professionals say the remote work environment has increased the importance of IT infrastructure security and privacy in their organizations.
In the last 10 years, developers and ops engineers have been joining forces as DevOps in an increasingly complex environment. With this adoption of DevOps practices, the responsibilities of developing features and deploying them now fall on the same person rather than separate teams. Developers now find themselves being paged after hours when their services aren’t behaving exactly as they expected.
Common Challenges of Being On-Call
- Manual processes: Teams that are just getting started with on-call usually run everything manually before adopting purpose-built tools. Newcomers often end up managing schedules in spreadsheets, unsure who to page or when to escalate issues.
- Duplicate alerts: Duplicate alerts from multiple tools can inundate teams when an incident occurs. For example, a DevOps team might have three different monitoring tools all going off at the same time, about the same thing, on repeat until the issue is fixed. That noise keeps on-call responders from getting to the root cause.
- Lack of context: When an alert finally gets to the right person, it’s not enough to know the status of a service. Responders need context, such as which service is affected, where it lives, what system is surfacing this incident, whether other systems are impacted, and whether something happened in another system (for example, a code deployment) that could be the culprit. Without this context, on-call responders are forced to log into multiple systems and diagnose through trial and error instead of taking immediate action.
5 Steps for On-Call Success
1. Set your schedule
On evenings, weekends, and holidays, group members should be taking turns being the primary on-call responder. This distributes the responsibility of responding to notifications more evenly among members and minimizes possible burnout.
Creating schedules can feel like putting together a jigsaw puzzle, taking into account several factors and considerations related to team members’ availability, observed holidays, global distribution of teams, anticipated spikes in load, and unexpected absences, just to name a few. Keeping track of shifts and coverage can quickly spin out of control into a series of spreadsheets that have to be manually maintained. To simplify the process of maintaining groups and group memberships, xMatters allows you to schedule your on-call teams as simply as creating events in your calendar.
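To make the idea of taking turns concrete, here is a minimal sketch of a round-robin rotation in Python. It is only an illustration and not how xMatters schedules shifts: the function name `build_rotation`, the member names, and the one-week shift length are all assumptions, and it ignores holidays, time zones, and absences.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(members, first_day, shifts, shift_days=7):
    """Assign primary on-call shifts round-robin.

    Returns a list of (start, end, member) tuples, where end is the
    last day of the shift (inclusive).
    """
    if not members:
        raise ValueError("at least one team member is required")
    rotation = []
    start = first_day
    # zip stops after `shifts` entries even though cycle() never ends.
    for _, member in zip(range(shifts), cycle(members)):
        end = start + timedelta(days=shift_days - 1)
        rotation.append((start, end, member))
        start = end + timedelta(days=1)
    return rotation


# Hypothetical three-person team, four one-week shifts.
schedule = build_rotation(["ana", "ben", "chen"], date(2024, 1, 1), shifts=4)
for start, end, member in schedule:
    print(f"{start} to {end}: {member}")
```

With three members and four shifts, the fourth shift wraps back around to the first person, which is the even distribution described above.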
2. Set up escalation paths
A notification is like a tree falling in the forest: if it’s sent, does anyone hear it? Not always. That is why escalations exist. They keep paging team members until someone responds. For example, if the first person doesn’t answer within 3 minutes, the system pages the next person, and the next, and so on until someone responds.
In my experience, the best practice is to notify just one or two people – the ones most likely to know how to fix a problem – and give them time to respond, or possibly even fix the problem, before notifying anyone else in the group. Simply waiting five minutes before notifying the next person hugely reduces on-call confusion. At xMatters we refer to this as an escalation path.
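An escalation path boils down to an ordered list of people with a wait time after each one. The Python sketch below shows that logic under some assumptions: the `Step` class, the `paged_by` function, and the responder names are hypothetical, and a real system would also track acknowledgements and actually send the pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    responder: str
    wait_minutes: int  # how long to wait for this person before escalating


def paged_by(path, elapsed_minutes):
    """Return everyone who should have been paged after elapsed_minutes
    with no acknowledgement, in the order they were paged."""
    paged = []
    page_time = 0  # minute at which the current step gets paged
    for step in path:
        if elapsed_minutes < page_time:
            break
        paged.append(step.responder)
        page_time += step.wait_minutes
    return paged


# Hypothetical path: wait five minutes before moving to the next person.
path = [Step("primary", 5), Step("secondary", 5), Step("team-lead", 10)]

print(paged_by(path, 0))   # only the primary has been paged
print(paged_by(path, 5))   # five minutes of silence: secondary is paged too
print(paged_by(path, 60))  # everyone on the path has been paged
```

Keeping the wait on each step, rather than one global timeout, lets a team give the most likely fixer more time than the people further down the path.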
3. Leverage event flood control
Tool proliferation has become a double-edged sword for the agile enterprise. As DevOps teams increase their adoption of useful apps to accelerate software development and delivery, they can add complexity and exacerbate the challenges of incident management. In the survey referenced earlier, 36% of technology professionals say understanding incident management and issue resolution are important new responsibilities for their success.
Teams may be overwhelmed by hundreds of digital events within minutes of an issue, preventing them from focusing exclusively on resolution. xMatters Event Flood Control suppresses similar requests in close succession from noisy systems, minimizing information overload in real time so teams can focus on what needs fixing. By intelligently correlating events, xMatters successfully reduces unnecessary alerts by 90% or more.
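The general idea of suppressing similar events in close succession can be sketched in a few lines of Python. This is not the xMatters Event Flood Control algorithm, which is not described here. It is a simplified assumption in which events are "similar" when they share a key, and a repeat is dropped if it arrives within a fixed window of the previous event with that key. The class name, keys, and 60-second window are all hypothetical.

```python
class FloodControl:
    """Suppress events that repeat the same key within a quiet window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._last_seen = {}  # key -> timestamp of the most recent event

    def should_notify(self, key, timestamp):
        last = self._last_seen.get(key)
        # Record every event, so a noisy source stays suppressed
        # until it has been quiet for a full window.
        self._last_seen[key] = timestamp
        return last is None or timestamp - last >= self.window_seconds


control = FloodControl(window_seconds=60)
events = [
    ("db-01:disk_full", 0),
    ("db-01:disk_full", 10),   # repeat from a second tool
    ("db-01:disk_full", 20),   # repeat from a third tool
    ("web-02:high_latency", 25),
    ("db-01:disk_full", 200),  # quiet for long enough, notify again
]
forwarded = [event for event in events if control.should_notify(*event)]
print(forwarded)
```

Of the five events in this example, three are forwarded: the first disk alert, the latency alert, and the disk alert that arrives after a long quiet period.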
4. Get context-rich notifications
Without context, notifications are simply paging someone to get their attention. More than 25% of IT professionals say they spend more time resolving issues and incidents since COVID-19. When your monitoring tools are screaming at you, it’s hard to make sense of it all. xMatters’ Smart Notifications distill incident alert information from your monitoring and issue tracking tools into meaningful, actionable notifications.
With a fuller context, you can engage the required people, gain situational perspective, and trigger resolution steps directly from notifications. Smart notifications go way beyond “accept” and “reject.” Your team members can click responses with customized actions designed to eliminate manual work and advance workflows.
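As a rough illustration of what "context" means in practice, the Python sketch below merges an alert with recent deployments to the same service, so the responder sees a likely culprit without logging into another tool. The field names, the `build_notification` function, and the 30-minute lookback are assumptions made for the example, not the format of xMatters Smart Notifications.

```python
def build_notification(alert, deployments, lookback_minutes=30):
    """Combine an alert with recent deployments to the same service."""
    recent = [
        d for d in deployments
        if d["service"] == alert["service"]
        and 0 <= alert["minute"] - d["minute"] <= lookback_minutes
    ]
    lines = [
        f"{alert['severity'].upper()}: {alert['summary']}",
        f"Service: {alert['service']} ({alert['host']})",
        f"Reported by: {alert['source']}",
    ]
    for d in recent:
        lines.append(f"Recent deploy: {d['version']} at minute {d['minute']}")
    return "\n".join(lines)


# Hypothetical data; times are minutes since midnight to keep the sketch short.
alert = {
    "severity": "critical",
    "summary": "Checkout error rate above 5%",
    "service": "checkout",
    "host": "web-02",
    "source": "monitoring",
    "minute": 120,
}
deployments = [
    {"service": "checkout", "version": "v2.4.1", "minute": 105},
    {"service": "search", "version": "v1.9.0", "minute": 110},
    {"service": "checkout", "version": "v2.4.0", "minute": 30},
]
message = build_notification(alert, deployments)
print(message)
```

Only the checkout deploy from fifteen minutes before the alert makes it into the message. The search deploy and the much older checkout deploy are left out, which keeps the notification short enough to act on.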
5. Customize response options
With so many teams and tools involved in delivering digital services, aligning resources to fix problems needs to be fast, efficient, reliable, and repeatable across tools, teams, and time zones. Traditionally, incident monitoring tools alert someone to respond, and that person subsequently has to manually log in and look at what the issue is to respond. Streamlining complexity and reducing mean time to resolution shouldn’t require manual steps and endless coding.
A workflow engine automatically retrieves the manual steps necessary for resolving a particular issue and presents the responder with those steps upon alerting them. In doing so, a workflow engine proposes how to resolve an issue and lets the responder make the final decision with the touch of a button. xMatters developed Flow Designer as a drag-and-drop visual workflow builder that includes configurable logic for different alert types. On-call responders receive the context-rich notification and are prompted with response option buttons to launch remediation steps and workflows. Having the flexibility to build unique workflows to suit specific needs allows users to synchronize systems and guide their team through incident resolution.
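Conceptually, a response option is a label on a button that maps to an action. The Python sketch below shows that mapping in its simplest form. It is not Flow Designer, which is a visual builder. The option labels, the action functions, and `handle_response` are hypothetical, and the actions only return a description instead of calling real systems.

```python
def restart_service(alert):
    return f"restart requested for {alert['service']}"


def rollback_deploy(alert):
    return f"rollback requested for {alert['service']}"


def escalate(alert):
    return f"escalated {alert['service']} to the next responder"


# Each label is what the responder would see on a button in the notification.
RESPONSE_OPTIONS = {
    "Restart service": restart_service,
    "Roll back last deploy": rollback_deploy,
    "Escalate": escalate,
}


def handle_response(alert, choice):
    """Run the action behind the response option the responder picked."""
    try:
        action = RESPONSE_OPTIONS[choice]
    except KeyError:
        raise ValueError(f"unknown response option: {choice!r}") from None
    return action(alert)


result = handle_response({"service": "checkout"}, "Roll back last deploy")
print(result)
```

Because the options live in one table, adding a new button for a different alert type means adding one entry rather than changing the handler.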
Remove on-call confusion
When you become aware of an incident, finding the right people is one of the most important steps in resolving it. If the data your incident team needs is locked within a monitoring solution or another tool, the team could experience delays gaining access and moving the data to other tools.
To better manage the resolution process, xMatters developed an integration platform that automates sharing data between tools and identifying the on-call team members who need to be involved. Saving time is a critical part of on-call success for responders. Keeping information in one place makes it easier for customers, team members, and executives to stay informed, and it provides a repository of data for the post mortem.
On-call (and beyond) at the touch of a button
Getting an alert is just the first step when fixing service interruptions. Only xMatters on-call lets you get an alert, press a button, and restore service. Try it with xMatters Free.