Uptime Blog

Secrets of On-Call

Todd Crane ON Aug 26, 2020

Before joining xMatters as a solution architect, I spent over five years at a major healthcare company. As someone passionate about process efficiency, I saw improving the incident management process and unlocking the secrets of on-call as an exciting opportunity. I quickly learned that the service management team (tasked with owning incident management) worked across several other parts of the organization – including database, network, application, security, and DevOps teams – but each of those teams operated with its own tools and processes.

When I first joined the company, I got to see first-hand how it struggled to fix issues quickly. When an incident arose, the service management team spent those first critical minutes (sometimes hours) digging through spreadsheets to find the right person to call (and then the next right person to call, when the first responder was unresponsive). Lack of accountability was often a blocker to assessing performance and improving processes. Responders could claim they never received an alert (and our system could not record whether they did, in fact, receive it).

We had no centralized on-call schedules, no accountability from responders, and no reporting. We needed a better way to do on-call.

Problem #1: We couldn’t tell which alerts were acknowledged (and which weren’t)

The Solution: When we implemented xMatters for on-call management, our responders quickly realized that the system could tell us who was notified (and who responded), and that hitting “acknowledge” kept issues from being escalated due to lack of response. (Because who really wants their manager to be the one getting woken up because the team went radio silent?)
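To make the acknowledge-or-escalate idea concrete, here is a minimal Python sketch. It is not xMatters code and does not use the xMatters API; the `Alert` and `Notification` classes, the five-minute timeout, and the responder names are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Notification:
    responder: str
    sent_at: float                          # seconds since the alert fired
    acknowledged_at: Optional[float] = None


@dataclass
class Alert:
    escalation_chain: List[str]             # hypothetical order: who gets paged next
    timeout: float = 300.0                  # assumed: escalate after 5 minutes of silence
    log: List[Notification] = field(default_factory=list)

    def notify_next(self, now: float) -> Optional[str]:
        """Page the next person in the chain and record that we did."""
        if len(self.log) >= len(self.escalation_chain):
            return None                     # nobody left to escalate to
        responder = self.escalation_chain[len(self.log)]
        self.log.append(Notification(responder, now))
        return responder

    def acknowledge(self, responder: str, now: float) -> bool:
        """Record an acknowledgement from someone who was actually notified."""
        for note in self.log:
            if note.responder == responder and note.acknowledged_at is None:
                note.acknowledged_at = now
                return True
        return False

    @property
    def acknowledged(self) -> bool:
        return any(n.acknowledged_at is not None for n in self.log)

    def tick(self, now: float) -> Optional[str]:
        """Escalate only if the latest page sat unacknowledged past the timeout."""
        if self.acknowledged or not self.log:
            return None
        if now - self.log[-1].sent_at >= self.timeout:
            return self.notify_next(now)
        return None
```

Because every page and acknowledgement lands in `log`, the same record that drives escalation also answers "who was notified, and who responded?" after the fact.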

Timeline: Gain ‘play-by-play’ visibility into event progression and valuable information for post-mortem analysis to improve resolution processes. Monitor team performance when fixing issues to gauge who needs coaching.

 

Problem #2: Who was actually on-call?

The Solution: One of the biggest pain points was just clocking in. People had to clock in to signify they were on call, then they had to log into the scheduling system and enter their time so they could get paid. We configured an xMatters “clock in” response option which synced the two systems. Triggering both systems with the push of a button was an efficiency win that made our on-call resources’ jobs that much easier.

Escalations: Automate escalations to prevent issues from going unacknowledged and get the help you need to fix things fast.

Problem #3: We had to manually create and update service desk tickets related to the incident

The Solution: We integrated xMatters with our service desk so that from the xMatters notification, on-call personnel could push a button and create (or update) the service desk ticket. Automating these formerly manual steps, which saved on-call resources from logging into separate systems to enter all the incident details, was a huge process improvement.

Self-service: Easily manage shifts, mark absences, take action on alerts, and view reports – all from your mobile app.

Problem #4: How do we collaborate (easily) on the fix?

The Solution: Our next integration with xMatters was to our ChatOps tool (where many of our teams liked to do their work). Automatically spinning up (and sharing) a chat channel for the incident made it easy for teams to swarm, and the xMatters chatbot allowed us to quickly reference our on-call schedule to see who else needed to join the conversation.

From the xMatters Slack bot, execute functions across your incident management toolchain to orchestrate and resolve issues without leaving chat.

Problem #5: How did we do at resolving the incident?

The Solution: We were moving toward blameless postmortems that focused on metrics we could learn from (and improve). We made extensive use of analytics and postmortem capabilities in xMatters. With these capabilities, we could clearly see what happened throughout the incident and which on-call resources and applications were involved. Yes, it’s important to understand mean time to resolution – but getting the full picture was key to learning how to do better next time.
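For readers who have not computed these metrics before, here is a rough Python sketch of mean time to acknowledge and mean time to resolution. The incident records, field names, and timestamps are invented for illustration; they are not data from my former company, and this is not how xMatters computes its reports.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records with ISO 8601 timestamps.
incidents = [
    {"id": "INC-1", "opened": "2020-08-01T02:00:00",
     "acknowledged": "2020-08-01T02:04:00", "resolved": "2020-08-01T03:00:00"},
    {"id": "INC-2", "opened": "2020-08-03T14:00:00",
     "acknowledged": "2020-08-03T14:10:00", "resolved": "2020-08-03T15:30:00"},
    {"id": "INC-3", "opened": "2020-08-07T09:00:00",
     "acknowledged": "2020-08-07T09:01:00", "resolved": "2020-08-07T09:30:00"},
]


def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60


def mean_time(records, start_key: str, end_key: str) -> float:
    """Average elapsed minutes between two fields across all records."""
    return mean([minutes_between(r[start_key], r[end_key]) for r in records])


mtta = mean_time(incidents, "opened", "acknowledged")   # mean time to acknowledge
mttr = mean_time(incidents, "opened", "resolved")       # mean time to resolution
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

A single average hides a lot, which is the point made above: the timeline of who was paged, who responded, and which systems were involved tells you why a number is high, not just that it is.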

Use xMatters Time Machine like instant replay for incident management. See what tactics and strategies your teams used to handle incidents—learn what worked… and what really didn’t.

Fixing all these major on-call process efficiency issues with xMatters set my former company up to deliver better customer experiences (and made our on-call resources’ lives easier). Nary a tear was shed when we retired the spreadsheets; managers did not miss the middle-of-the-night phone calls; and post-incident discussions became productive and informed.

When I finally departed this company after five years, it was to double down on the solution that allowed me to effect such meaningful change: xMatters. I’d learned the secrets of on-call management at my former company, and was now excited to help others do the same. As someone who was an xMatters customer first, I know what it’s like to struggle with spreadsheets and manual toil – and how to solve those problems with one solution.

xMatters is free to try for as long as you want for teams of up to 10 people. You get unlimited integrations like Cloud Run that you can use with Flow Designer for all your incident response and management needs.

Ready to unlock the biggest secrets of on-call made easy?

Try xMatters Free