Uptime Blog

IT Alerting & Incident Management

It seems like every day a new term or phrase pops up in IT alerting and event management, and everyone uses it to mean something different: AIOps, intelligent routing, signal control. A definition can be hard to nail down, and even then it can change when you’re not paying attention.

We thought it might be useful to help surface a few of the most common terms and provide some definitions and context to help navigate the world of IT alerting and event management. While we can’t necessarily speak for how anyone else may define these terms, we can explain how we use them here at xMatters and why they’re important.

IT Alerting and Event Management Terminology

There are a few basic IT alerting terms with relatively well-established definitions, so let’s get those out of the way first.

What’s an event?

An event is any discernible change in the state or behavior of an observed system. That’s pretty much it—the change may have an impact on the management of the infrastructure or the delivery of an IT service, but it may also be entirely benign. The most important parts of the definition are “discernible” and “observed”, which imply that the change must be noticeable and that the system is important enough to be monitored for these changes.

What’s a signal?

A monitoring tool or management system can transmit information about the nature, details, and potential impact of an event as a signal to another system, such as xMatters. An integration is essentially the process of configuring applications to send signals to and receive signals from xMatters. If you’re sending signals from system to system across multiple applications, you’ve created a toolchain.
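To make the distinction between an event and a signal concrete, here is a minimal Python sketch. It assumes a monitor that compares the previous and current state of a service, and the Signal class, its field names, and the "example-monitor" source are all hypothetical. They do not reflect the xMatters payload format or any real monitoring tool.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Signal:
    """Hypothetical signal payload; the field names are illustrative only."""
    source: str
    service: str
    previous_state: str
    current_state: str
    detail: str


def detect_event(service: str, previous_state: str, current_state: str) -> Optional[Signal]:
    """Return a signal if the observed state changed (an event), otherwise None."""
    if previous_state == current_state:
        return None  # no discernible change, so there is no event to report
    return Signal(
        source="example-monitor",  # assumed name for the monitoring tool
        service=service,
        previous_state=previous_state,
        current_state=current_state,
        detail=f"{service} changed from {previous_state} to {current_state}",
    )


def to_json(signal: Signal) -> str:
    """Serialize the signal so it can be transmitted to another system."""
    return json.dumps(asdict(signal), sort_keys=True)


signal = detect_event("payments", "up", "down")
payload = to_json(signal) if signal else None
```

In this sketch the event is the state change itself, while the signal is the structured description of that change that travels to the next system in the toolchain.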

What’s an alert?

When someone needs to contact an on-call resource, or a signal triggers an automated workflow, or any other time notifications are required, xMatters generates an alert. Alerts can be expanded to include additional information about the original event or about other events in the system, and they can be filtered or suppressed if they’re repetitive or redundant. The main purpose of an alert is to let agents or automated processes decide whether an event requires action.

At its heart, an alert is essential data: not just information that an event occurred, but that it occurred at this time in this system with these parameters. With xMatters, incident commanders can also raise alerts to engage resolvers and stakeholders in incident resolution.
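The idea of suppressing repetitive alerts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how xMatters implements suppression: the AlertFilter class, its five-minute default window, and the (service, condition) key are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class AlertFilter:
    """Suppress repeat signals for the same service and condition within a time window."""
    window_seconds: float = 300.0  # assumed default, chosen only for illustration
    _last_alerted: Dict[Tuple[str, str], float] = field(
        default_factory=dict, init=False, repr=False
    )

    def should_alert(self, service: str, condition: str, timestamp: float) -> bool:
        """Return True if this signal should produce a new alert."""
        key = (service, condition)
        last = self._last_alerted.get(key)
        if last is not None and timestamp - last < self.window_seconds:
            return False  # a matching alert was raised recently, so suppress this one
        self._last_alerted[key] = timestamp
        return True


alert_filter = AlertFilter(window_seconds=300)
# Four signals for the same problem, at 0, 60, 120, and 400 seconds.
decisions = [alert_filter.should_alert("payments", "timeout", t) for t in (0, 60, 120, 400)]
```

Here the window is measured from the last alert that was actually raised, so a steady stream of identical signals still produces a fresh alert once the window has elapsed.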

Some long-time xMatters customers might be scratching their heads right now, wondering why this definition doesn’t match their experience. This is probably because we’ve been using the phrase “injected an event” instead of “sent a signal” for quite a few years now. We’ve also used “event” to refer to these capsules of information within xMatters instead of using the correct term, “alert”. Look no further than the not-so-aptly named Recent Events report. Which is just to show that we’re not infallible! We’ll be making some changes soon to bring our interface and internal references more in line with these standard definitions. Stay tuned for more information.

What’s a notification?

A notification is the representation of an alert intended to be received by a human, formatted for a specific device. If you’re reading an email, listening to a voice call, or typing out an SMS, you’re dealing with a notification. Notifications are also sometimes called messages, but it’s important to note that notifications and alerts are not the same thing. 

An actionable notification is a notification that allows the recipient to take action. By choosing between the different response options, recipients can use responses to engage in an incident, escalate alerts, launch workflows containing automation to address the originating event, and much more.

What’s an incident?

An incident can be declared or initiated in response to an event or a combination of events that affects the confidentiality, integrity, or availability of a system or organization in a way that could impact core business processes.

That’s really just a fancy way of saying that events become incidents because of business impact. Of course, not every incident has to be a “get everyone on a conference call right now” type of problem; there are different severities of incidents. The primary distinction between an event and an incident is that the incident requires someone or something to intervene and fix it. In xMatters, incoming signals can automatically initiate incidents as part of a workflow, or incident commanders can launch an incident manually if they determine that an event or sequence of events requires intervention.

What’s intelligent routing?

Intelligent routing is often used in call centers to determine which agent is best suited to respond to a specific customer question. But intelligent routing also applies to IT alerting.

When a digital service experiences an issue and someone needs to know about it, noticing the problem and recording its details is just the beginning. The hard part is figuring out WHO needs to know about it, and HOW to get in touch with them. Systems equipped with intelligent routing, such as xMatters, use the details within the alert—such as the affected service, time of day, severity of the problem, whether customers are impacted, etc.—and compare them to team schedules, escalation patterns, availability, and even user preferences to determine the best way to contact the right person to handle the issue.
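The matching described above can be sketched as a small Python function. This is a simplified illustration, not the xMatters routing engine: the Alert and Responder classes, the two severity levels, the hour-based shifts, and the device names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alert:
    service: str
    severity: str  # "low" or "high" in this sketch
    hour: int      # hour of day (0-23) when the alert was raised


@dataclass
class Responder:
    name: str
    services: List[str]
    shift_start: int               # on-call shift start hour, inclusive
    shift_end: int                 # shift end hour, exclusive; may wrap past midnight
    available: bool
    preferred_device: str          # user preference, e.g. "email" or "sms"
    urgent_device: str = "voice"   # assumed device for high-severity alerts


def on_shift(responder: Responder, hour: int) -> bool:
    """Check the schedule, allowing for shifts that cross midnight."""
    if responder.shift_start <= responder.shift_end:
        return responder.shift_start <= hour < responder.shift_end
    return hour >= responder.shift_start or hour < responder.shift_end


def route(alert: Alert, escalation_order: List[Responder]) -> Optional[Tuple[str, str]]:
    """Return (responder name, device) for the first suitable responder, or None."""
    for responder in escalation_order:
        if (
            alert.service in responder.services
            and responder.available
            and on_shift(responder, alert.hour)
        ):
            if alert.severity == "high":
                return responder.name, responder.urgent_device
            return responder.name, responder.preferred_device
    return None


team = [
    Responder("Ana", ["payments"], 9, 17, True, "email"),
    Responder("Ben", ["payments"], 17, 9, True, "sms"),
]
night_outage = route(Alert("payments", "high", 2), team)
daytime_warning = route(Alert("payments", "low", 10), team)
```

With this hypothetical team, a high-severity payments alert at 2 a.m. goes to Ben by voice, while a low-severity alert at 10 a.m. goes to Ana by email. The list order stands in for an escalation pattern.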

Automating IT Alerting

Imagine that you’re in charge of the payment processing service that allows customers to purchase items from a website. Like any good service owner, you also have a monitoring tool that watches your services closely for any signs of trouble. One day, shortly after deploying a new version of the service to the site, your monitoring tool notices that something has changed: your payment processor’s server is down and payments aren’t being authenticated. Customers are stuck refreshing the payment form, and eventually abandon their carts without completing any transactions (the event). The monitoring tool quickly forwards all of the information about the lack of response and incomplete transactions to xMatters (the signal).

The incoming information triggers a workflow in xMatters that automatically polls other systems and tools for more details about the situation, evaluates its overall severity, and determines that someone needs to be notified about this right away. The workflow compiles all of the available information into a single source (the alert) and sends a message (the notification) to you, because you happen to be the correct on-call resource today (intelligent routing).

You receive the message and can immediately tell from the information it gives you whether this is a problem that could impact customers attempting to use the site. By choosing a specific response option, you could automatically initiate a major incident management process (the incident), start a rollback of the recently deployed version, or trigger another alert to engage additional resolvers.
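The scenario above can be traced end to end in a short Python sketch. Everything here is hypothetical and stands in for real integrations: the enricher functions, the severity rule, the response option names, and the returned strings are assumptions for illustration and do not represent xMatters workflow steps or APIs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Signal = Dict[str, str]


@dataclass
class Alert:
    service: str
    summary: str
    severity: str
    context: Dict[str, str] = field(default_factory=dict)


def build_alert(signal: Signal, enrichers: List[Callable[[Signal], Dict[str, str]]]) -> Alert:
    """Gather extra details from other tools and compile them into one alert."""
    context: Dict[str, str] = {}
    for enrich in enrichers:
        context.update(enrich(signal))
    # Assumed severity rule: customer impact makes the alert high severity.
    severity = "high" if context.get("customer_impact") == "yes" else "low"
    return Alert(signal["service"], signal["summary"], severity, context)


def format_notification(alert: Alert, response_options: List[str]) -> str:
    """Render the alert as a message with response options for a person to read."""
    lines = [f"[{alert.severity.upper()}] {alert.service}: {alert.summary}"]
    lines += [f"  {key}: {value}" for key, value in sorted(alert.context.items())]
    lines.append("Respond with: " + " | ".join(response_options))
    return "\n".join(lines)


def handle_response(choice: str, actions: Dict[str, Callable[[Alert], str]], alert: Alert) -> str:
    """Run the action tied to the response option the recipient chose."""
    return actions[choice](alert)


# The signal forwarded by the monitoring tool.
signal = {"service": "payments", "summary": "Payment authentication failing"}

# Stand-ins for lookups against other systems (hypothetical).
def deployment_lookup(sig: Signal) -> Dict[str, str]:
    return {"recent_deploy": "yes"}

def impact_lookup(sig: Signal) -> Dict[str, str]:
    return {"customer_impact": "yes"}

alert = build_alert(signal, [deployment_lookup, impact_lookup])

actions = {
    "Initiate incident": lambda a: f"incident opened for {a.service}",
    "Roll back": lambda a: f"rollback started for {a.service}",
    "Engage resolvers": lambda a: f"new alert raised for {a.service} resolvers",
}
message = format_notification(alert, list(actions))
outcome = handle_response("Roll back", actions, alert)
```

Each function maps to one term from the scenario: build_alert turns the signal into an alert, format_notification produces the notification, and handle_response carries out whichever response option the recipient selects.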

[Figure: Automated IT alerting flow chart]

By automating IT alerting, teams can minimize downtime and the high cost that comes with it.

Conclusion

When something in your system goes wrong (and sooner or later it probably will), you need to know about it, preferably before your clients do. By implementing IT alerting and event management processes, you can fix business problems before they become customer problems.

To learn how xMatters can help you correlate events and produce actionable alerts and incidents, schedule a demo today.
