

Five Ways Data Can Help Prevent and Mitigate Major Incidents

Regardless of how hard you try, major incidents will occur. How you use major incident data will determine how much of an impact an incident has on your business, your customers, and your revenue. The last step in your major incident management (MIM) process is arguably the most important: prevention.

However, your organization can affect how often major incidents occur and how impactful they are. All the mitigating actions you take come down to collecting, sharing, and using data effectively.

How you use data has a fundamental impact on five key activities:

  1. Recognize and respond
  2. Triage and restore
  3. Resolve
  4. Collect information and review
  5. Prevent

Recognize and respond
We define a major incident as one that affects customers and their ability to get work done. There is no industry standard for identifying a major incident. In general, you’re looking for incidents that have the following characteristics:

  • Urgent: the problem will affect deadlines or your ability to meet client expectations
  • High impact: the problem will have a negative impact on business operations
  • Severe: the problem will have severe repercussions for users and customers

Read: Five Best Practices to Automate Major Incident Management

In most large organizations, the people in the NOC or at the service desk are inundated with alerts and notifications. Most of them are routine and many are redundant, so it’s easy for something more important or sinister to slip through the cracks.

So it’s important to identify which data points to focus on and to automate the collection and correlation of alerts and events from all your monitoring systems. You can use a Manager of Managers (MOM) tool like Moogsoft to help with these functions. Correlating data from your monitoring systems as early as possible can help you recognize trends before they affect customers.
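
To make the idea of correlation concrete, here is a minimal sketch in Python. It is not how Moogsoft or xMatters works; it simply groups alerts for the same service when they arrive close together in time. The `Alert` fields and the five-minute window are assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str       # monitoring tool that raised the alert (hypothetical field)
    service: str      # affected service (hypothetical field)
    timestamp: float  # seconds since some reference point


def correlate(alerts, window_seconds=300):
    """Group alerts per service; a gap longer than window_seconds starts a new group."""
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        groups = by_service[alert.service]
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)  # same burst: fold into the open group
        else:
            groups.append([alert])    # quiet gap: start a new group
    return dict(by_service)


alerts = [
    Alert("cpu-monitor", "checkout", 0.0),
    Alert("latency-monitor", "checkout", 120.0),
    Alert("cpu-monitor", "search", 130.0),
    Alert("latency-monitor", "checkout", 2000.0),
]
groups = correlate(alerts)
```

Here the first two checkout alerts collapse into one group, so the service desk would see one item to investigate instead of two separate notifications.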

By integrating your collaboration platform with your monitoring and MOM tools, you can help your front-line staff alert the service desk and incident resolution teams earlier. You can both reduce the number of major incidents and start triage more quickly.

xMatters uses deep two-way integrations based on REST API calls to move key data between systems. So when xMatters moves monitoring data into a service desk notification, it also keeps a record of that transaction for permanent storage in the service desk.

Triage and restore
Getting alerts out to resolution teams as early in the process as possible is crucial. When xMatters surveyed more than 760 IT professionals on major incidents in 2018, 41% indicated that they can engage the right people within five minutes.

Major incident resolution is extremely complex. It requires:

  • The participation of a group of experts and stakeholders
  • Data that is collected from a variety of sources
  • Communication with stakeholders that does not delay the resolution process
  • A documented procedure to coordinate all the people and steps in the process

Quick actions in the first minutes of discovery can have a big impact because every second counts. In a 2015 survey, most respondents indicated that an outage started to affect customers within 15 minutes, so have a process in place and automate as much of it as makes sense. It’s almost impossible to overstate how important these initial actions are.

That’s why the xMatters integrations with monitoring tools and service desks like ServiceNow include the ability to automatically pull key major incident data from monitoring events into the service desk and then into notifications to resolution teams. Helping teams hit the ground running can save crucial minutes during the early stages of triage.
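
As a rough illustration of pulling key data from a monitoring event into a notification, the Python sketch below copies only the fields responders need. Every name in it (`KEY_FIELDS`, `build_notification`, the event keys, the ticket ID) is hypothetical and does not reflect the xMatters or ServiceNow APIs.

```python
# Hypothetical field names; real monitoring tools and service desks differ.
KEY_FIELDS = ("service", "severity", "summary", "started_at")


def build_notification(event, ticket_id):
    """Copy the fields responders need from a monitoring event into a notification."""
    details = {name: event[name] for name in KEY_FIELDS if name in event}
    return {"ticket_id": ticket_id, **details}


event = {
    "service": "checkout",
    "severity": "critical",
    "summary": "Error rate above 5%",
    "started_at": "2018-06-01T12:00:00Z",
    "raw_payload": "...",  # noisy detail that responders do not need up front
}
notification = build_notification(event, ticket_id="INC-0001")
```

Keeping the list of key fields in one place means the same selection can be reused when more people are brought into the resolution effort.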

Major incident data can be shared again if additional people are required to join the resolution effort. Again, xMatters preserves all the steps in the system of record for post-incident review.

Meanwhile, customers and other stakeholders will ask for status updates. xMatters provides a less technical alert template for stakeholders who are not part of the resolution effort. An integration with Statuspage also provides a self-service status web page for any stakeholder.

Resolve
After the triage reveals the cause of the problem, teams can resolve the issue and restore service. Once that’s done, they must let customers and other stakeholders know as soon as possible.

When an incident manager closes an issue in xMatters, she can also automatically update Statuspage to let people know before sending notifications to them.

Once full service has been restored, organizations can begin work on the bigger project: identifying and resolving the primary cause of the incident. Often the service outage or degradation itself is just a symptom or byproduct.

Collect information and review
The resolution process includes data collection, chat conversations, phone calls, resolution activities, and testing to confirm the restoration of services. Typically, once service is restored and the post-event review has been performed, those activities are purged and any record of activities disappears.

xMatters preserves a record of activities, including entire chat transcripts, in service desks like Cherwell and in issue resolution tools like Jira. Armed with this knowledge, teams can understand what they did right (or less right) and assess their overall performance.

Even better, they can review specific data points and prepare to do better the next time by:

  • Producing or updating playbooks
  • Identifying common problem areas with previous issues
  • Keeping important data as knowledge repositories
  • Keeping a log of knowledge and best practices for future issues

Prevent
Now we complete the circle.

All this knowledge, correlated with larger trends, can help your teams not just respond better but also prevent new major incidents.

  • Document the indicators of a possible critical event, and you can help teams prevent a minor issue from becoming a major incident.
  • Automate some common thresholds into monitoring systems, and improve the accuracy of future alerts.
  • Provide effective response options in alerts, such as starting a conference call, opening a Jira issue, or starting a Slack channel.
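
One way to picture an automated threshold is the Python sketch below. It is an assumption-laden example, not a feature of any particular monitoring system: the function name, the threshold of 90, and the rule of three consecutive breaches are all illustrative. Requiring several readings in a row is one simple way to avoid alerting on a single spike.

```python
def breaches_threshold(readings, threshold, min_consecutive=3):
    """Return True if readings has min_consecutive values in a row above threshold."""
    streak = 0
    for value in readings:
        streak = streak + 1 if value > threshold else 0  # reset on any normal reading
        if streak >= min_consecutive:
            return True
    return False


sustained = breaches_threshold([50, 91, 95, 92, 60], threshold=90)
spiky = breaches_threshold([91, 50, 92, 50, 93], threshold=90)
```

In this example the sustained run of high readings triggers the rule, while the isolated spikes do not.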

Collect diagnostic and impact data from your systems to support management activities such as root cause analysis. From there, you can define the preventative actions you’ll need to take to help block the same incident from happening again.

Your incident management team shouldn’t be thinking about data collection before a major incident is resolved. It takes their attention away from their main objective, which is service restoration, and could delay getting operations back to normal.

You can avoid this problem by identifying the data you need to collect to do problem management before a major incident starts. The result will be that your team can focus on restoration, and use a checklist to collect the needed data along the way.
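
A checklist like this can be as simple as a fixed list of items that is agreed on in advance. The Python sketch below assumes a hypothetical set of items (`REQUIRED_DATA`) and reports which ones are still missing, so responders do not have to decide what to collect in the middle of an incident.

```python
# Hypothetical checklist, agreed on before any incident starts.
REQUIRED_DATA = ("timeline", "affected_services", "chat_transcript", "diagnostic_logs")


def missing_items(collected):
    """Return checklist items that have not been collected yet, in checklist order."""
    return [item for item in REQUIRED_DATA if not collected.get(item)]


collected = {"timeline": ["12:00 alert raised"], "chat_transcript": "..."}
still_needed = missing_items(collected)
```

Running the check after service is restored shows at a glance what the problem management team still needs before the post-incident review.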

When you automate data collection, you’ll be able to restore service faster by eliminating manual tasks. You’ll also be able to store major incident data online and use it to plan for prevention. With the xMatters toolchain collaboration platform, you can relay data between systems and engage the right people to resolve incidents.
