How Incident Commanders Benefit from Actionable Insights

Knowing who is in charge helps teams avoid confusion about who to turn to during a crisis, allowing them to focus their efforts where needed. When the pressure is on, an incident commander needs an established response plan so that responders act quickly and coordinate efficiently. Actionable insights make that possible.

The incident commander is the primary contact and coordinator for all resolvers and resources during an incident. It’s the commander’s responsibility to plan, coordinate, communicate, and lead the incident response team throughout the lifecycle of an incident, from the initial response to the aftermath. Essentially, they’re responsible for end-to-end incident management and act as the first responder whenever a problem occurs. They must make decisions quickly and communicate them effectively. To do this, they need to access reliable information about the current situation and plan for worst-case scenarios. But to validate the issue and identify the root cause in time, they need the right tools.

This article discusses how incident commanders can extract valuable data and turn that data into actionable insights using xMatters.

Turning Information into Action

An incident commander may struggle to resolve an incident for several reasons:

  • They lack the proper context for the incident.
  • They lack a prioritization system.
  • They lack the right tools to extract these insights.

However, they must resolve incidents quickly to prevent data loss, unnecessary service interruption, and damage to the company’s reputation. To do so, they need to prioritize the right information and turn it into an action plan.

Extracting Critical Information for Root Cause Analysis

When an emergency occurs, all eyes are on the incident commander, who needs reliable intelligence to make good decisions. Whether it’s a cybersecurity breach, an unexpected data center outage, an unusual spike in latency, or another issue, this process starts with aggregating vast amounts of data.

Incident commanders then pore over the aggregated data, validate the problem, and identify the root cause of the issue. This process, known as root cause analysis (RCA), forms a critical part of the incident management process.

Incidents can occur in nonobvious ways, which is why a framework such as the Five Whys, a technique that originated at Toyota and is also used at Amazon, can be helpful. Here’s how you can contextualize a problem:

  • Identify the issue at hand.
  • Ask why it happened and record the reason.
  • Determine if the reason is the root cause. Could it have been prevented? Could it have been detected before the incident? Is it a result of human error? If yes, why was it possible?
  • Repeat the process by using the reason as the problem.
  • Stop when you’ve validated the reason as the root cause.
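The steps above can be sketched as a short loop. This is only an illustration under stated assumptions: the `five_whys` function, the `answers` mapping, and the sample incident are hypothetical and do not come from any particular tool.

```python
from typing import Dict, List


def five_whys(problem: str, answers: Dict[str, str], max_depth: int = 5) -> List[str]:
    """Follow a chain of recorded "why" answers, starting from `problem`.

    `answers` maps each problem statement to the reason recorded for it.
    A statement with no recorded reason is treated as the root cause.
    """
    chain = [problem]
    current = problem
    for _ in range(max_depth):
        reason = answers.get(current)
        if reason is None:
            break  # no deeper reason recorded: stop at the current statement
        chain.append(reason)
        current = reason  # repeat the process, using the reason as the problem
    return chain


# Hypothetical incident, invented for illustration only.
recorded = {
    "Checkout page returned errors": "The payment service timed out",
    "The payment service timed out": "Its database connection pool was exhausted",
    "Its database connection pool was exhausted": "A deploy lowered the pool size",
}
chain = five_whys("Checkout page returned errors", recorded)
root_cause = chain[-1]
```

In practice a person still has to validate each answer. The sketch only shows how each reason becomes the next problem until no further reason is recorded.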

Incident commanders can gather this data from observability tools such as Dynatrace, Grafana, or Splunk, or from customer or employee reports (service tickets) and similar sources.

However, manually analyzing the data can be tedious and time-consuming, which delays the resolution process. For this reason, the incident commander should access only the most critical information about the incident.

To avoid sifting through unnecessary data, they need tools that automatically create an event correlation report covering the core aspects of the incident. The idea is to have the correct information to create a corrective and preventive action plan (CAPA) as soon as possible.

Diagnostic Data Types for an Incident Manager

Incident commanders should be able to access various kinds of data depending on the incident type. Here are a few examples of reliable data types they should access during common issues.

Example 1: Security Breach

Unfortunately, security breaches are far too common. For example, Uber recently discovered that their systems suffered a data breach through an employee’s account. While Uber didn’t release an official statement about what was accessed, news reports suggested that the hacker was able to access internal code and messaging data.

In this case, their incident commander would have to access several data types:

  • Security intelligence
  • Incident response logs
  • Network traffic logs
  • Log analysis data
  • Endpoint logs
  • Anti-malware logs

As in Uber’s case, the security breach may involve multiple teams or departments within the organization. Therefore, incident commanders need data from each team and department to understand what’s happening on each front and communicate effectively with everyone involved in the response.

Example 2: Addressing Latency Spikes

Suppose that your client-facing website is experiencing a latency spike. In this case, the incident commander is informed first. To assess the situation, they’ll need information such as:

  • System performance metrics
  • Traffic analytics
  • Server logs
  • Data flow maps
  • Domain name service query logs
  • Security logs

Once they have access to this data, they’ll be able to audit the server and determine whether the latency is due to an application fault, a server malfunction, or anomalous user behavior. Knowing in advance which data types to pull for each kind of incident helps companies improve their response times.
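As a rough illustration of how system performance metrics can confirm a spike, the sketch below flags latency samples that sit far above a baseline. The function name, the three-standard-deviation threshold, and the sample values are assumptions made for this example, not recommendations from any monitoring product.

```python
from statistics import mean, stdev
from typing import List


def find_latency_spikes(
    samples_ms: List[float], baseline_size: int = 5, threshold: float = 3.0
) -> List[int]:
    """Return indexes of samples above baseline mean + threshold * stdev.

    The first `baseline_size` samples are assumed to represent normal behavior.
    """
    baseline = samples_ms[:baseline_size]
    limit = mean(baseline) + threshold * stdev(baseline)
    return [
        index
        for index, value in enumerate(samples_ms)
        if index >= baseline_size and value > limit
    ]


# Hypothetical response times in milliseconds.
latencies = [120, 118, 125, 122, 119, 121, 480, 510, 123]
spikes = find_latency_spikes(latencies)
```

A check like this only says when the spike happened. Server logs and traffic analytics are still needed to explain why.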

Example 3: Data Pipeline Issues

Data downtime refers to periods when data is inaccurate, missing, or stored incorrectly. It occurs when companies don’t have a streamlined and standardized data intake and storage process. It’s an increasingly common issue that usually gets flagged only when someone needs to analyze the data for a specific purpose.

When a data downtime issue is detected, incident commanders need access to the following:

  • Affected databases and tools (warehouses, business intelligence tools, and data lakes)
  • Data quality logs (from the observability tool)
  • Lineage analysis
  • Metadata
  • Server logs

According to Gartner, poor data quality costs organizations an average of $12.9 million each year. With costs on that scale, it’s vital to have automated platforms that can flag such issues and inform the commander as soon as possible.
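A very small example of such an automated check is shown below. It scans records for required fields that are missing or empty. The field names and sample rows are hypothetical, and a real data observability tool would cover far more cases (freshness, schema drift, volume changes).

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def find_missing_fields(rows: List[Row], required: List[str]) -> Dict[int, List[str]]:
    """Map each problematic row index to its absent, None, or empty required fields."""
    problems: Dict[int, List[str]] = {}
    for index, row in enumerate(rows):
        missing = [name for name in required if not row.get(name)]
        if missing:
            problems[index] = missing
    return problems


# Hypothetical order records, invented for illustration.
orders: List[Row] = [
    {"order_id": "1001", "customer": "acme", "amount": "25.00"},
    {"order_id": "1002", "customer": None, "amount": "14.50"},
    {"order_id": "1003", "customer": "globex"},
]
issues = find_missing_fields(orders, ["order_id", "customer", "amount"])
```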

The key is to collect data from all sources that matter and make sense of it in real time. This means collecting data from all your services, infrastructure, and tools—internal and external—along with user-facing systems like web servers and databases.

xMatters: Adaptive Incident Management

To resolve incidents in time, incident commanders must access the right insights at the right time. Here are a few things xMatters lets them do:

  • Reduce noise by filtering redundant or unnecessary alerts.
  • Classify the urgency of incoming information.
  • Automatically correlate events and consolidate data.
  • Receive enriched notifications that provide the context in one place.

Not every incident mandates the dreaded 3 AM call, so xMatters enables you to create filters to block irrelevant information. This means you can access only critical data and focus on fixing the issue. You can set these thresholds based on your preferences or standard operating procedures (SOPs) and decide who gets notified and when.
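The idea of a severity threshold can be sketched in a few lines. This is a generic illustration, not the xMatters API: the `Alert` class, the severity scale, and the `filter_alerts` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical severity scale; real tools define their own levels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


@dataclass
class Alert:
    source: str
    severity: str
    message: str


def filter_alerts(alerts: List[Alert], minimum: str = "high") -> List[Alert]:
    """Keep only alerts at or above the `minimum` severity."""
    floor = SEVERITY_RANK[minimum]
    return [alert for alert in alerts if SEVERITY_RANK[alert.severity] >= floor]


# Hypothetical incoming alerts.
incoming = [
    Alert("web-01", "low", "Disk usage at 61%"),
    Alert("db-01", "critical", "Primary database unreachable"),
    Alert("web-02", "medium", "Slow response on /health"),
    Alert("lb-01", "high", "Backend pool degraded"),
]
to_page = filter_alerts(incoming)
```

Raising or lowering `minimum` plays the role of the threshold an SOP would define: only alerts that clear it would reach the on-call responder.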

The application also solves the challenge of conducting a manual RCA by correlating alerts related to the same issue and creating a report accordingly. This helps you identify what’s wrong with the system and focus on creating a quick resolution plan.
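To make the idea of alert correlation concrete, here is a simplified sketch that groups alerts for the same service when they arrive close together in time. It is an assumption-laden illustration, not a description of how xMatters correlates events: the `Event` tuple layout, the five-minute window, and the sample data are invented.

```python
from typing import Dict, List, Tuple

# Hypothetical event shape: (timestamp_seconds, service, message)
Event = Tuple[int, str, str]


def correlate(events: List[Event], window_s: int = 300) -> List[List[Event]]:
    """Group events per service when each follows the previous within `window_s`."""
    groups: List[List[Event]] = []
    latest: Dict[str, List[Event]] = {}  # most recent group for each service
    for event in sorted(events):
        timestamp, service, _ = event
        group = latest.get(service)
        if group is not None and timestamp - group[-1][0] <= window_s:
            group.append(event)  # same service, close in time: same incident
        else:
            group = [event]  # otherwise start a new group
            groups.append(group)
            latest[service] = group
    return groups


sample = [
    (0, "db", "connection pool exhausted"),
    (60, "db", "slow queries"),
    (90, "web", "HTTP 502 rate rising"),
    (1000, "db", "replica lag"),
]
groups = correlate(sample)
```

Here the first two database alerts collapse into one group, so a responder would see one consolidated item instead of two separate pages.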

xMatters uses multiple features to pack all your data into concise, actionable insights, helping you resolve critical issues quickly. For example, you receive enriched notifications that go beyond the “critical alert” pop-up. They consolidate all the necessary information and provide a complete picture of how the incident occurred.

To ensure that your website is monitored consistently for potential issues, you can use the xMatters Site 24×7 integration. Site 24×7 is a tool that identifies problems and relays that information to the xMatters console, providing relevant diagnostic information that you can use to create a plan of action during an incident.

You can create customized flows to decide if you should take an automatic remediation action or if notifications need to be sent to the incident commander. Essentially, you can use this information to create action plans for current and future incidents, simplifying your workflow.
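The decision such a flow makes can be pictured as a simple lookup: if a known remediation exists for the event type, run it; otherwise notify the incident commander. The sketch below is a generic illustration with hypothetical names (`REMEDIATIONS`, `restart_service`, `handle_event`), not xMatters flow syntax.

```python
from typing import Callable, Dict


def restart_service(service: str) -> str:
    # Placeholder for a real remediation step (for example, calling an orchestration tool).
    return f"restarted {service}"


# Hypothetical mapping of event types to automatic remediation actions.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "service_down": restart_service,
}


def handle_event(event_type: str, service: str) -> str:
    """Run an automatic remediation if one is registered, else notify a human."""
    action = REMEDIATIONS.get(event_type)
    if action is not None:
        return action(service)
    return f"notified incident commander about {event_type} on {service}"


auto_result = handle_event("service_down", "checkout-api")
manual_result = handle_event("data_breach", "auth-service")
```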

Conclusion

Incident commanders and their dedicated incident management teams must make decisions under extreme pressure. To do so, they need to see the big picture, analyze the data that matters, and share what they learn with others in the command center. Incident commanders should be able to access various kinds of data, prioritize the right information, and turn it into an action plan.

However, filtering information to reduce noise is only possible with the right tools. For this purpose, xMatters offers signal intelligence, which can provide critical incident data. There is also Site 24×7, a service reliability platform that identifies vital issues and relays that information to the xMatters console. It provides relevant diagnostic information that you can use to create a plan of action during an incident.

The faster you can resolve an issue, the sooner you can free up resources for critical infrastructure and its applications. For this reason, real-time insights are vital for incident management. When you always have up-to-date data, you don’t need to worry about wasting time gathering information before making an actionable decision. The data is already there whenever you need it. And that means you can respond to an incident efficiently and effectively.

If you’re looking for a tool to help you automate your incident management workflows, request a demo with xMatters today.
