What It Means to Be an Incident Commander

Leadership is essential in an organization. Establishing a leadership hierarchy helps teams avoid getting confused about who to turn to with questions and concerns, allowing them to focus their efforts where needed. High-quality leadership is vital to success but becomes even more important when the pressure to resolve an issue with minimal downtime is turned up.

When the pressure is on, an incident response plan should be established alongside an incident commander to ensure responders act quickly and coordinate efficiently. The incident commander is the primary contact and coordinator for all resolvers and resources during an incident. It’s the commander’s responsibility to plan, coordinate, communicate, and lead the incident response team throughout the lifecycle of an incident, from the initial response through to post-mortem.

The Role and Responsibilities of an Incident Commander

An incident commander, sometimes known as an incident manager, is usually an IT or a DevOps team member who’s responsible for overseeing incident response. There may be multiple people who fill that role depending on the nature of the incident, and extremely long incidents can have multiple commanders working in shifts. What’s important is that there is always at least one person tasked with this responsibility.

The duties of an incident commander typically include the following:

  • Incident preparation
  • Decision-making
  • Delegation
  • Oversight
  • Team alignment
  • Escalation and resource management
  • Planning
  • Post-mortems

To better understand how an incident commander coordinates with their team to implement an incident response plan, let’s explore a hypothetical incident from the perspective of an incident commander.

Before an incident occurs, the incident commander’s role is to lead a team that defines and develops an incident response strategy. This typically includes establishing communication channels, defining escalation policies, creating and organizing runbooks, and briefing teams on incident response plans.

Take a hypothetical organization whose primary product is a customer-facing web application. One evening, network performance monitoring tools detect a huge spike in latency. An automated alert warns the on-call incident commander that there is a possible problem affecting the web application’s performance. Their first responsibility is to assess the situation. After cross-checking various monitoring metrics and testing the web application, the incident commander determines the problem is real, and that the response team needs to be alerted and work towards finding a solution.
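The cross-checking step in this scenario can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the signal names, the thresholds, and the "at least two independent signals" rule are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """One monitoring reading compared against its alert threshold."""
    name: str
    value: float
    threshold: float

    def breached(self) -> bool:
        return self.value > self.threshold


def confirm_incident(signals: list[Signal], min_breaches: int = 2) -> bool:
    """Treat the alert as real only if enough independent signals agree.

    The min_breaches rule is an assumption made for this example.
    """
    breaches = [s for s in signals if s.breached()]
    return len(breaches) >= min_breaches


# Hypothetical readings gathered after the latency alert fires.
signals = [
    Signal("p95_latency_ms", 2400.0, 800.0),
    Signal("error_rate_pct", 6.5, 1.0),
    Signal("synthetic_check_failures", 0.0, 2.0),
]

if confirm_incident(signals):
    print("Incident confirmed: alert the response team")
else:
    print("Not confirmed: keep monitoring")
```

Requiring agreement between signals is one way to reduce false alarms, but in practice the commander's judgment, such as manually testing the application, still decides whether to declare an incident.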

After the commander confirms the validity of the incident, they must develop an action plan. In this case, the incident commander recognizes it’s likely a network issue hampering the application and immediately contacts network engineers to diagnose the issue further. While they get to work, the incident commander loops in other relevant stakeholders and coordinates any additional resources the resolvers need.

The incident commander acts as the primary contact and coordinator for all resolvers and resources, informing necessary teams and stakeholders about the incident. The commander makes themselves available to answer any questions, delegates responsibilities to the incident responders, and updates the incident response strategy as the event unfolds.

Overseeing the Incident Response

The incident commander plays a pivotal role throughout the entire incident response cycle. After coordinating the initial reaction, the commander monitors communication channels and oversees the team until the issue is resolved. They take input from various teams as those teams diagnose the issue and propose strategies for remediation, and they discuss the issue with those responsible for implementing the resolution and with relevant stakeholders who may be affected by the incident.

During the incident, the commander takes an active role in determining the best strategy for remediation. This might involve looking into the history of past incidents to find a potential resolution, and evaluating the strengths and weaknesses of proposed resolution strategies. To help track progress and provide a record of the incident as it unfolds, the commander often appoints a “scribe” who will note each important event within the incident communication channels.
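A scribe's record can be as simple as a timestamped list of notes. The Python sketch below shows one possible shape for such a log. The class names, fields, and sample entries are hypothetical, chosen for illustration rather than taken from any specific incident management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class IncidentLog:
    """Minimal incident timeline, as a scribe might keep it."""
    title: str
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, author: str, note: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        # Default to the current UTC time when no timestamp is supplied.
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        # Sort by time so late-entered notes still appear in order.
        lines = [self.title]
        for e in sorted(self.entries, key=lambda e: e.at):
            lines.append(f"{e.at:%H:%M} UTC  {e.author}: {e.note}")
        return "\n".join(lines)


log = IncidentLog("Web application latency spike")
log.record("scribe", "Network engineers engaged",
           at=datetime(2024, 1, 1, 21, 12, tzinfo=timezone.utc))
log.record("scribe", "Commander confirmed incident",
           at=datetime(2024, 1, 1, 21, 5, tzinfo=timezone.utc))
print(log.render())
```

Keeping the entries structured rather than as free text makes it easier to reuse the timeline later when the post-mortem report is written.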

If the initial responders can’t diagnose the problem or don’t have the resources required to implement a fix, the incident commander will determine the best way to escalate the response. Should the resolvers fail to pinpoint the problem in the network, they may request support from more senior engineers. Or, if they determine the issue isn’t in the network but originates from a particular microservice, they may enlist the developers responsible for that specific service.

In either case, the incident commander’s role is to determine whether the escalation is necessary and who has the appropriate skills to provide the required tools and knowledge. The commander is also tasked with organizing relevant information for any incoming responders to bring them up to speed.
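Escalation policies like the ones described here are often written down ahead of time so the commander has a starting point. The following Python sketch models a simple time-based policy. The team names and time limits are hypothetical, and a real policy would also account for the commander's judgment about whether escalation is actually needed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    team: str
    after_minutes: int  # engage this team once the incident has been open this long


# Hypothetical policy for the network latency scenario.
POLICY = [
    EscalationStep("network-on-call", 0),
    EscalationStep("senior-network-engineers", 30),
    EscalationStep("service-owners", 60),
]


def teams_to_engage(policy: list[EscalationStep],
                    minutes_unresolved: int) -> list[str]:
    """Return every team that should be involved at this point in the incident."""
    return [s.team for s in policy if minutes_unresolved >= s.after_minutes]


print(teams_to_engage(POLICY, 45))
```

A policy like this only suggests who to bring in and when. The commander can still escalate earlier, for example by enlisting a service's developers as soon as the evidence points to that microservice.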

Some situations may require extensive communication between different teams and stakeholders. Depending on the resources available to the incident commander, they may appoint a communications coordinator to serve as a liaison. This frees the commander to take on a more strategic role in assessing the problem and the proposed solutions, while the communications coordinator ensures responders stay in the loop.

Resolution and Beyond

After the team has decided on an appropriate course of action, the incident commander must approve the strategy and ensure the team has the appropriate skills and resources to implement it. They need to remain in close contact with the various teams resolving the problem. If attempts fail to resolve the problem, the incident commander must reassess the situation and determine the next course of action by asking questions like:

  • Why did the solution fail?
  • Was the problem misdiagnosed?
  • Did the new solution introduce problems of its own that propagated the issue?
  • Does the incident response team have the appropriate skills and expertise to implement the changes?

Incident commanders need to be flexible and respond quickly to changing circumstances. While some incidents may have obvious solutions, severe incidents can introduce confusion and may have complex root causes that aren’t easy to diagnose or address. In some situations, temporary remediation steps might have to be taken to bring the service back to normal operating conditions while mapping out a long-term plan to address the underlying issue.

Even after an incident has been resolved, there is still work to be done. In the immediate aftermath, the incident commander is responsible for documenting the incident and collecting details on the root cause and how it was remediated. This information is then organized into a post-mortem report, which can be used to identify opportunities for improvement and serves as a point of reference should the issue recur.

What Skills Does an Incident Commander Need?

Wearing the title of incident commander demands a strong skillset, including:

  • Strong communication skills
  • The ability to work under pressure
  • Interpersonal skills to help teams work effectively in stressful situations
  • Tactical thinking and the ability to quickly strategize
  • The ability to assess complex issues
  • Organizational skills for delegating responsibilities and making the best use of available resources
  • Rapid problem solving and confidence in determining a direction for team members

Giving Incident Commanders the Tools They Need

There’s no substitute for a capable, skilled incident commander. However, there are a variety of automated tools that can ease the burden of alerting, communicating, and coordinating during an incident. Automation can help with tasks such as collecting data for post-mortems, actively monitoring systems for anomalies, and even executing mitigation procedures such as deployment rollbacks.

Lighten your incident commander’s load with the help of a powerful service reliability platform. xMatters provides a rich suite of automated incident response tools that give your incident commanders and their response teams the tools they need to tackle any incident, big or small.
