How to build an escalation policy for effective incident management
Regardless of your organization’s size, industry, or security measures, you will inevitably face IT incidents.
But what do you do if an incident affects a critical system and your on-call responders can’t resolve it? Does your team have a set of clearly outlined next steps they should take to handle the issue?
Answering these questions can be complicated, even more so for large organizations that rely on cloud-based services to fuel their IT environment. Organizations of every size should be ready to handle incidents systematically and reduce their impacts on operations by specifying clear, auditable procedures for their teams to follow.
An escalation policy is a written procedure that tells team members how and when to escalate an incident. It outlines the upward flow of alerts and responsibility within your organization and ensures the necessary parties are brought on board at the appropriate time in an incident’s lifecycle.
Let’s discuss the concept of an escalation policy and examine how to build an effective policy to support your organization’s incident response plan.
Building an escalation policy
The basic structure of an escalation policy is as follows: when an incident occurs, inform the first on-call responder; if the responder doesn’t acknowledge the alert within a certain number of minutes, escalate to the second on-call.
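The basic structure above can be sketched in a few lines of Python. This is only an illustration: the function name, the responder labels, and the single shared timeout are assumptions made for the example, not part of any particular tool.

```python
def responder_to_notify(on_call_chain, minutes_since_alert, ack_timeout_minutes):
    """Return who should hold the alert, assuming nobody has acknowledged it yet.

    All names here are hypothetical; a real tool tracks acknowledgements itself.
    """
    if not on_call_chain:
        raise ValueError("on_call_chain must contain at least one responder")
    if ack_timeout_minutes <= 0:
        raise ValueError("ack_timeout_minutes must be positive")
    # Each full timeout that passes without an acknowledgement moves the
    # alert one level up the chain.
    level = int(minutes_since_alert // ack_timeout_minutes)
    # Once the chain is exhausted, the alert stays with the last responder.
    level = min(level, len(on_call_chain) - 1)
    return on_call_chain[level]
```

With a chain of `["primary", "secondary"]` and a 10-minute timeout, the alert stays with the primary for the first 10 minutes and then moves to the secondary.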
However, an escalation policy is not limited to specifying who to notify. It also includes guidelines for escalations within the incident response process, such as which other responders or teams to contact if a single responder can’t resolve an incident.
Defining the escalation process in detail involves answering several important questions:
- When should an incident be escalated? Provide clear criteria for when an incident should move to the next level. Also specify whether the steps differ outside regular business hours.
- Who should be notified? During an incident, IT team members may need to contact staff outside their department for support. Define a comprehensive list of resolvers for your incident management solution to notify. The policy should also name stakeholders who need to be informed of the issue even if they are not directly resolving it. For example, if a ransomware attack on core business functions degrades the customer experience, senior management and the communications team need to know.
- How should an escalation be communicated? Some incidents may be minor enough that the on-call features built into your incident management tool can handle the chain of communications. Others may require a team member to make a phone call or send a ChatOps message to certain stakeholders. Make sure the necessary contact information is readily accessible, so your team members don’t waste time searching for it.
- What should happen during the escalation process? Outline every step in detail. If your organization uses automated incident response management tools, make sure team members understand the workflows. Your policy should also outline what to do if the next level in the escalation chain is not available.
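One way to make the answers to these questions auditable is to record them as data. The sketch below is a hypothetical structure, not a real product schema: the class names, the 09:00 to 17:00 weekday business hours, and the channel labels are all assumptions. It covers two of the points above: steps that differ outside business hours, and what to do when the next level is unavailable.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import Optional


@dataclass
class EscalationLevel:
    name: str
    contacts: list[str]
    channel: str  # how to reach this level, e.g. "page", "phone", "chat"


@dataclass
class EscalationPolicy:
    business_hours_levels: list[EscalationLevel]
    after_hours_levels: list[EscalationLevel]
    # People who must be informed even though they do not resolve the incident.
    stakeholders: list[str] = field(default_factory=list)
    business_start: time = time(9, 0)   # assumed business hours
    business_end: time = time(17, 0)

    def levels_for(self, when: datetime) -> list[EscalationLevel]:
        """Pick the escalation chain that applies at the given moment."""
        is_weekday = when.weekday() < 5
        in_hours = self.business_start <= when.time() < self.business_end
        if is_weekday and in_hours:
            return self.business_hours_levels
        return self.after_hours_levels

    def next_available(self, when: datetime, current_index: int,
                       available: set[str]) -> Optional[EscalationLevel]:
        """Return the first higher level with a reachable contact, else None."""
        for level in self.levels_for(when)[current_index + 1:]:
            if any(contact in available for contact in level.contacts):
                return level
        return None
```

Skipping unreachable levels is only one possible answer to "what if the next level is not available"; a policy could just as reasonably page a fixed fallback contact instead.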
Each organization tailors its escalation policies to meet customers’ needs and accommodate its size, management structure, and the impact of its systems on overall business functions. The exact process also depends on the severity of the incident, which is commonly determined by the number of affected users or systems and the effect on usability.
Organizations typically use a scale with three or five severity levels to classify incidents. Each level requires a different response. A three-level system, for example, may classify incidents from SEV 3 to SEV 1, in increasing order of significance.
SEV 3: A not-so-major incident that causes errors, excessive load, or minor problems for customers in a production environment.
For example, you may have a simple escalation policy to handle a configuration error causing latency when users connect to your service. User form submissions trigger a low-priority workflow in your platform, which notifies an IT admin and creates a low-priority support ticket to be addressed the following business day. If the admin fails to resolve the issue within that day, the problem is escalated to another on-call resolver.
SEV 2: A severe problem affecting a limited number of users in a production environment, degrading the customer experience.
For example, you may have an escalation policy for website outages affecting some users in isolated geographic areas. Your platform should collect information from an IT team member. These details are then used to create a Zendesk ticket and a linked Jira ticket. Your incident response team members and the IT director are invited to a new Slack channel, and the IT director opens a channel with customer support representatives. If the incident remains unresolved after 20 minutes, the IT director seeks vendor assistance.
SEV 1: A critical problem affecting a significant number of users in a production environment. The issue impacts essential services, or the service is inaccessible, degrading the customer experience.
SEV 1 incidents require thoughtful staff intervention early in the process to determine when stakeholders should be notified.
If services are inaccessible, the policy should specify an immediate escalation to the highest-ranking on-call resource and alerts to crucial stakeholders. Workflows for a major incident include steps to collect and post incident details, ticket IDs, and conference call information in accessible channels with failovers. Because of how critical this process is, it’s prudent to automate and regularly test the workflow and infrastructure designated in a SEV 1 escalation policy.
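A three-level scheme like the one described here can be expressed as a small classification function plus a lookup of first responses. The 1% and 25% thresholds below are purely hypothetical placeholders, since every organization sets its own cut-offs, and the response strings simply paraphrase the examples above.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # most significant
    SEV2 = 2
    SEV3 = 3  # least significant


def classify(affected_users: int, total_users: int, service_accessible: bool) -> Severity:
    """Classify an incident by user impact. Thresholds are illustrative only."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = affected_users / total_users
    # An inaccessible service or a large share of affected users is critical.
    if not service_accessible or share >= 0.25:
        return Severity.SEV1
    # A limited but noticeable group of affected users is severe.
    if share >= 0.01:
        return Severity.SEV2
    return Severity.SEV3


# First response per level, paraphrasing the examples in this section.
FIRST_RESPONSE = {
    Severity.SEV3: "notify an IT admin and open a low-priority ticket",
    Severity.SEV2: "open linked tickets and an incident chat channel with the IT director",
    Severity.SEV1: "escalate to the highest-ranking on-call and alert stakeholders",
}
```

For instance, `FIRST_RESPONSE[classify(500, 10_000, True)]` looks up the SEV 2 response, because 5% of users falls between the two assumed thresholds.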
How xMatters escalates incidents
Regardless of the complexity of your technology stack, it can be a demanding task to ensure your pipeline integrations are reliable enough to handle any incident. Organizations using a service reliability platform like xMatters have access to best-in-class tools to ensure their teams are working at their best. The platform provides features to automate escalation management and simplify on-call scheduling.
In xMatters, you can use groups to organize people with shared skills or responsibilities. Groups may be simple collections of members or they can have complex shift schedules, escalation timelines, and rotations that determine how and when each person should be notified.
Building and maintaining well-structured groups is crucial for incident management and for making your escalation policy work. Within a group, supervisors can set how members are ordered and prioritized for notifications, as well as how much time can pass after an alert has been sent before it is escalated to the next on-call resource. Scheduling abilities also ensure that not every member is on call every hour of the day, and make planning around PTO simple.
Shifts and rotations
In xMatters, you can create on-call schedules for groups to determine who’s on call and responsible for responding to alerts. By creating a shift schedule, a supervisor can configure shift rotations and escalation timelines to ensure that only on-call staff receive notifications and that active shift members are notified only when they are needed.
For example, if your groups are configured to handle incidents on holidays, you can create shifts that occur only during holiday times.
By default, xMatters notifies on-call group members in order until it reaches an escalation delay, so members closer to the beginning of an escalation timeline will be notified more frequently. In shifts with fewer than 200 members, supervisors can distribute notifications more evenly among group members by configuring escalation delays in conjunction with rotations.
Rotations change the order of members in the timeline. Supervisors can set members to be rotated within the shift in one of two directions after each alert, after a specific number of shifts, or according to a calendar time frame.
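To make the idea of a rotation concrete, here is a minimal sketch that reorders a timeline after each alert. It illustrates the concept only and is not how xMatters implements rotations; the function name and its parameters are assumptions.

```python
from collections import deque


def rotate_after_alert(members, steps=1, forward=True):
    """Return a new member order for the next alert (hypothetical helper).

    forward=True moves the current first responder(s) to the end of the
    timeline; forward=False moves the last responder(s) to the front.
    """
    timeline = deque(members)
    # deque.rotate(-n) shifts items left, so the first n members go to the end.
    timeline.rotate(-steps if forward else steps)
    return list(timeline)
```

Starting from `["ana", "ben", "chi"]`, one forward rotation gives `["ben", "chi", "ana"]`, so a different person is first in line for the next alert.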
Escalation delays give responders time to respond to an alert before the remaining members are notified.
For example, a supervisor can set a delay of five minutes between the first and second on-call responder, which provides them a reasonable window to acknowledge the alert. If the alert hasn’t been acknowledged during the delay period, the system will notify the next contact in the escalation timeline.
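The effect of escalation delays on a timeline can be worked out with simple arithmetic. The following sketch is an illustration with hypothetical function names, not xMatters code: it turns a list of per-step delays into the minute at which each member would be notified, then shows who is reached before an acknowledgement arrives.

```python
from itertools import accumulate


def notification_offsets(members, delays_minutes):
    """Pair each member with the minute (after the alert) they are notified.

    delays_minutes[i] is the delay between member i and member i + 1.
    """
    if len(delays_minutes) != len(members) - 1:
        raise ValueError("need exactly one delay between each pair of members")
    # The first member is notified at minute 0; later offsets are running sums.
    offsets = [0, *accumulate(delays_minutes)]
    return list(zip(members, offsets))


def notified_before_ack(members, delays_minutes, ack_after_minutes=None):
    """List members notified before the alert was acknowledged.

    ack_after_minutes=None means the alert was never acknowledged. An
    acknowledgement arriving exactly at a deadline counts as in time.
    """
    return [
        member
        for member, offset in notification_offsets(members, delays_minutes)
        if ack_after_minutes is None or offset < ack_after_minutes
    ]
```

With the five-minute delay from the example above, an acknowledgement after three minutes means only the first responder is ever notified, while one after seven minutes means the second responder has been notified as well.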
A supervisor can track escalations by assigning one of three escalation types: None, Peer, or Management. These categories are metadata tags used for reporting and don’t directly affect how an escalation functions, but they make it easier to track how issues are escalated through your plan.
Having an escalation policy helps to ensure that critical events are appropriately addressed by support staff and that the right teams and management are notified when an incident occurs. This gives your staff a clear incident response process and makes it much simpler to document and audit your incident response.