A Guide to Incident Severity Levels
Maintaining IT infrastructure is a constant challenge for system administrators, site reliability engineers (SREs), supporting developers, and technicians. An endless variety of factors can degrade system performance, cause outages, or harm customer experience.
On top of that, not all incidents are equal. The impact and severity of an outage affecting 10 percent of your users differ from those of an outage affecting 90 percent. One way to facilitate an efficient response is to use a transparent system of incident severity levels that teams can reference easily. This helps minimize incident response time and makes it easier to coordinate remediation across the response team.
Putting this severity level system in place ahead of time helps teams quickly understand how much urgency a situation requires and enables effective prioritization. Let’s explore how to define your severity levels and examine some popular systems. Then, we’ll discuss how your organization can put a strategy in place that works best for you so your team feels empowered to react quickly and appropriately when incidents strike.
Defining Your Severity Levels
Your team needs to find and apply a common language to communicate efficiently. Whatever you pick, ensure your whole team understands the chosen language and the reasoning behind it, so that everyone can grasp each incident at a high level.
Classifying your incident’s severity level helps ensure a consistent response and prevents confusion about how to proceed. For instance, severe incidents may require an all-hands-on-deck response and contacting team members on a holiday.
You can employ various strategies to come up with a common language. What works best for one organization may not be ideal for another. The right language for your incident response team depends on factors such as your organization’s size, the nature and frequency of incidents, and your team’s composition. One strategy is to reference the severity levels from another team or company. Pick a comparable source, such as a company in the same industry. Even if you can’t adopt another company’s exact system, their levels can still form the basis for yours.
Brainstorming helps too. Your team may identify various severity levels more easily through a spider diagram or mind map by looking at the resulting clusters. The advantage of such a system is that it arises from a collective effort, and it already implies a decision tree to classify incidents.
Most importantly, ensure everyone understands the wording. You can run some hypothetical incident response scenarios by team members as a stress test. If their severity levels align, you have successfully established the common language. If not, the severity levels need further refinement.
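One way to make this stress test concrete is to record each team member’s answer for every scenario and flag the scenarios where the answers diverge. The sketch below is only an illustration: the function names, scenario descriptions, team member names, and the full-agreement threshold are all assumptions, not part of any particular tool.

```python
from collections import Counter


def agreement(ratings: dict[str, str]) -> float:
    """Return the share of team members who chose the most common level."""
    counts = Counter(ratings.values())
    top_count = counts.most_common(1)[0][1]
    return top_count / len(ratings)


def scenarios_needing_refinement(
    results: dict[str, dict[str, str]], threshold: float = 1.0
) -> list[str]:
    """List scenarios where agreement falls below the threshold.

    The default threshold of 1.0 (everyone agrees) is an assumption;
    a team might reasonably accept a lower value.
    """
    return [
        scenario
        for scenario, ratings in results.items()
        if agreement(ratings) < threshold
    ]


# Hypothetical answers from three team members for two scenarios.
results = {
    "checkout returns errors for all users": {
        "ana": "SEV 1", "ben": "SEV 1", "chi": "SEV 1",
    },
    "dashboard chart is misaligned": {
        "ana": "SEV 5", "ben": "SEV 4", "chi": "SEV 5",
    },
}

print(scenarios_needing_refinement(results))
# ['dashboard chart is misaligned']
```

Scenarios that show up in the output are the ones whose level definitions probably need another round of discussion.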
Organizations typically adopt three or five severity levels, following patterns like the ones below. A five-level system might look like this:
| Level | Description |
| --- | --- |
| SEV 1 | A critical problem affecting a significant number of users in a production environment. The issue impacts essential services, or the service is inaccessible, degrading customer experience. |
| SEV 2 | A severe problem affecting a limited number of users in a production environment, degrading customer experience. |
| SEV 3 | A not-so-major incident that causes errors, excessive load, or minor problems for customers in a production environment. |
| SEV 4 | A relatively minor problem that affects customer experience, but that doesn’t substantially degrade service functionality. |
| SEV 5 | A low-level problem that causes minor errors (such as formatting or display problems) that don’t degrade usability. |
A three-level system could look like this:
| Level | Description |
| --- | --- |
| P1 | A significant incident that has a broad impact. You should repair the problem as soon as possible to minimize downtime costs, keep customers happy, and maintain your company’s good reputation. |
| P2 | A medium-level incident that may not directly cause lost revenue, but that may escalate if you don’t act swiftly. |
| P3 | A low-level incident that has almost no chance of reducing revenue. Customer experience may be degraded, but not enough to make them switch to a competitor. |
In both ranking systems, we see that the impact on the business mainly determines the ranking, though they use different language—severity versus priority—to label levels. The critical question is: When an incident in that category appears, how much does it impact the company’s customer base, revenue, reputation, and other considerations?
Examine whether the repercussions are confined to the incident itself or whether the impact will likely persist past its resolution. For example, will customers lose trust in the company and take their business elsewhere? Severity often boils down to whether the impact is permanent and irreversible or largely limited to the duration of the incident.
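The three-level system above can be read as a small decision function over business impact. The following is a minimal sketch under stated assumptions: the `Impact` fields, the `priority` function, and the rule that lasting damage pushes an incident to P1 are illustrative choices, not a standard. Your team’s own definitions should replace them.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    """Hypothetical yes/no answers a responder gives when triaging."""
    broad_impact: bool    # many customers or essential services affected
    may_escalate: bool    # likely to get worse without swift action
    lasting_damage: bool  # e.g. lost customer trust after resolution


def priority(impact: Impact) -> str:
    """Map business impact to P1, P2, or P3 (illustrative rules only)."""
    # Assumption: impact that outlives the incident is treated like
    # broad impact, since severity often hinges on permanence.
    if impact.broad_impact or impact.lasting_damage:
        return "P1"
    if impact.may_escalate:
        return "P2"
    return "P3"


print(priority(Impact(broad_impact=False, may_escalate=True, lasting_damage=False)))
# P2
```

Writing the rules down this explicitly, even if nobody ever runs the code, tends to surface disagreements about edge cases early.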
Getting Into the Details
The SEV and Priority structures rank more impactful incidents with a lower number. This order is pure convention, and your team may reverse it. Or, you may even want to start at zero.
Using a prefix like “SEV” can help communicate the incident without much explanation. If somebody sends an email saying, “We’ve encountered a SEV 1 incident,” it immediately communicates the context. In contrast, writing, “We’ve encountered a 1” leads to confusion, while “we’ve encountered an incident in our IT infrastructure categorized as level 1” is verbose and even a bit foggy.
While some systems are simply numbered (“0” or “1”) or follow a naming scheme (“SEV 1” or “P1”), other systems rely on words such as “critical” or “high.” The specific terminology isn’t so important, as long as everyone within the company uses the same terms. It’s much more important that everyone shares the same definition of which incidents a given level covers.
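If your tooling needs to represent these levels, it can help to encode the ordering convention and the prefix in one place so that nobody has to remember whether a lower number means more or less severe. The sketch below assumes the five-level SEV scheme with lower numbers being more impactful. The `Severity` class, its `label` property, and the `announce` helper are hypothetical names used for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scheme; lower number means more impactful (a convention)."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5

    @property
    def label(self) -> str:
        """Human-readable name with the prefix, e.g. 'SEV 1'."""
        return f"SEV {self.value}"

    def is_more_severe_than(self, other: "Severity") -> bool:
        # Hides the numeric convention behind a named comparison, so a
        # team that reverses the order only has to change this method.
        return self.value < other.value


def announce(severity: Severity, summary: str) -> str:
    """Build a short, unambiguous notification message."""
    return f"We've encountered a {severity.label} incident: {summary}"


print(announce(Severity.SEV1, "checkout is unavailable"))
# We've encountered a SEV 1 incident: checkout is unavailable
```

A team that prefers words such as “critical” or “high” could change only the `label` property and keep the rest unchanged.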
Individual teams may be tempted to define and implement these severity levels according to their own needs. However, all stakeholders should come together to agree on the terms’ definitions. Then, when an incident occurs, everyone from the responders to management understands what is at stake and how to respond.
Your first step toward ensuring effective incident responses is to properly define and implement standardized incident severity levels. Listen to other experts and check out best practices, but be practical when creating severity levels that work best for your team.
Along with using well-defined severity levels, automation can work in the background to improve your incident response’s efficiency. From proactively managing issues to alerting on-call resources to respond to a SEV 1, service reliability platforms like xMatters help teams automate incident response to resolve issues as quickly as possible. Explore xMatters today to see how it can help maintain your infrastructure’s reliability!