A Guide to Incident Severity Levels
Maintaining IT infrastructure is a consistent challenge for system administrators, site reliability engineers (SREs), supporting developers, and technicians. Many factors can degrade system performance, cause outages, or harm the customer experience.
On top of that, not all incidents are created equal. The impact and severity of an outage affecting 10% of your users differ from those of an outage affecting 90%. One way to facilitate an efficient response is to use a transparent system of incident severity levels that teams can reference easily. Such a system helps minimize incident response time and makes it easier to coordinate remediation across the response team.
Putting this severity level system in place ahead of time helps teams quickly understand how much urgency a situation requires and prioritize effectively. Let’s explore how to define your incident severity levels and examine some popular systems for doing so. Then, we’ll discuss how your organization can put a strategy in place that works best, so your team feels empowered to react quickly and appropriately when incidents strike.
Defining Your Severity Levels
Your team needs to find and apply a common language to communicate efficiently. Whatever you pick, you should ensure your team understands the chosen language and the reasoning behind it, allowing them to comprehend each incident on a higher level.
Classifying your incident’s severity level helps ensure a consistent response and prevents confusion about how to proceed. For instance, severe incidents may demand an all-hands-on-deck response that requires contacting team members on a holiday.
Ultimately, you can employ various strategies to come up with a common language. What works best for one organization may not be ideal for another. The right language for your incident response team depends on factors such as your organization’s size, the nature and frequency of incidents, and your team’s composition. One strategy is to reference the severity levels from another team or company. You should pick a similar source—for example, a company in the same industry. Even if you may not benefit from using another company’s exact system, their levels can still form the basis for yours.
Brainstorming helps too. Your team may find it easier to identify various severity levels through a spider diagram or mind map by looking at the resulting clusters. The advantage of such a system is that it arises from a collective effort and implies a decision tree to classify incidents.
Most importantly, ensure everyone understands the wording. You can run some hypothetical incident response scenarios by team members as a stress test. If their severity levels align, you have successfully established the common language. If not, the severity levels need further refinement.
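The stress test described above can be tallied with a few lines of code. The following is only a minimal sketch: the scenario texts, responder names, and numeric levels are hypothetical placeholders, and a real exercise would use your own scenarios and level definitions.

```python
from collections import Counter

def find_disagreements(ratings):
    """Return the scenarios on which responders did not all pick the same level.

    `ratings` maps scenario -> {responder: level}. The result maps each
    non-unanimous scenario to a Counter of the levels that were chosen.
    """
    disagreements = {}
    for scenario, by_responder in ratings.items():
        counts = Counter(by_responder.values())
        if len(counts) > 1:  # more than one distinct level was chosen
            disagreements[scenario] = counts
    return disagreements

# Hypothetical answers collected from three team members.
ratings = {
    "checkout page returns errors for all users": {"alice": 1, "bob": 1, "carol": 1},
    "dashboard chart renders with wrong colors": {"alice": 3, "bob": 2, "carol": 3},
}

result = find_disagreements(ratings)
for scenario, counts in result.items():
    print(f"Needs refinement: {scenario} -> {dict(counts)}")
```

An empty result suggests the team shares a common language. Any scenario that does appear points to a level definition worth discussing and tightening.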
Typically, organizations adopt three or five severity levels. These usually follow a pattern like the ones below. A five-level system typically looks something like this:
| Level | Description |
|---|---|
| SEV 1 | A critical problem affecting a significant number of users in a production environment. The issue impacts essential services or renders the service inaccessible, degrading the customer experience. |
| SEV 2 | A severe problem affecting a limited number of users in a production environment, degrading the customer experience. |
| SEV 3 | A moderate incident that causes errors, excessive load, or minor problems for customers in a production environment. |
| SEV 4 | A relatively minor problem that affects customer experience without substantially degrading service functionality. |
| SEV 5 | A low-level problem that causes minor errors (such as formatting or display problems) without degrading usability. |
A three-level system could look like this:
| Level | Description |
|---|---|
| P1 | A significant incident that has a broad impact. You should repair the problem as soon as possible to minimize downtime costs, keep customers happy, and maintain your company’s good reputation. |
| P2 | A medium-level incident that may not directly cause lost revenue but may escalate without swift action. |
| P3 | A low-level incident that has almost no chance of reducing revenue. Customer experience may be degraded, but not enough to make them switch to a competitor. |
In both ranking systems, we see that the impact on the business mainly determines the ranking, though they use different language—severity versus priority—to label levels. The critical question is: When an incident in that category appears, how much does it impact the company’s customer base, revenue, reputation, and other considerations?
Examine whether the repercussions are limited to the incident itself or whether the impact will likely persist past the incident’s resolution. For example, will customers lose trust in the company and take their business elsewhere? Severity often boils down to whether the impact is permanent and irreversible or largely confined to the duration of the incident.
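To make these questions concrete, here is one way a team might encode them as a first-pass classifier for the three-level scheme. Treat it as a sketch under stated assumptions: the function name, its parameters, and especially the 50% and 10% thresholds are hypothetical and would need to be agreed on by your own stakeholders.

```python
def classify_priority(fraction_users_affected: float,
                      may_escalate: bool,
                      lasting_impact: bool) -> str:
    """Suggest a P1/P2/P3 label from a few impact questions.

    All thresholds below are illustrative assumptions, not an industry standard.
    """
    if not 0.0 <= fraction_users_affected <= 1.0:
        raise ValueError("fraction_users_affected must be between 0 and 1")

    # Broad impact, or damage that outlives the incident, is treated as P1.
    if fraction_users_affected >= 0.5 or lasting_impact:
        return "P1"
    # Limited impact that could grow without swift action is treated as P2.
    if may_escalate or fraction_users_affected >= 0.1:
        return "P2"
    # Everything else is a low-level incident.
    return "P3"

print(classify_priority(0.9, may_escalate=False, lasting_impact=False))
print(classify_priority(0.02, may_escalate=True, lasting_impact=False))
print(classify_priority(0.02, may_escalate=False, lasting_impact=False))
```

A function like this is best used as a starting suggestion that a human responder can override, since business impact rarely reduces to three inputs.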
Getting Into the Details
Both the SEV and priority structures assign lower numbers to more impactful incidents. This ordering is purely a convention: your team may reverse it or even start numbering at zero.
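Because the numbering direction is only a convention, tooling is easier to maintain when that convention lives in one place. The sketch below assumes the five-level, lower-is-worse scheme; the `Severity` enum and the helper names are hypothetical.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Five-level scheme in which a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5

def is_more_severe(a: Severity, b: Severity) -> bool:
    # The "lower number is worse" convention is encoded only here and in most_severe().
    return a < b

def most_severe(levels):
    # With lower-is-worse numbering, the minimum value is the worst incident.
    return min(levels)

open_incidents = [Severity.SEV3, Severity.SEV1, Severity.SEV4]
worst = most_severe(open_incidents)
print(worst.name)
```

If your team later reverses the order, only the two helpers need to change, and the rest of the code can keep calling them.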
Using a prefix like “SEV” can help communicate the incident without much explanation. If somebody sends an email saying, “We’ve encountered a SEV 1 incident,” it immediately conveys the context. In contrast, writing, “We’ve encountered a 1” leads to confusion, while “we’ve encountered an incident in our IT infrastructure categorized as level 1” is verbose and even a bit foggy.
While some systems are simply numbered (“0” or “1”) or follow a naming scheme (“SEV 1” or “P1”), other systems rely on words such as “critical” or “high.” The specific terminology isn’t so important, as long as everyone within the company uses the same terms. What matters far more is that everyone shares the same definition of which incidents each level covers.
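A consistent prefix also makes messages easy for tools to produce and recognize. The following sketch assumes a five-level “SEV n” scheme; the helper names and the regular expression are illustrative rather than part of any particular alerting product.

```python
import re
from typing import Optional

# Matches "SEV 1" through "SEV 5" (also "SEV1" or "sev 2"), but not "SEV 10".
_LABEL = re.compile(r"\bSEV\s?([1-5])\b", re.IGNORECASE)

def format_label(level: int) -> str:
    """Render a numeric level as the team's standard label."""
    if level not in range(1, 6):
        raise ValueError("level must be between 1 and 5")
    return f"SEV {level}"

def extract_level(message: str) -> Optional[int]:
    """Return the level mentioned in a message, or None if no label is found."""
    match = _LABEL.search(message)
    return int(match.group(1)) if match else None

print(format_label(1))
print(extract_level("We've encountered a SEV 1 incident"))
print(extract_level("We've encountered a 1"))
```

The bare “1” in the last message is not recognized, which mirrors the confusion a human reader would have without the prefix.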
Individual teams may be tempted to define and implement these severity levels according to their own needs. However, all stakeholders should come together to agree on the terms’ definitions. Then, when an incident occurs, everyone from the responders to management understands what is at stake and how to respond.
The first step toward ensuring an effective incident response is to properly define and implement standardized incident severity levels. Listen to other experts and check out best practices, but be practical when creating severity levels that work best for your team.
Along with using well-defined severity levels, automation can work in the background to improve your incident response’s efficiency. From proactively managing issues to alerting on-call resources to respond to a SEV 1, service reliability platforms like xMatters help teams automate incident response to resolve issues as quickly as possible. Explore xMatters today to see how it can help maintain your infrastructure’s reliability!