When Does a Problem Become an Incident?
Incident management is a practice that seeks to resolve business-impacting events as efficiently as possible. But not every problem that arises requires an incident response, and it’s crucial that teams know the difference between a problem and an incident. Responding to problems may be part of daily routines or small ad hoc projects that don’t require more than one person or a significant time commitment. Responding effectively to incidents, by contrast, can demand a high investment of money, time, and people.
It’s important to define when a problem truly becomes an incident worthy of devoting the full attention of your incident management team. Let’s look at the differences between a problem and an incident.
Problems Versus Incidents
ITIL, the Information Technology Infrastructure Library, defines problems and incidents as:
- A problem is “a cause or potential cause of one or more incidents.”
- An incident is “an unplanned interruption to a service or reduction in the quality of a service.”
It’s fair to characterize the difference between problems and incidents in terms of severity. Problems are less impactful than incidents as they do not cause disruptions. They can, however, contribute to future incidents. When determining whether an event is a problem or an incident, consider the following:
- Does the event render a service unavailable?
- Does the event affect service quality?
- Does the event affect a large number of users?
- Does the event pose a threat to revenue?
- Does the event put the company’s reputation at risk?
To establish whether an event is a problem or an incident in a live environment, assign a severity level based on the level of intervention it demands. If someone has to do something about it right away, it’s likely an incident.
However, even if you don’t need an immediate response, every event should be responded to in a manner consistent with its impact. Problems deemed low severity can often be deceptive, and while they may appear low risk enough to sit at the bottom of a priority queue, they can be the spark that lights a flame if not remedied quickly. Teams shouldn’t sit on small problems for any longer than need be—they just shouldn’t be the alerts forcing resolvers out of bed at night.
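The triage questions above can be sketched as a small classifier. This is only an illustration: the `Event` fields and the rule that any single "yes" answer makes the event an incident are assumptions, and a real team would tune both to its own severity scheme.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Answers to the triage questions for one event (hypothetical schema)."""
    service_unavailable: bool = False
    quality_degraded: bool = False
    many_users_affected: bool = False
    revenue_at_risk: bool = False
    reputation_at_risk: bool = False


def classify(event: Event) -> str:
    """Return "incident" if any triage question is answered yes, else "problem"."""
    needs_immediate_action = any([
        event.service_unavailable,
        event.quality_degraded,
        event.many_users_affected,
        event.revenue_at_risk,
        event.reputation_at_risk,
    ])
    return "incident" if needs_immediate_action else "problem"
```

An event classified as a problem still goes into the priority queue; it just should not page anyone overnight.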
Imagine that a code update inadvertently introduces a minor error that prevents a UI element from displaying. If the absence of that UI element does not interrupt access to the service or degrade its quality, the error qualifies as a problem. If the absence of the element makes the service inaccessible, however, the error becomes a contributing factor to an incident and warrants incident response.
A recent real-world example is the Google Cloud outage of November 2021, which took multiple apps offline, including Spotify, Discord, Etsy, and Snapchat. The outage caused 404 errors for downstream customers using Google Cloud Load Balancing (GCLB). Given its international impact on what were presumably millions of people, the outage was clearly an incident. The underlying issue, however, could have been considered just a problem: the root cause was eventually diagnosed as a bug introduced six months earlier, a race condition that could, in rare cases, push corrupted configuration files to GCLB.
Handling Incident Response Versus Problem Management
Incident response and problem management differ in the impact they have on normal service provision. For incidents, teams need to respond through processes determined in a thoughtful plan, typically including a full suite of incident management capabilities to restore operations as quickly as possible. Problem response, on the other hand, tends to happen after problems are noted, prioritized, and assigned to an individual for review.
An incident begins when monitoring tools detect service metrics straying into unusual territory, which can be as simple as a service going down or running low on resources, or as potentially nuanced as increased error rates. This triggers the workflow described in the incident response plan.
The first step in the response workflow is to contact responders in a timely fashion. An effective incident response plan determines whom to notify by checking which members of a clearly defined team are on call. Those team members should be notified and given the option to acknowledge, either by confirming their availability to work on the incident or by escalating to a different team member.
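The notify, acknowledge, or escalate loop might look something like the sketch below. The `notify` callback is a hypothetical stand-in for whatever paging tool a team uses; it is assumed to return True when the responder acknowledges and False when they decline or escalate.

```python
from typing import Callable, Optional, Sequence


def page_on_call(
    on_call_order: Sequence[str],
    notify: Callable[[str], bool],
) -> Optional[str]:
    """Page responders in escalation order until one acknowledges.

    `notify` is a hypothetical callback: it pages one responder and returns
    True if they acknowledge, or False if they decline or escalate.
    """
    for responder in on_call_order:
        if notify(responder):
            return responder  # this responder now owns the incident
    return None  # nobody acknowledged; the caller should widen the escalation
```

Returning `None` rather than raising leaves the decision about what to do when the whole rotation is unavailable to the response plan itself.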
Responders rely on thoughtfully configured monitoring tools to provide the right amount of contextualized information. Responders like SREs or IT admins need detailed information, such as CI/CD logs, from affected systems alongside the alerts they receive through standard channels like Slack or email. Other stakeholders, such as the Chief Operations Officer or Chief Communications Officer, might be best served with a simple alert notifying them of a service interruption and subsequent messages keeping them apprised of the incident as it progresses.
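As a rough illustration of tailoring context to the audience, the function below builds a detailed message for responders and a short one for everyone else. The role names and message layout are assumptions made for this sketch, not the format of any particular alerting tool.

```python
from typing import Sequence


def build_notification(
    role: str, service: str, status: str, log_lines: Sequence[str]
) -> str:
    """Return a detailed alert for responders and a brief one for stakeholders."""
    summary = f"[{status}] {service}"
    if role == "responder":
        # Responders get the supporting detail alongside the alert itself.
        return "\n".join([summary, *log_lines])
    # Other stakeholders only need to know that the service is affected.
    return summary
```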
At this point, there should be a clear path to determine whether the incident can be resolved quickly with automation alone (for example, by pulling a backup to revert configurations to an earlier state) or whether it requires additional manual intervention, such as replacing faulty hardware, which may qualify it for consideration as a major incident. In both cases, as much of the workflow as possible should be automated, and communication should use well-defined vocabulary in established, accessible channels.
After incident resolution, the incident response team must follow up with a post-mortem documenting the events leading up to the incident, during it, and during its resolution. This stage is also best automated wherever possible, which relieves SREs of having to divide their efforts between resolving the incident and collecting auditable data during high-pressure situations.
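One way to picture that decision is a lookup of automated runbooks keyed by diagnosed cause, falling back to manual intervention when no runbook exists or the runbook fails. The cause names and the `runbooks` mapping are hypothetical, chosen only to mirror the examples in the text.

```python
from typing import Callable, Dict


def resolve(cause: str, runbooks: Dict[str, Callable[[], bool]]) -> str:
    """Try an automated fix for the cause; otherwise flag manual intervention."""
    action = runbooks.get(cause)
    if action is not None and action():
        return "resolved_automatically"
    # No runbook for this cause, or the automated fix did not succeed.
    return "needs_manual_intervention"
```

A "needs_manual_intervention" result is the point at which a team might consider declaring a major incident.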
In a typical organization, a single team performs the roles of both incident response and problem management. However, by treating root cause analysis as a distinct process from real-time response, problem management avoids the tendency of SREs to prioritize immediate solutions and neglect discovering and deploying long-term fixes.
An effective incident post-mortem is valuable for problem management and resolution. The data from logging, categorizing, and prioritizing errors is used to implement a preventative strategy. Engineers identify risks by analyzing incident logs for recurring issues and trends. Then, they synthesize this information with other data from developers, testers, and external partners and suppliers.
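Spotting recurring issues in incident logs can start as simply as counting categories. The sketch below assumes each past incident has already been tagged with a single category string; both the tagging scheme and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_issues(
    categories: Iterable[str], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Return (category, count) pairs seen at least `min_count` times, most frequent first."""
    counts = Counter(categories)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]
```

For example, `recurring_issues(["config", "dns", "config", "disk", "dns", "config"])` returns `[("config", 3), ("dns", 2)]`, which points problem management at configuration changes first.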
Problem management takes into consideration a wider, more holistic analysis of the circumstances contributing to an incident. The long-term solutions it produces require a comparatively slower, more deliberate process to identify which components in a system may have contributed to an event, and it often requires creativity as well as analysis.
It’s important to note that problem management allows for greater involvement of experience and informal knowledge held by staff, but the most successful strategies are implemented in a blameless culture. During an incident, teams save time by identifying what went wrong rather than who was responsible. It’s important for organizations to create an environment that promotes opportunities for learning rather than fear of consequences. In the forensic problem management stage, team members are more likely to take responsibility and offer more objective information in a workplace culture that accepts that accidents happen.
It’s not surprising to see teams refer to problems and incidents interchangeably, as they are closely related. However, the two concepts differ in scope and immediacy. This difference demands a different level of response and a different allocation of limited IT resources.
Incidents require the responders to quickly find a solution that will restore services in the shortest time possible.
Problem management, however, is less immediate and involves taking more time to find root causes and create stable fixes. The response team needs to document the issue, investigate it, and gauge the severity of the event by looking at how significant its impact is on normal operations. They can then decide whether to respond to it as an incident or a problem.