Best Practices for Managing Incidents at Varying Severity Levels
A software incident is an event or unplanned interruption that causes the software to deviate from its intended behavior, affecting the quality of service. With the ever-changing nature of the software industry, incidents are inevitable, particularly in teams that practice iterative software development cycles with constant releases to production. This necessitates a robust incident management strategy.
The first step in an effective incident management strategy is classifying incidents by their impact. Placing incidents into categories that reflect their severity gives incident response teams a guide for gauging how quickly they should remediate an issue and how many resources they should devote to it.
Severity levels refer to the impact the incident has on the software or business. The most common classification of severity is a five-level system:
- SEV1 is a critical issue affecting a significant number of users in a production environment.
- SEV2 is a major issue affecting a subset of users in a production environment.
- SEV3 is a moderate incident causing errors or minor problems for a small number of users.
- SEV4 is a relatively minor issue affecting customer experience without degrading core functionality.
- SEV5 is a low-level issue, like UI errors, that doesn’t impair product usability.
Properly defining the severity levels is a huge help for team members dealing with these incidents.
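As a rough illustration, the five levels can be written down as an enumeration with a small triage helper. This is only a sketch: the `classify` function, its parameters, and the 25% and 5% thresholds are hypothetical placeholders, and each team would need to pick its own cut-offs for what counts as a "significant number" or a "subset" of users.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Five-level scale from the list above; a lower number is more severe."""
    SEV1 = 1  # critical: significant number of production users affected
    SEV2 = 2  # major: a subset of production users affected
    SEV3 = 3  # moderate: errors or minor problems for a small number of users
    SEV4 = 4  # minor: customer experience affected, core functionality intact
    SEV5 = 5  # low: cosmetic issues such as UI errors


def classify(users_affected_pct: float, core_degraded: bool, cosmetic_only: bool) -> Severity:
    """Map a few triage answers to a severity level.

    The percentage thresholds are illustrative assumptions, not industry standards.
    """
    if cosmetic_only:
        return Severity.SEV5
    if not core_degraded:
        return Severity.SEV4
    if users_affected_pct >= 25:  # hypothetical cut-off for "significant"
        return Severity.SEV1
    if users_affected_pct >= 5:   # hypothetical cut-off for "a subset"
        return Severity.SEV2
    return Severity.SEV3
```

Using an `IntEnum` keeps the levels comparable, so a check such as `severity <= Severity.SEV2` reads naturally as "SEV2 or worse".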
Responding to Incidents Based on Severity
This section examines each level in detail and offers best practices for each one, so that you can remediate an incident with a response proportional to its impact.
SEV1 is the highest level of severity. This could mean a data leak or unexpected downtime that impacts a significant share of customers. The target response time is immediate, and incidents of this type should typically be resolved within four to eight hours.
When a SEV1 incident occurs, the responding team should contact relevant stakeholders, send internal communication to get the help they need, and send external communication to assure users that the team is working on a solution. The team’s focus is to immediately restore the service and functionality that went down, even if that means a “quick fix” that needs deeper investigation later. Updates on the progress of the fix should go out to internal and external stakeholders every 30 minutes. Containment and isolation are the two main steps in remediating a SEV1 incident. Usually, this involves isolating the server where the incident occurred and severing its connections to other servers.
Once a fix is in production, the impacted teams must monitor the application for at least an hour to confirm that it has recovered. Every SEV1 incident should include a root cause analysis (RCA) report or post-mortem to understand what caused the downtime and to implement a long-term fix that prevents it from happening again. You should also document all findings and share them with the internal teams. SEV1 incidents, if not resolved quickly, can cause catastrophic losses for companies that depend on their users. For example, Facebook reportedly lost nearly $100 million during a six-hour SEV1 outage in October 2021.
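The one-hour monitoring rule can be expressed as a simple check over health-check results. The sketch below assumes that some monitoring system already produces `(timestamp, healthy)` samples; the function name and the sample format are hypothetical.

```python
from datetime import datetime, timedelta

# Minimum observation period after a SEV1 fix, as described above.
MONITOR_WINDOW = timedelta(hours=1)


def recovery_confirmed(fix_deployed_at: datetime,
                       checks: list[tuple[datetime, bool]],
                       now: datetime) -> bool:
    """Return True only if the full window has elapsed and every
    health check taken since the fix was deployed came back healthy."""
    if now - fix_deployed_at < MONITOR_WINDOW:
        return False  # too early to declare recovery
    after_fix = [healthy for ts, healthy in checks if ts >= fix_deployed_at]
    # Require at least one sample so that "no data" is not read as "healthy".
    return bool(after_fix) and all(after_fix)
```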
SEV2 incidents are second in priority. While they don’t demand the same urgency as SEV1, you should still treat them as a top priority, because an unresolved SEV2 can eventually cause damage as severe as a SEV1. The target response time for a SEV2 is within 10 minutes, and the target resolution time is 24 hours.
The focus of responding teams should be to minimize the damage caused by SEV2 and prevent it from growing further. Usually, this involves steps such as:
- Rolling back a bad deployment
- Performing a hard restart if it’s a data center crash
- Diverting traffic to backup data centers if it’s a service degradation
The incident response team should stay in touch with the team working on the fix and send out internal and external communications every four hours until the incident is resolved. If you can’t find a long-term solution within 24 hours, you should get additional experts to fix it quickly, then revisit the incident for a long-term fix.
Upon resolution, thoroughly document the incident with an RCA and a post-mortem report.
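The communication cadences for SEV1 (every 30 minutes) and SEV2 (every four hours), together with the 24-hour escalation point for SEV2, can be sketched as a pair of small helpers. The function names and the string keys are illustrative assumptions; a real setup would more likely rely on the scheduling features of whatever on-call tool the team uses.

```python
from datetime import datetime, timedelta

# Update cadence per severity, taken from the text above.
UPDATE_INTERVAL = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(hours=4),
}


def updates_due(severity: str, opened_at: datetime, now: datetime) -> list[datetime]:
    """List every status-update time that has come due since the incident opened."""
    interval = UPDATE_INTERVAL[severity]
    due = []
    next_update = opened_at + interval
    while next_update <= now:
        due.append(next_update)
        next_update += interval
    return due


def needs_more_experts(severity: str, opened_at: datetime, now: datetime) -> bool:
    """A SEV2 still open after 24 hours should pull in additional experts."""
    return severity == "SEV2" and now - opened_at >= timedelta(hours=24)
```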
Although SEV3 incidents lack the urgency of SEV1 and SEV2, you should still treat them with priority in the absence of any other incidents. Since SEV3 incidents don’t generally impact production, you have a longer time to find a stable and long-term solution. The resolution window for SEV3 is around 48 to 72 hours, and the response time should be within one hour.
SEV3 incidents affect a subset of users, causing a partial loss of functionality. Steps to remediate the incident typically include pushing a fix for the bug as soon as possible, along with end-to-end test cases that can detect similar bugs before code is merged. Additionally, an in-depth code review can help identify bugs earlier. You should also document SEV3 incidents with an RCA, but you don’t need a post-mortem analysis: the longer resolution window means a long-term solution should already have been implemented as part of the initial fix.
SEV4 incidents often don’t require a full-blown incident response, and responders can handle them as they see fit. The impact of SEV4 on production is relatively low compared to other incidents. Its resolution time varies from three to five days.
SEV4 incidents commonly result from slowness in some parts of an application, such as a slower-than-average load time for some functionality. You can prevent these by requiring a Core Web Vitals (CWV) check before merging any changes and by reducing the application’s bundle size with dynamic imports. Since these incidents are a low priority, there’s enough room to conduct in-depth investigations and propose resolutions. However, it is vital to keep tracking the issue and to act if it escalates.
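One way to picture a mandatory Core Web Vitals check is a small merge gate. The sketch below assumes that measurements from a test run are already available as a dictionary; the key names and the function are hypothetical, while the limits are Google's published "good" thresholds for LCP, INP, and CLS.

```python
# Upper bounds for a "good" rating in Google's Core Web Vitals guidance.
GOOD_THRESHOLDS = {
    "lcp_seconds": 2.5,  # Largest Contentful Paint
    "inp_ms": 200,       # Interaction to Next Paint
    "cls": 0.1,          # Cumulative Layout Shift
}


def failing_vitals(measured: dict[str, float]) -> list[str]:
    """Return the metrics that are missing or exceed their threshold.

    An empty list means the change may be merged.
    """
    failures = []
    for name, limit in GOOD_THRESHOLDS.items():
        value = measured.get(name)
        if value is None or value > limit:
            failures.append(name)
    return failures
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when no measurement was taken would not be much of a gate.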
SEV5 issues either don’t directly impact the user experience or only degrade the quality of the experience in subtle ways. The impact of SEV5 incidents is the lowest. There’s no need to define a resolution window, as you can handle these issues through tickets or issue-tracking systems.
These problems are often cosmetic bugs and minor design issues that don’t impede product usability. They might involve creating bug tickets to fix user interface (UI) issues. You should prioritize these incidents according to the resolution team’s availability and the urgency of the fix. You can prevent SEV5 problems by having quality assurance (QA) teams perform thorough UI checks. However, SEV5 problems should still be well documented and recorded. You should take time to keep track of them and have them addressed by the appropriate team members when time allows.
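The targets described for the five levels can be collected into a single lookup table, which makes them easier to apply consistently. This is a minimal sketch using the figures from this article; the class and field names are hypothetical, resolution windows use the upper bound of each stated range, and `None` marks a target the article does not define.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    respond_within: Optional[timedelta]  # None: no response target stated
    resolve_within: Optional[timedelta]  # upper bound of the resolution window
    documentation: str


POLICIES = {
    "SEV1": SeverityPolicy(timedelta(0), timedelta(hours=8), "RCA or post-mortem"),
    "SEV2": SeverityPolicy(timedelta(minutes=10), timedelta(hours=24), "RCA and post-mortem"),
    "SEV3": SeverityPolicy(timedelta(hours=1), timedelta(hours=72), "RCA only"),
    "SEV4": SeverityPolicy(None, timedelta(days=5), "track the issue"),
    "SEV5": SeverityPolicy(None, None, "ticket in an issue tracker"),
}


def resolution_deadline(severity: str, opened_at: datetime) -> Optional[datetime]:
    """Latest time by which the incident should be resolved, if a window is defined."""
    limit = POLICIES[severity].resolve_within
    return None if limit is None else opened_at + limit
```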
It might seem difficult to gain control over incidents. However, sophisticated automated monitoring can give you some control and help you make better incident response decisions. Automated monitoring tools like New Relic can continuously monitor your servers and load times and notify your team when they spot breaches of application performance thresholds. You can also combine New Relic with on-call tools like Opsgenie, which notify the incident response team of a SEV1 or SEV2 incident even during off hours. You can customize the notifications to the team’s needs and configure them to cover lower-level incidents as well. Other tools that offer adaptive incident management combine the monitoring and on-call features above to provide an automated incident response process.
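The routing behavior described above, where SEV1 and SEV2 page the on-call responder at any hour and lower levels are handled less intrusively, might look like the following in plain Python. It does not use the New Relic or Opsgenie APIs; the destination names and the weekday 09:00 to 17:00 business-hours window are assumptions made for illustration.

```python
from datetime import datetime


def route_alert(severity: int, at: datetime) -> str:
    """Pick a notification destination for an incident of the given level (1-5)."""
    # Assumed business hours: Monday to Friday, 09:00 to 17:00 local time.
    business_hours = at.weekday() < 5 and 9 <= at.hour < 17
    if severity <= 2:
        return "page_on_call"   # SEV1 and SEV2 page someone even during off hours
    if severity <= 4 and business_hours:
        return "team_channel"   # SEV3 and SEV4 notify the team while it is online
    return "ticket_queue"       # everything else waits in the issue tracker
```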
In addition to using these tools, you should also hold frequent operational health meetings with your teams to revisit recent incidents and brainstorm how to avoid or mitigate them in the future. Depending on your company, these could be done biweekly or monthly. Keeping everyone on the same page with your resolution strategies will ensure a smoother fix when incidents occur.