Uptime Blog

What is Incident Response?

When a service is down, a system is failing, or a security issue is unfolding, organizations need a solid incident response process to get up and running again. Incident response isn’t just for high-severity, lights-out incidents either; if you’ve rebooted your computer to fix a problem, you’ve been an incident responder yourself!

Incidents happen, and any successful organization knows that instead of pretending that one day nothing will ever go wrong, it’s far more useful to develop a comprehensive operational response plan. And to do so, you need to know what incident response is! Let’s get into it.

What is Incident Response?

Incident response describes the process and methodology an organization uses to handle any event that impacts the organization’s services. The term refers to the steps the organization takes to mitigate the incident’s consequences. While the mitigation itself can range in scope from something quick, like restarting a service, to larger projects that require substantive changes, the incident response steps that lead your organization to determine the proper mitigation should remain largely the same.

Organizations typically record these steps in a formal document called an incident response plan. Incident response processes differ between individual companies and industries, depending on the nature of their business and the assets involved in their operation. For example, a federal government agency might focus on protecting confidential information while an online gaming platform might focus on reducing lag.

Organizations undertake an incident response to resolve issues impacting their services, helping keep their services available and customers happy while protecting systems and information from compromise.

The Anatomy of Incident Response

There are four stages to incident response:

    • Preparation
    • Detection
    • Resolution
    • Postmortem

Let’s consider how a hypothetical organization, SoftwareCo, might work through these four stages to respond to security incidents.

Preparation

The first phase of incident response begins with a review of SoftwareCo’s current security protocols. This phase starts with a risk assessment of existing threats or vulnerabilities in company systems. Then, the company prioritizes those risks and reduces them to within an acceptable threshold.

Next, SoftwareCo takes inventory of its assets by priority. This list might include servers, employee workstations, applications, and networks. The inventory helps determine which incidents could impact each asset and how urgently the team must respond. At this point, the company also updates malware protection, patches any known issues, and reconfigures network and host security as needed.
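As a rough illustration of this inventory step, the Python sketch below ranks a handful of assets by priority and maps each priority to a response urgency. The asset names, the 1-to-3 priority scale, and the urgency labels are all hypothetical; a real inventory would come from SoftwareCo’s own asset records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    kind: str      # e.g. "server", "workstation", "application", "network"
    priority: int  # assumption: 1 is most critical, larger numbers are less critical


# Hypothetical mapping from asset priority to how quickly the team responds.
URGENCY_BY_PRIORITY = {
    1: "page on-call immediately",
    2: "respond within the hour",
    3: "next business day",
}


def inventory_by_priority(assets):
    """Return the assets sorted so the most critical come first."""
    return sorted(assets, key=lambda a: (a.priority, a.name))


def response_urgency(asset):
    """Look up the urgency for an asset, falling back to manual triage."""
    return URGENCY_BY_PRIORITY.get(asset.priority, "triage manually")


assets = [
    Asset("employee-laptops", "workstation", 3),
    Asset("billing-api", "application", 1),
    Asset("office-vpn", "network", 2),
]

ranked = inventory_by_priority(assets)
for asset in ranked:
    print(f"{asset.name}: {response_urgency(asset)}")
```

Keeping urgency in a separate lookup table means the team can adjust response expectations without touching the inventory itself.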

SoftwareCo drafts a communication plan that lays out who to contact during an incident, when to contact them, and how. The company assigns roles and responsibilities to the main response team and establishes lines of communication to other stakeholders, like human resources, legal teams, or the communications manager. It also records contact details, like phone numbers and email addresses. All of this information is entered into the company’s incident management tool so that on-call management can be automated to shorten response and resolution times.
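A communication plan like this can be thought of as a small data structure: who to contact, when, and how. The sketch below is one possible way to model it in Python; it does not reflect how xMatters or any other incident management tool stores on-call data. The names, roles, timings, and contact details are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    name: str
    role: str
    phone: str
    email: str


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the incident was opened
    contacts: tuple     # Contact objects to notify at this step


def contacts_to_notify(plan, minutes_elapsed):
    """Return everyone who should have been contacted by this point."""
    notified = []
    for step in sorted(plan, key=lambda s: s.after_minutes):
        if step.after_minutes <= minutes_elapsed:
            notified.extend(step.contacts)
    return notified


# Hypothetical plan: on-call engineer first, then a lead, then other stakeholders.
plan = [
    EscalationStep(0, (Contact("Avery", "on-call engineer", "555-0100", "avery@example.com"),)),
    EscalationStep(15, (Contact("Blake", "response team lead", "555-0101", "blake@example.com"),)),
    EscalationStep(30, (
        Contact("Casey", "legal", "555-0102", "casey@example.com"),
        Contact("Devon", "communications manager", "555-0103", "devon@example.com"),
    )),
]

for contact in contacts_to_notify(plan, minutes_elapsed=20):
    print(f"{contact.role}: {contact.name} ({contact.phone}, {contact.email})")
```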

Finally, SoftwareCo prepares its communication tools, facilities, analysis tools and resources, and mitigation tools.

Incident Detection

Detection can happen in multiple ways: through an employee, through a monitoring tool, or, in the worst-case scenario, through a customer.

A SoftwareCo admin may have problems logging in to an application and suspect that they have detected a larger issue. Or, the admin may receive a flood of alerts from their monitoring tools about an unverified user logging in to a tool. They could also receive messages from a support team member that customers are experiencing issues with the tool. In all these circumstances, an issue has been detected, and it’s time for the incident response team to get to the bottom of it.

Detection is challenging for several reasons. One reason is that events come from multiple sources with varied accuracy. SoftwareCo knows that some incidents can fly under the radar even with staff expertise, which is why choosing the right detection tool is so important.

After detecting the incident, SoftwareCo moves on to the next phase: resolution.

Incident Response & Resolution

Now that an issue has been detected, it’s time to resolve it. Thankfully, SoftwareCo was prepared, so resolution can start right away. Because this is a security incident, the appropriate resolution team is pulled into a conference call via xMatters and begins to formulate a plan.

During the preparation phase, SoftwareCo built baseline profiles of its systems and networks by monitoring regular activity and typical bandwidth use, so the team can use these as a point of comparison during an incident. Staff have also become familiar with the expected behavior of systems and applications, and they maintain a log retention policy so they can consult old logs to check prior behavior and review past incidents.
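To make the idea of a baseline comparison concrete, here is a minimal Python sketch that summarizes normal bandwidth readings and flags values that stray too far from them. The z-score approach, the threshold of 3, and the sample readings are assumptions chosen for illustration; the article does not prescribe a particular method.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize normal activity as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        # With no variation in the baseline, any different value stands out.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical bandwidth readings (Mbps) gathered during normal operation.
normal_bandwidth_mbps = [100, 104, 98, 101, 97, 103, 99, 102]
baseline = build_baseline(normal_bandwidth_mbps)

for reading in (102, 180):
    status = "anomalous" if is_anomalous(reading, baseline) else "normal"
    print(f"{reading} Mbps: {status}")
```

A simple statistical check like this is only a starting point; traffic with daily or weekly patterns would need a baseline per time period.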

In real time, the SoftwareCo security team employs a few different methods to analyze this incident. They begin by reviewing consolidated incident information, which their automated tool gleans from multiple sources like routers and applications to correlate helpful event information. If they receive error codes, the team begins referring to documentation to confirm the meaning of these codes and appropriate response options. During this stage, a scribe documents each step of the process while also collecting their findings. Documentation and evidence are essential to further analysis and preventing similar issues in the future.
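One simple way to picture this kind of correlation is grouping events from different sources that occur close together in time. The Python sketch below does only that; the `Event` fields, the five-minute window, and the sample events are hypothetical, and real correlation tools typically use far richer signals than timestamps alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    source: str        # e.g. "router", "application"
    timestamp: datetime
    message: str


def correlate(events, window=timedelta(minutes=5)):
    """Group events that occur within `window` of the previous event in the group."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][-1].timestamp <= window:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


start = datetime(2024, 1, 1, 12, 0)
events = [
    Event("application", start + timedelta(minutes=2), "login from unverified user"),
    Event("application", start + timedelta(hours=1), "scheduled job finished"),
    Event("router", start, "unusual outbound traffic"),
]

for group in correlate(events):
    print([f"{e.source}: {e.message}" for e in group])
```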

Next comes containment, which buys time for decision-making and limits further damage. Containment actions can include failing over to redundant architecture, engaging alternate service providers, scaling up available resources, or rolling back recent deployments. Strategies vary depending on the incident, but SoftwareCo has prepared criteria for deciding which approach to use.
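Prepared criteria like these can be as simple as an ordered set of checks. The sketch below is a hypothetical example of such a decision function in Python; the three input signals and the order in which they are checked are assumptions, not SoftwareCo’s actual criteria.

```python
def choose_containment(recent_deploy, provider_outage, resource_exhausted):
    """Pick a containment action from a few yes/no signals (illustrative only)."""
    if recent_deploy:
        return "roll back recent deployment"
    if provider_outage:
        return "engage alternate service provider"
    if resource_exhausted:
        return "scale up available resources"
    # Assumed default when no specific cause is evident yet.
    return "fail over to redundant architecture"


print(choose_containment(recent_deploy=True, provider_outage=False, resource_exhausted=False))
print(choose_containment(recent_deploy=False, provider_outage=False, resource_exhausted=False))
```

Writing the criteria down ahead of time, in whatever form, spares the team from debating strategy while the incident is still causing damage.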

After containing all affected systems, SoftwareCo eradicates the issue. Eradication involves deleting malware, disabling affected user accounts, and patching vulnerabilities. They may also have to fix software bugs, replace burnt-out equipment, or take other appropriate actions.

Recovery is typically a multiphase process. In the short term, admins begin bringing systems back online while monitoring their behavior. The company must restore processes to their last known good state, repair or replace damaged files, change user credentials, and rebuild systems. In the long term, recovery may require changes to security policy and infrastructure so that similar incidents do not recur.

Incident Postmortem

In the final phase, SoftwareCo works through the incident postmortem process to review the incident from start to finish. The review provides a valuable opportunity to learn from experience. The company can use the analysis to answer several questions:

    • How could the response be better?
    • What caused the incident to happen in the first place?
    • What are the consequences of the incident?
    • What should be done to prevent a similar incident from occurring again?

Understanding these aspects of the incident helps the company improve its response process and defend against future attacks, bugs, and failures.

A postmortem report can also help regain the confidence of SoftwareCo’s user base. The report demonstrates accountability and shows that SoftwareCo is acting to ensure that a particular incident doesn’t happen again.

Conclusion

Instead of hoping incidents never happen (because they undoubtedly will), teams need to know how to respond effectively so that when incidents do occur, they know exactly what to do. By becoming familiar with the four phases of the process (preparation, detection, resolution, and postmortem), organizations are well equipped to create an incident response plan that fits their needs.

With so much to plan for, incident response can seem daunting. But with the right tools, it doesn’t need to be. xMatters is a service reliability platform that helps manage workflows, improve communication, and provide real-time analytics as incidents unfold. Learn more about xMatters and how incident response automation helps you in these high-stakes, high-stress moments.
