Uptime Blog

What You May Not Know About Major Incident Management

You likely deal with major incidents regularly, but do you know who first coined the term? You also probably use the best tools on the market to help you fix those incidents, but do you know what some of the first tools were?

When incident management is part of your day-to-day, it’s easy to think you know it all. But we have a hunch that there are some interesting facts that haven’t crossed your mind yet! Let’s clarify what we mean by a major incident, explore some recent examples, and highlight some facts you may not know about the history of major incident management.

What Distinguishes Major Incident Response from a Typical Incident Response?

Not all incidents cause work interruptions or a loss of service. For example, if an office printer isn’t connecting to Wi-Fi, it likely doesn’t cause a severe work interruption. But if that office printer catches fire because of an electrical issue, that certainly is a high-severity issue.

A major incident results in widespread damage or outage to an organization’s service or production processes. These incidents can be challenging to resolve, require substantial time to handle, and may need different steps or processes than routine incidents. Severe incidents typically have two characteristics: they affect core business functions, and they run the risk of fines under regulations and standards such as GDPR, HIPAA, and PCI DSS. Both of these concerns can result in the loss of revenue or reputation.

When minor incidents happen, teams tend to prioritize the concern and add it to their to-do list. But major incidents require a rapid, focused response from on-call resolvers, as well as a range of stakeholders who must come together to resolve the incident.

What You Might Not Know About Major Incident Management

What’s the Term’s Origin?

The term Major Incident Management was first used in 1970. The phrase stemmed from the Incident Command System (ICS), a concept developed in response to a series of major wildfires in Southern California and Arizona. ICS is now common among emergency management and incident response personnel in the United States, where it serves as the base model for responding to all types of incidents, such as natural disasters, radiation incidents, and cyberattacks targeting public utilities.

The catastrophic wildfires caused severe losses of property and life. Despite the enormous resources dedicated to stopping these fires, the response was poorly coordinated.

The organizational structure and chain of command within each responding organization were unclear, and no single leader oversaw the process. The organizations also used different terminology when communicating with one another. On top of that, they depended on third-party consultants to handle the early stages of the incident, and those consultants did not escalate the case promptly, which made countering the incident’s effects more challenging and time-consuming.

What are Some Recent Major Incidents?

The ransomware attack against Colonial Pipeline happened in May 2021. It shut down the largest fuel pipeline in the US, which delivers fuel from Gulf Coast refineries to East Coast states.

The attackers used an employee’s compromised VPN account. They found the password in an earlier data leak posted on the darknet and used it to gain access to the Colonial network. The company paid about US$5 million in Bitcoin to regain access to the data held for ransom and resume normal operations.

Another newsworthy incident occurred on October 4, 2021, when Facebook experienced a significant network outage. The incident lasted for six hours, preventing users from accessing their Facebook, WhatsApp, and Instagram accounts. This incident’s impact was widespread considering the applications’ significant user bases.

Early Tools of Major Incident Management

The earliest tools used to handle major incidents relied mainly on radio telecommunications for communication between parties, along with special equipment suited to each incident type. As technology progressed, computer-aided systems and digital technologies took over the various phases of the incident management process, and they are the norm today.

For example, in the past, a forest ranger in a tower, equipped with binoculars, would watch for wildfires. Now, a solar-powered gas sensor linked to computer systems can detect a smoldering fire and keep it from spreading by activating a fire suppression system. Needless to say, we’ve improved by leaps and bounds at tackling major incidents.

What is the Cost of Major Incidents?

In March 2015, a 12-hour App Store outage cost Apple $25 million. In March 2019, a 14-hour outage cost Facebook an estimated $90 million in revenue.

But the most expensive downtime on record belongs to Amazon, although the cost fell less on the online retail giant itself than on the companies that use Amazon Web Services (AWS). With AWS being essentially a backbone of the Internet, a 4-hour outage in February 2017 cost S&P 500 companies an estimated $150 million and U.S. financial-service companies an estimated $160 million.
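
To put those totals on a common footing, it can help to look at cost per hour of downtime. The short Python sketch below does that arithmetic using only the estimates quoted above. The per-hour rates are derived for illustration, not reported figures, and adding the two AWS estimates together is our own assumption: the two groups may overlap, so the combined number could double-count some companies. The names `OUTAGES`, `cost_per_hour`, and `summarize` are made up for this example.

```python
# Estimates quoted in this post: (label, outage length in hours, estimated cost in USD).
# The combined AWS figure is an assumption: it simply adds the two quoted
# estimates, which may double-count companies that fall into both groups.
OUTAGES = [
    ("Apple App Store, 2015", 12, 25_000_000),
    ("Facebook, 2019", 14, 90_000_000),
    ("AWS, 2017 (combined estimates)", 4, 150_000_000 + 160_000_000),
]


def cost_per_hour(total_cost, hours):
    """Average cost for each hour of downtime."""
    return total_cost / hours


def summarize(outages):
    """Return (label, hours, total cost, cost per hour) for each outage."""
    return [
        (label, hours, total, cost_per_hour(total, hours))
        for label, hours, total in outages
    ]


if __name__ == "__main__":
    for label, hours, total, hourly in summarize(OUTAGES):
        print(f"{label}: ${total:,} over {hours} h = about ${hourly:,.0f} per hour")
```

Seen this way, the AWS outage stands out even more: it was the shortest of the three, yet it carries the largest estimated total, so its average hourly cost is far higher than the other two.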

Get Ahead of Major Incidents

Major incidents range from natural disasters to online service outages. As the affected technology has evolved, so have the tools to tackle these incidents, and so has their price tag. Knowing more about the history of major incident management and some of the bigger examples may help put your own incidents into perspective, or at least give you something to talk about around the proverbial watercooler.

xMatters, a service reliability platform, helps organizations with incident management by automating workflows, enabling collaborative responses, and providing actionable analytics to improve for the future.

Ready to try out xMatters? Let us show you how it can transform your operations—request your demo today.

 
