What’s a Major Incident Anyway?

A Category 2 hurricane nearing landfall, a broken freezer leaking artisanal vegan ice cream, a web store where the shopping cart can’t remember the items added to it, and a flooded basement in the sciences building on a college campus. What do these four events have in common? Depending on the industry you work in, they’re major incidents.

The term ‘major incident’ sounds like it always means a life-or-death situation or some historically monumental occurrence. But more often than not, it is neither. An unclear understanding of what a major incident is can hold you and your business back from creating the systems needed to manage and resolve incidents quickly and keep operations running smoothly.

So, what is a major incident anyway?

Major Incidents 101

Not knowing how to define a major incident is a common problem. Depending on who you ask, a major incident is defined either in narrow, highly specific terms or in broad, vague ones. For example:

    • The Free Dictionary defines a major incident as “an event or situation that threatens serious damage to human welfare, the environment, or the security of the state”.
    • Atlassian defines a major incident as “an emergency-level outage or loss of service”.
    • Another source defines major incident in different ways even on the same web page, including “an occurrence of catastrophic proportions, resulting from the use of plant or machinery, or from activities at a workplace” and “an incident associated with the Distribution of electricity, which results in a significant interruption of service, substantial damage to equipment, or loss of life or significant injury to human beings, or as otherwise directed by the Commission”.

While these definitions are quite different, at their core they share a few commonalities. A major incident is an event, or series of events, that disrupts what’s typically considered regular or normal circumstances. The impacts of major incidents can indeed be catastrophic, or they can be as comparatively minor as an interruption of service. What all major incidents have in common is that they need someone to take action, whether it’s to prevent a disaster or restore a service.

How do you define major incidents?

Now that you have an idea of what a major incident is, it’s time to consider how to define a major incident for your business.

It would be impossible to provide a one-and-done definition of what a major incident is for your business without being in your shoes. What you see in your everyday operations, and how you respond to incidents, differs from the next person’s experience. How the business is managed, and what its goals are, also play an integral part in deciding which incidents are major and which aren’t.

So, think about the last incident you dealt with that might be considered major. Does it check any of the boxes below? If it does, it was likely a major incident.

      • Did the incident put anyone at risk of harm?
      • Did the incident pose a threat to customers’ personal or data security?
      • Did the incident put the business at risk of losing customers? 
      • Did the incident put the business at risk of losing new sales?
      • Did the incident put the business at risk of losing employees?
      • Did the incident pose a financial loss?
      • Did the incident impact regular business operations?
      • Did the incident require human intervention to resolve?

Consider the example at the beginning of this blog: “a broken freezer leaking artisanal vegan ice cream”. You probably didn’t think this was a major incident. But it would have put a business at risk of losing new sales, it would have had financial costs in lost product, and it would have impacted regular operations. And if someone had slipped and fallen, it could also have exposed the business to a significant lawsuit for putting a customer’s safety at risk. So, while it’s not fair to say that a hurricane and melted ice cream are equal in terms of severity, both are major incidents.
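
The checklist above can be sketched as a small piece of code. This is only an illustration of the "any box checked means likely major" rule described in this post; the class, field, and function names are hypothetical and are not part of any product.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class IncidentChecklist:
    """One yes/no answer per checklist question (field names are illustrative)."""
    risk_of_harm: bool = False
    threat_to_customer_security: bool = False
    risk_of_losing_customers: bool = False
    risk_of_losing_sales: bool = False
    risk_of_losing_employees: bool = False
    financial_loss: bool = False
    impacts_operations: bool = False
    needs_human_intervention: bool = False


def checked_boxes(answers: IncidentChecklist) -> List[str]:
    """Return the name of every question that was answered 'yes'."""
    return [f.name for f in fields(answers) if getattr(answers, f.name)]


def is_major_incident(answers: IncidentChecklist) -> bool:
    """Per the checklist, a single 'yes' makes the incident likely major."""
    return len(checked_boxes(answers)) > 0


# The leaking freezer, using the impacts named in the paragraph above.
freezer_leak = IncidentChecklist(
    risk_of_harm=True,
    risk_of_losing_sales=True,
    financial_loss=True,
    impacts_operations=True,
)
print(is_major_incident(freezer_leak))
print(checked_boxes(freezer_leak))
```

Your own business may weigh some questions more heavily than others, so treat the "any box" rule as a starting point rather than a fixed definition.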

How do you handle major incidents?

When you were faced with a major incident, how was it handled? Reflecting on this can tell you a lot about how you’re currently prioritizing major incidents, and highlight ways you should consider changing your current operations.

Because we tend to classify too few incidents as major, we also tend not to respond to them in the most effective way. Situations like not knowing there’s a bug in an app until a customer reports it, or having to personally call on-call responders, don’t seem like they could have much impact. But if the bug in that app is an error between the database and the payment page, the business has suddenly lost a number of potential sales. And if no one reports that bug for two weeks, that financial loss adds up quickly.

Many major incidents rely on manual intervention to be acknowledged, responded to, and resolved. But manual intervention can be slow, and there’s truth in the saying “time is money”. Unfortunately, the answer to this problem isn’t training your employees to be more responsive or work faster. The answer is for your employees to work smarter, with the support of automated processes.

Upgrading Your Major Incident Response Practices

Automating certain processes in your daily operations can make a significant impact on your major incident response practice and in turn the whole of operations in your business.

Consider our melted ice cream example. Instead of waiting for a store clerk to find that a freezer door was left open and hundreds of pints of ice cream are oozing onto the floor, an integrated workflow between the software controlling the freezer temperatures and xMatters would monitor and log temperature data. When a temperature is recorded above the preset limit, a notification is sent to the on-call store manager on the device of their choosing, alerting them that something is wrong with the freezer. The manager can quickly respond to the incident, closing the freezer door and checking that the product is still safe to sell. Presto, incident avoided.
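
To make the idea concrete, here is a minimal sketch of the threshold check at the heart of that workflow. It does not use the xMatters API: the limit value, the function name, and the `notify` callback are all assumptions made for illustration, and in practice the notification step would be handled by your incident management tool.

```python
from typing import Callable, List, Sequence

# Hypothetical preset limit, in degrees Celsius (an assumption for this sketch).
TEMP_LIMIT_C = -15.0


def check_freezer(
    readings: Sequence[float],
    notify: Callable[[str], None],
    limit_c: float = TEMP_LIMIT_C,
) -> bool:
    """Send one notification for the first reading above the limit.

    Returns True if an alert was sent, False if every reading was in range.
    """
    for temp_c in readings:
        if temp_c > limit_c:
            notify(f"Freezer at {temp_c:.1f} C, above the {limit_c:.1f} C limit")
            return True
    return False


# Stand-in for paging the on-call store manager: collect messages in a list.
sent: List[str] = []
alerted = check_freezer([-18.2, -17.9, -12.4, -9.0], notify=sent.append)
print(alerted)
print(sent)
```

The check stops at the first out-of-range reading so the manager gets a single alert rather than one for every reading that follows.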

Or, in our shopping cart example, instead of waiting for a customer to log a complaint that the payment process isn’t working correctly, an integrated workflow set to monitor the toolchain would identify the connection error. A notification would be sent to the on-call responders, say a DevOps Engineer and a Project Manager, alerting them to the incident mere seconds after it was identified. From there, they can quickly hop on a call, find the root cause of the issue, and get working on a resolution. Because the problem is identified and resolved so quickly, most customers likely won’t even realize there was one, and the business greatly reduces its risk of lost sales.
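
The same pattern can be sketched for the shopping cart case: run a health check, and if it fails with a connection error, page everyone who is on call. Everything here is hypothetical, including the responder names, the `run_check` function, and the simulated failure; it is meant to show the shape of the workflow, not a real integration.

```python
from typing import Callable, Dict, List


def run_check(
    check: Callable[[], None],
    on_call: List[str],
    page: Callable[[str, str], None],
) -> bool:
    """Run one health check and page every on-call responder if it fails.

    Returns True when the check passes, False when a ConnectionError is raised.
    """
    try:
        check()
    except ConnectionError as error:
        for responder in on_call:
            page(responder, f"Payment check failed: {error}")
        return False
    return True


def broken_payment_check() -> None:
    # Stand-in for a real probe of the connection between cart and payment page.
    raise ConnectionError("cart service cannot reach the payment database")


# Stand-in for a paging service: remember the message sent to each responder.
pages: Dict[str, str] = {}


def record_page(responder: str, message: str) -> None:
    pages[responder] = message


healthy = run_check(
    broken_payment_check,
    on_call=["devops-engineer", "project-manager"],
    page=record_page,
)
print(healthy)
print(pages)
```

Running a check like this on a schedule is what turns a two-week-old customer complaint into an alert that arrives seconds after the failure.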

With incident management tools in place, there’s also the benefit of simplified incident post-mortems. Instead of guessing why an incident occurred, there’s data available to you to review when, where, and sometimes how the incident occurred. This lowers the chances of the same incident happening twice, and makes for fewer major incidents going forward.

Major incidents aren’t automatically life-or-death situations, but they also shouldn’t be treated like they don’t make a notable impact on businesses. By knowing how to define a major incident and how to respond to it effectively, you can ensure your operations are running the way they should and focus your attention on the work that actually matters. Try upgrading your processes yourself, and build your first workflow today!
