What Really Happens During an IT Event Flood—and How to Control It
One way to verify the effectiveness of IT event management is to measure the impact of an IT event flood with flood control in place and without. This blog considers an actual event flood and how event management could have eased the pain considerably. Only the names have been changed to protect the
A major concept or concern in many SRE environments today is “observability”. More than just a measure of how much instrumentation or monitoring you have, observability gets to the root of how well you can know the health of a system based on its outputs. Indeed, two of the four pillars of observability (according to Twitter) are monitoring and alerting/visualization.
But how observable can a system really be when its monitoring tools cause the alerting systems to shout the same thing over and over and over again, inundating teams with hundreds of similar events while they’re trying to resolve the issue? Well, we think that’s where event flood control can help clear things up.
As part of the xMatters intelligent event management suite, our event flood control feature is now available by default for all users, reducing the amount of unnecessary noise and unwelcome disruptions during major incidents. The interface acts as a command center for authorized users to create, modify, and apply flood control rules for each event source.
So what does that mean in day-to-day reality? Well, let’s take a look at a real-life example of what event flood control can do when something goes haywire and generates an event storm in your system. Again, this isn’t an urban legend – we’re not making these numbers up. We’ve removed or altered some identifying details to avoid accidentally outing the victim, but this actually happened.
The calm before the storm
Early on an otherwise unremarkable Tuesday morning, before anyone had even finished their first cup of coffee, a core microservice in a service mesh somewhere had a little hiccup. The hiccup was picked up by a monitoring tool, which created an incident and injected it into xMatters. So far, so good.
While xMatters was processing the event and generating and sending notifications, the original hiccup resulted in something of a chain reaction back in the service infrastructure. It turns out the microservice was on a critical path for a host of other services, and suddenly everything started blowing up. The monitoring tools started firing events into xMatters at an overwhelming pace: as many as 100 events per minute.
To keep what happened in perspective, let’s look at how the storm affected just one team – in this case, the one tasked with maintaining the service that turned out to be at the root of the problem.
In the 60 seconds following the first event, the monitoring tool detected another 13 individual incidents, all related to the original hiccup. Each of these errors resulted in the monitoring tool creating and injecting another event for the same team into xMatters. This meant that the people who had been notified about the problem and were starting to investigate it got another 13 notifications about essentially the same issue.
Then, as suddenly as it started, everything went quiet. For about 11 minutes, anyway. And that’s when all hell broke loose.
Thirteen minutes after the hiccup, the monitoring tool injected 45 events targeting the same group into xMatters. The minute after that, another 63. The following minute, another 45. That’s more than 150 events in just three minutes! And over the next 25 minutes, the monitoring tool sent another 579 separate incidents into xMatters before going quiet again.
Can you imagine trying to put together an incident response team, start an investigation, and resolve an issue when your phone is pinging you every two-and-a-half seconds? (Or maybe you don’t have to imagine because this sort of event flood has already happened to you.)
Oh, by the way: the lull this time only provided eight minutes of respite before it all went crazy-go-nuts again. Over the next 34 minutes, xMatters received another 746 event injections, at which point the team managed to fail over or bypass the problematic middleware (or just shut off the monitoring tool, we’re not sure).
All told, the monitoring tool sent 1,492 events targeting the same group into xMatters over the course of 79 minutes, or 60 minutes not including the two brief lulls. That’s almost 25 notifications every minute for an hour. And that was just the one team – across all teams who responded to the incident, there were more than 4,400 events! That’s a lot of noise to deal with when you’re already trying to find and fix a problem.
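For anyone who wants to check the math, here is a quick Python sketch that adds up the per-phase counts quoted above for the one team. The phase labels and variable names are ours, made up for illustration; the numbers come straight from the timeline.

```python
# Event counts for the one team, per phase of the storm (from the timeline above).
# The phase labels are illustrative names, not anything xMatters reports.
phases = {
    "original hiccup": 1,
    "first minute of follow-up incidents": 13,
    "three-minute spike": 45 + 63 + 45,
    "following 25 minutes": 579,
    "final 34 minutes": 746,
}

total_events = sum(phases.values())

# The article counts 60 active minutes once the two lulls are excluded.
active_minutes = 60
events_per_minute = total_events / active_minutes
seconds_between_events = 60 / events_per_minute

print(total_events, round(events_per_minute, 1), round(seconds_between_events, 1))
```

Run as written, this prints `1492 24.9 2.4`: just under 25 events a minute, or a new ping roughly every two-and-a-half seconds.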
| | No flood control | Default rate filter |
|---|---|---|
| Events to the same group | 1,492 | 28 |
| Processing delay for new events | Up to 1.5 hours | None |
A couple of important things to note
- First, the system was working as intended. The monitoring tool is supposed to notice when things go wrong and send information into xMatters. In turn, xMatters is supposed to notify the intended recipients about the issues. The only problem was that the underlying issue snowballed, causing multiple systems to fail. (With event flood control, xMatters can correlate and suppress these similar events so teams can focus on the information they need to resolve the actual problem.)
- If your xMatters instance isn’t licensed to process 25 events in a minute, those events are going to start backing up in the processing queues. And what if something else important happens while those events are waiting to be processed? If you’re on our most popular “Base” plan, you can process 15 event requests per minute. Some quick and dirty math will tell you that your system would be processing events for more than an hour and a half before it got through the backlog – for just the one group! (This is yet another area where event flood control helps extend the value of xMatters – by correlating and suppressing related events, the number of events that need to be processed is reduced dramatically.)
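The quick and dirty math in that second point can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: it assumes a fixed processing rate and ignores when each event actually arrived, and the function and variable names are hypothetical.

```python
import math


def minutes_to_process(event_count: int, events_per_minute: int) -> float:
    """Minutes needed to work through event_count events at a fixed rate."""
    return event_count / events_per_minute


storm_events = 1492   # events sent to the one team during the storm
base_plan_rate = 15   # event requests per minute on the "Base" plan

minutes = minutes_to_process(storm_events, base_plan_rate)
hours, remainder = divmod(math.ceil(minutes), 60)

print(f"{minutes:.1f} minutes, or about {hours} h {remainder} min")
```

That works out to 99.5 minutes, or about 1 h 40 min of processing for the one group.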
Keep the storm at bay
Now, hypothetically speaking, let’s see how that storm would have played out with even just the most basic event flood control enabled. As per the default settings, event flood control will suppress correlated events if more than four events are injected in a minute from the same integration targeting the same recipients. xMatters will also send an update reminder if the flood continues for fifteen minutes or reaches 1,000 events, whichever comes first.
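To make the default rule a little more concrete, here is a minimal Python sketch of a rate filter along those lines. To be clear, this is not how xMatters implements flood control. The `FloodFilter` class, its `observe` method, and the choice of a sliding 60-second window are all assumptions made for illustration. The only parts taken from the description above are the threshold (more than four events in a minute) and the correlation key (same integration, same recipients).

```python
from collections import defaultdict, deque
from collections.abc import Iterable


class FloodFilter:
    """Hypothetical rate filter: suppress an event when four or more events with
    the same integration and recipients already arrived in the past minute."""

    def __init__(self, max_events: int = 4, window_seconds: float = 60.0) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        # Arrival times of every recent event (processed or suppressed), per key.
        self._arrivals: dict[tuple, deque] = defaultdict(deque)

    def observe(self, integration: str, recipients: Iterable[str], timestamp: float) -> bool:
        """Record one incoming event; return True if it should be suppressed."""
        key = (integration, frozenset(recipients))
        arrivals = self._arrivals[key]
        # Forget arrivals that have aged out of the sliding window.
        while arrivals and timestamp - arrivals[0] >= self.window_seconds:
            arrivals.popleft()
        suppress = len(arrivals) >= self.max_events
        arrivals.append(timestamp)
        return suppress


# Thirteen events, one second apart, from one integration to one team.
flood_filter = FloodFilter()
burst = [
    flood_filter.observe("monitoring-tool", ["service-team"], 60.0 + second)
    for second in range(13)
]
print(burst.count(False), "processed,", burst.count(True), "suppressed")
```

For that burst the sketch prints `4 processed, 9 suppressed`. Because suppressed events still count toward the rate, the filter keeps suppressing for as long as events continue to arrive faster than four a minute, and lets events through again once the rate drops.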
Just for ease of tracking the timeline, let’s say the first hiccup occurs at 7:00 AM exactly. The monitoring tool detects the problem, creates an incident, and sends it along. xMatters processes the event and sends out the notifications. Smooth sailing.
At 7:01 on this fine Tuesday, the monitoring tool detects some more issues and sends 13 events to xMatters in quick succession. Event flood control kicks in when the fifth event shows up, so xMatters only processes the first four events, and suppresses the other nine (“stacking” them underneath the fourth event, which becomes the parent). xMatters then sends a notification to the intended recipients, letting them know that a flood is happening, and continues to monitor the rate of incoming events.
At this point, the service team has dealt with a total of six notifications instead of 14, and one of those six was the alert letting them know that an event flood had been detected. The first lull now occurs, and the flood stalls with no further events injected.
Now the fun begins!
At 7:13, the monitoring tool injects 45 events into xMatters. Once again, the event flood control kicks in when the fifth event arrives, and only four events are queued for processing. xMatters sends another “event flood detected” alert and starts stacking up incoming events underneath the new parent as the flood really gets going.
By 7:28, xMatters has suppressed 347 events. Even more requests are incoming, however, so xMatters sends a reminder to the recipients targeted by the parent event to let them know that the flood is ongoing and continues to suppress.
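The reminder at 7:28 follows from the default rule: fifteen minutes have passed since the flood began at 7:13. A rough Python sketch of that check might look like the following. The function name and constants are hypothetical, the date is arbitrary, and we are assuming the 1,000-event threshold counts events within the current flood.

```python
from datetime import datetime, timedelta

# Default thresholds as described in this post; the names are illustrative.
REMINDER_AFTER = timedelta(minutes=15)
REMINDER_EVENT_COUNT = 1000


def reminder_due(flood_started: datetime, now: datetime, flood_event_count: int) -> bool:
    """True once the flood has lasted 15 minutes or reached 1,000 events."""
    lasted_long_enough = (now - flood_started) >= REMINDER_AFTER
    enough_events = flood_event_count >= REMINDER_EVENT_COUNT
    return lasted_long_enough or enough_events


# Arbitrary date; only the times of day matter for this example.
flood_started = datetime(2024, 1, 2, 7, 13)

print(reminder_due(flood_started, datetime(2024, 1, 2, 7, 20), 200))  # too early, too few
print(reminder_due(flood_started, datetime(2024, 1, 2, 7, 28), 347))  # 15 minutes elapsed
```

The first call prints `False` and the second prints `True`, matching the point in the timeline where the time threshold is hit long before the event-count threshold.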
By the time the second lull hits at around 7:39, the parent event lists 724 events on the Suppression tab of its Tracking report.
Of course, the flood starts all over again eight minutes later, so xMatters processes another four events, detects a flood on the fifth event, and promptly starts suppressing. The new parent event starts getting its own stack of suppressed events, and the recipients get an alert letting them know that the flood isn’t quite finished yet.
The flood finally starts petering out 26 minutes later, after another flood reminder and an additional 718 suppressed events. In the last few minutes, events trickle in at a rate of two or three per minute, except for a brief blip of nine events just before the end. (Just to be consistent, flood control kicks in and suppresses five of those, too.)
All told, xMatters creates just 34 events, including the system events that let the team members know that a flood was happening and the “flood continues” reminders. A total of 1,452 events are suppressed. When you further factor in all the other events targeting different teams, you can really get a sense of just how many unnecessary events could be blocked across the entire organization.
That’s a lot of noise, interruptions, and distractions that the teams didn’t have to deal with when they were trying to fix the original problem.
Don’t get caught in the storm.