IT Failures are Inevitable

As infrastructure stacks grow more complex and depend on an ever-growing number of services, system failures are becoming more common. Systems fail for a variety of reasons: software bugs, misconfiguration, interactions between services that cause unexpected behavior, network outages, and, on rare occasions, natural events that render data centers inoperative. In the face of these threats, companies are moving toward a microservices architecture for increased service availability and resilience to failure.

In a monolithic application, a single error can bring down the entire application. A microservices architecture, by contrast, consists of smaller, independently deployable units, so the same error need not affect the entire system. But there is a trade-off: more moving parts mean that more things can go wrong. And if more things go wrong, more time is needed to recover those services. That doesn’t sound very appealing.

At xMatters, we take a proactive approach to handling failure. We instigate failures regularly in our infrastructure to ensure that our systems can handle the chaos without disrupting services to our customers. Here is a high-level overview of our approach.

Soak Testing

Building a fully functional software application is important, but how it performs is an equally important challenge. Performance testing such as soak testing is critical for evaluating how well our system performs under a significant load for an extended period. Doing so ensures that the changes we make to our services support both the average load we see in production and the request peaks we get regularly. We know with confidence that our product can handle twice that load without having to scale up.
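To make the idea concrete, here is a minimal sketch of a soak check in Python. It is an illustration only, not the tooling we run: the function names `soak` and `degraded`, the window size, and the 1.5x tolerance are all assumptions chosen for the example. The idea is to record latencies over a long run and compare the end of the run with the beginning, since a soak test is mostly about spotting slow degradation such as leaks.

```python
import time
from statistics import mean
from typing import Callable, List


def soak(call: Callable[[], None], requests: int) -> List[float]:
    """Invoke `call` repeatedly and record each call's latency in seconds."""
    latencies = []
    for _ in range(requests):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def degraded(latencies: List[float], window: int, tolerance: float = 1.5) -> bool:
    """Return True if the last `window` calls are, on average, slower than
    `tolerance` times the first `window` calls."""
    if len(latencies) < 2 * window:
        raise ValueError("need at least two full windows of samples")
    baseline = mean(latencies[:window])   # behavior at the start of the run
    tail = mean(latencies[-window:])      # behavior at the end of the run
    return tail > tolerance * baseline
```

In a real soak test, `call` would issue a request against the service under test and the run would last hours rather than a fixed number of iterations.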

Automating the Chaos

Once we establish a baseline for how our services behave under load, the next step is to introduce failure events and observe what happens. As with all software testing, it is best to have automated tools that can reproduce scenarios quickly and effortlessly. In addition to verifying fixes and changes to the services, this allows us to run random failure scenarios on a schedule in any environment.

According to Netflix, “The best defense against major unexpected failures is to fail often.” This was the motivation behind Chaos Monkey, one of Netflix’s first chaos testing tools, which is now available on GitHub. The last few years have also seen the birth of several other tools for resilience testing. After reviewing a number of them, we opted to create our own tool, named Cthulhu after H.P. Lovecraft’s cosmic entity, known for driving anything it interacts with to the brink of insanity. We built our own tool mainly so that we could coordinate complex failure scenarios that impact different infrastructure technologies in a data-driven manner.
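As a rough illustration of what "data-driven" can mean here, the Python sketch below describes a failure scenario as plain data and dispatches each step to an injector for the matching technology. This is not Cthulhu's actual format or API: `FailureStep`, `INJECTORS`, the technology and action labels, and the service names are all hypothetical, and the injectors only return a log line instead of touching real infrastructure.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class FailureStep:
    technology: str  # hypothetical label, e.g. "container" or "network"
    action: str      # hypothetical action name, e.g. "kill" or "drop"
    target: str      # hypothetical service name


# Registry mapping (technology, action) to the function that injects the failure.
# Real injectors would call an orchestrator or network tool; these only log.
INJECTORS: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("container", "kill"): lambda target: f"killed container for {target}",
    ("network", "drop"): lambda target: f"dropped traffic to {target}",
}


def run_scenario(steps: List[FailureStep]) -> List[str]:
    """Apply each step in order and return a log of what was injected."""
    log = []
    for step in steps:
        injector = INJECTORS[(step.technology, step.action)]
        log.append(injector(step.target))
    return log


def pick_random_scenario(scenarios: Dict[str, List[FailureStep]], seed: int) -> str:
    """Choose a scenario name reproducibly, so a scheduled random run can be replayed."""
    return random.Random(seed).choice(sorted(scenarios))
```

Keeping scenarios as data means a new failure combination is a new entry rather than new code, and seeding the random choice lets a scheduled run be repeated exactly when it uncovers a problem.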

Toward Automation

There are advantages to chaos testing that go beyond asserting that systems are recovering automatically, particularly while a system is in transition toward a fully automated recovery strategy. Failure scenarios expose ways that the system can fail. Before self-healing logic is in place, the engineering team has to recover the services manually. This allows confirmation that the engineering team can recover the system within SLAs. It also serves as a drill for engineers to remain up to date as to how to keep their ever-changing system running.
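One way to turn such a drill into a measurable check is to time how long the service takes to report healthy again after a failure is injected, then compare that with the SLA. The Python sketch below assumes a hypothetical `is_healthy` probe and hypothetical timeout and SLA values; the clock and sleep functions are injectable so the logic can be exercised without waiting in real time.

```python
import time
from typing import Callable, Optional


def time_to_recover(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `is_healthy` until it returns True.

    Returns the elapsed seconds at the first healthy check, or None if the
    service is still unhealthy once `timeout_s` has elapsed.
    """
    start = clock()
    while True:
        healthy = is_healthy()
        elapsed = clock() - start
        if healthy:
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_interval_s)


def within_sla(recovery_s: Optional[float], sla_s: float) -> bool:
    """A drill passes only if the service recovered, and did so within the SLA."""
    return recovery_s is not None and recovery_s <= sla_s
```

The same measurement works whether recovery is manual or automated, which makes it a convenient way to track progress as self-healing logic is introduced.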

Wrapping Up

Since its inception, Cthulhu has allowed us to improve the design of many of our services. Among other things, we found and addressed cases where services didn’t get updates when downstream services changed. We also found areas where monitoring was missing. This tool has become an integral part of our development cycle.

But before you start running intentional chaos, you first need to stabilize your system from real chaos. Having a service reliability platform like xMatters allows your teams to keep services running and automate incident response with highly configurable, low-code workflows. Sign up for a free xMatters instance today to learn more!