Assessing Software Services for Resiliency with Chaos Testing
This is the second of a series on resilience, self-healing systems, and ongoing testing. Part one was about the inevitability of IT failures. This blog focuses on the steps xMatters took to introduce chaos testing.
As engineering teams build large-scale, distributed services that run on cloud infrastructure, it has become imperative that those services be resilient to inevitable failures. This makes testing for "resiliency" a crucial step in software engineering: its purpose is to build confidence that systems are designed to withstand and recover from failures. At xMatters, we started looking at the principles of chaos engineering and how we could adopt chaos testing within our engineering department to assure our services can handle turbulent conditions without impacting the SLAs we offer our clients.
To facilitate such tests, we took four steps:
1. Defining “Steady State” Behavior
We first simulated production-like traffic in our test infrastructure, setting up environments that closely matched production in requests per second and types of requests. We built an in-house load testing application based on Locust (a modern, open-source load-generating framework) to generate a swarm of HTTP requests targeting various services in our test environments. This traffic simulation tool can be started and stopped on demand and constitutes the foundation of our resilience testing platform. By putting our test environments in a state of steady, production-like traffic, we establish a "steady state" baseline that serves as a reference when measuring the impact of induced failures.
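To make the idea of a baseline concrete, here is a minimal sketch of how steady-state traffic samples might be summarized into reference metrics. The metric names and the choice of p95 latency and error rate are illustrative assumptions, not the actual metrics xMatters tracks.

```python
# Sketch: summarize steady-state traffic into a reference baseline.
# Assumes we collect per-request latency (ms) and a success flag from
# the load generator; names and metrics here are illustrative.
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Baseline:
    p95_latency_ms: float
    error_rate: float


def compute_baseline(latencies_ms, successes):
    """Reduce raw steady-state samples to comparison-ready metrics."""
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    p95 = quantiles(latencies_ms, n=20)[18]
    error_rate = 1 - sum(successes) / len(successes)
    return Baseline(p95_latency_ms=p95, error_rate=error_rate)
```

A baseline like this, captured before any failures are injected, becomes the yardstick for the comparison described in step 4.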
2. Monitoring Services
We defined dashboards to monitor the state of our software services and provide statistics on the traffic going through them. These dashboards also alert engineering teams when a failure prevents services from meeting their service level agreements (SLAs). In addition, we have early-warning alerts that notify teams when there is an increased probability of service degradation that could impact SLAs.
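The two alert tiers described above might be sketched as follows. The threshold values and the `notify` callback are hypothetical, purely to illustrate the warn-before-breach pattern.

```python
# Sketch of two-tier alerting: a "warning" level for early signs of
# degradation and a "critical" level for SLA breaches. Thresholds and
# the notify callback are hypothetical examples.
def check_error_rate(error_rate, notify, warn=0.01, critical=0.05):
    """Return the alert level raised for the observed error rate."""
    if error_rate >= critical:
        notify("critical", f"SLA at risk: error rate {error_rate:.2%}")
        return "critical"
    if error_rate >= warn:
        notify("warning", f"Degradation likely: error rate {error_rate:.2%}")
        return "warning"
    return "ok"
```

Keeping the warning threshold well below the SLA-breaking one gives teams time to react before clients are affected.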
3. Simulating Failure Scenarios
As mentioned in the first article of this series, we developed Cthulhu (our in-house chaos testing tool) to facilitate the introduction of failures within the services of our cloud-based infrastructure. The tool executes failure scenarios, simulating events like the untimely shutdown of services or services stuck in a deadlock (done by pausing the application's process). Other scenarios aim to test the limits of our services; for example: given that Service A is unable to restart, how long do we have before clients are impacted?
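A chaos tool of this kind can be thought of as a registry of named failure scenarios. The sketch below is not Cthulhu's actual design; the scenario name and the use of SIGSTOP/SIGCONT to freeze a process (mimicking a deadlocked service, as the pausing technique above suggests) are illustrative assumptions.

```python
# Minimal sketch of a chaos tool's scenario registry. SIGSTOP freezes a
# process without killing it, which looks like a deadlock to the outside
# world; SIGCONT resumes it. Names here are hypothetical.
import os
import signal
import time

SCENARIOS = {}


def scenario(name):
    """Decorator registering a named failure scenario."""
    def register(fn):
        SCENARIOS[name] = fn
        return fn
    return register


@scenario("pause-process")
def pause_process(pid, seconds):
    """Freeze a service's process for a while, simulating a deadlock."""
    os.kill(pid, signal.SIGSTOP)
    time.sleep(seconds)
    os.kill(pid, signal.SIGCONT)
```

A registry makes it straightforward to later pick scenarios at random or run them on a schedule, as the closing paragraph anticipates.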
When chaos scenarios are running, we notify the engineering teams without revealing the nature of the failure. Engineers can then correlate the chaos test events with the subsequent recovery or failure alerts within their service to determine whether the service behaved as expected.
4. Analyzing the Differences in Service Behavior
As we monitor the impact of executing chaos test scenarios on our services, we can compare it with what we know to be normal, steady-state behavior (prior to any induced failures).
Because our system is made up of distributed services, each managed by a different engineering team, we expect alerts to be triggered and to notify the right team when anomalies are detected or the steady state of its service is compromised, for example, a failure to process requests or a service becoming unavailable.
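The comparison and routing described above can be sketched as a simple check of observed metrics against the steady-state baseline. The service names, ownership map, and latency tolerance below are hypothetical placeholders.

```python
# Sketch: flag services whose chaos-run metrics depart from the
# steady-state baseline and route each anomaly to the owning team.
# Service names, the ownership map, and tolerances are hypothetical.
SERVICE_OWNERS = {"notification-api": "team-alpha", "scheduler": "team-beta"}


def detect_anomalies(baseline, observed, latency_tolerance=1.5):
    """Return (service, owner) pairs whose behavior departed from baseline."""
    anomalies = []
    for service, metrics in observed.items():
        base = baseline[service]
        degraded = (
            metrics["p95_ms"] > base["p95_ms"] * latency_tolerance
            or metrics["error_rate"] > base["error_rate"]
        )
        if degraded:
            anomalies.append((service, SERVICE_OWNERS.get(service, "on-call")))
    return anomalies
```

Routing each anomaly to the team that owns the affected service keeps the feedback loop short when a chaos scenario exposes a weakness.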
All the above steps are based on a core principle: the harder it is to disturb the steady-state behavior, the higher our confidence in the resiliency of our system and in meeting the service level agreements we made with our customers.
Chaos testing is a powerful practice to test the resiliency of software services; but because of its nature, it can have severe consequences if it’s used carelessly on an unprepared environment. We must always be aware of the potential impacts and ensure that the effects are contained to minimize disruption to our valued customers.
As we are in the early adoption phase of this practice, many of these steps are currently performed only at small scale, within our test infrastructure. These tests have enabled us to improve the resiliency of our services by detecting deficiencies earlier in the development cycle, before the rollout to production. In the near future, we plan to introduce randomness in selecting failure scenarios and to execute such tests automatically on a schedule, matching the continuous evolution of our cloud-based software.