Components of a Resilient Architecture
This is the third post in a series on resilient architecture, self-healing systems, and ongoing testing. Part one was about the inevitability of IT failures. Part two focused on the steps xMatters took to introduce chaos testing. Part three is about common elements of resilient architecture.
In a previous post, I introduced the concept of chaos engineering and gave a high-level overview of the processes behind it. In this article, I will describe some common elements and patterns of a resilient architecture, explore their value and talk about some considerations to keep in mind while using them.
Even the most resilient services become unavailable from time to time, if only during upgrades. In a distributed system, one unavailable service can trigger a chain reaction that causes dependent services downstream to fail as well.
IT organizations often run redundant instances of a service so that when one fails, a backup takes over without interruption. They often pair redundancy with load balancers so that dependent services don’t have to track which nodes are healthy and available. High availability can also help process a larger volume of requests when the load balancer is configured to distribute requests across all available nodes.
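As a rough illustration of the pairing described above, here is a minimal round-robin balancer sketch that skips nodes marked as down. The class name `RoundRobinBalancer` and its methods are hypothetical, invented for this example; a real load balancer would also run its own health checks.

```python
from itertools import count


class RoundRobinBalancer:
    """Distribute requests across the nodes currently marked healthy.

    Hypothetical sketch: node health is set by the caller, not probed.
    """

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)  # every node starts out healthy
        self._counter = count()

    def mark_down(self, node):
        """Stop routing requests to a failed node."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Resume routing requests to a recovered node."""
        if node in self.nodes:
            self.healthy.add(node)

    def next_node(self):
        """Return the next healthy node, or raise if none is available."""
        candidates = [n for n in self.nodes if n in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy nodes available")
        return candidates[next(self._counter) % len(candidates)]
```

Callers only ever ask for `next_node()`, so they never need to know which replicas are up. That is the decoupling the load balancer provides.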
This is a fairly simple pattern as long as the services are stateless. Services that maintain data, like queues and databases, need a way to keep each node of the cluster synchronized. That topic deserves an article of its own, and plenty has already been written about it elsewhere.
When a service fails in a redundant environment, the dependent services can still function by connecting to a backup. It is still up to the engineers to find out that a service has failed and to replace the broken node so that a full cluster is available should another failure happen. A few different monitoring services can help automate this.
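The failover behavior can be sketched as follows. This is a simplified illustration, not a production client: `call_with_failover` is a hypothetical helper, and plain callables stand in for the service replicas.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response.

    `replicas` is a list of callables standing in for service nodes
    (an assumption for this sketch); each returns a response or raises
    ConnectionError. The indexes of failed replicas are returned too,
    so that someone can be told which nodes need replacing.
    """
    failed = []
    for index, replica in enumerate(replicas):
        try:
            return replica(request), failed
        except ConnectionError:
            failed.append(index)  # remember which replicas need attention
    raise ConnectionError(f"all {len(replicas)} replicas failed")
```

The request succeeds as long as one replica is up, but the list of failed indexes still has to reach a person or a monitoring tool. Otherwise the cluster quietly runs with less redundancy than intended.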
In distributed systems, you must be able to aggregate the logs of all microservices in one location. At xMatters, we use Splunk to achieve that. We defined a series of dashboards to see the state of our system at a glance and built log-based alerts that notify us when certain events (usually errors) occur.
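To make the idea of a log-based alert concrete, here is a small sketch that scans aggregated log records and reports which services crossed an error threshold. It does not use Splunk or any real logging product; the `LogRecord` shape, the `services_to_alert` function, and the threshold are all assumptions made for illustration.

```python
from collections import Counter, namedtuple

# Hypothetical record shape for logs collected from every service.
LogRecord = namedtuple("LogRecord", "service level message")


def services_to_alert(records, threshold=3):
    """Return the services that logged at least `threshold` errors."""
    errors = Counter(r.service for r in records if r.level == "ERROR")
    return sorted(name for name, n in errors.items() if n >= threshold)
```

Because all records sit in one place, a single query like this can cover the whole system without anyone logging into individual nodes.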
Supervisor processes are small services that determine whether a node in a service cluster is healthy. When a supervisor process detects an issue, it can take action to resolve or compensate for it. For example, when your services are deployed as VMs or containers in a cloud infrastructure, the supervisor can issue commands to create new instances and take down defective ones.
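One pass of such a supervisor might look like the sketch below. The function name `reconcile` and its callback parameters (`is_healthy`, `create_node`, `destroy_node`) are hypothetical placeholders for whatever health check and cloud commands a real deployment would use.

```python
def reconcile(nodes, is_healthy, create_node, destroy_node):
    """Replace every unhealthy node and return the resulting node list.

    All three callbacks are assumptions for this sketch:
    is_healthy(node) -> bool, create_node() -> new node,
    destroy_node(node) -> None.
    """
    result = []
    for node in nodes:
        if is_healthy(node):
            result.append(node)
        else:
            result.append(create_node())  # start the replacement first
            destroy_node(node)            # then take the defective node down
    return result
```

Running this on a timer keeps the cluster at full strength without waiting for an engineer to notice the failure.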
Alternatively, if the supervisor detects that the volume of requests is too high for the nodes in a cluster to handle, it can create additional nodes to help with the increased demand and then remove them when the traffic abates.
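The scaling decision can be reduced to a small calculation. The sketch below assumes the supervisor knows the current request rate and a rough per-node capacity; the function name, parameters, and default bounds are illustrative only.

```python
import math


def desired_node_count(requests_per_second, capacity_per_node,
                       min_nodes=2, max_nodes=10):
    """Return how many nodes are needed for the current load.

    The result is clamped so that the cluster never drops below
    `min_nodes` (to keep redundancy) or grows past `max_nodes`
    (to cap cost). Both bounds are assumptions for this sketch.
    """
    needed = math.ceil(requests_per_second / capacity_per_node)
    return max(min_nodes, min(max_nodes, needed))
```

For example, at 450 requests per second with nodes that each handle about 100, the function asks for 5 nodes. When traffic falls back to 120 requests per second, it returns the floor of 2, which signals that the extra nodes can be removed.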
In spite of our best efforts to ensure high availability and self-healing services, sometimes an entire service cluster is unresponsive or has such high latency that it is unusable. When that happens, consuming services can be left hanging for responses, and requests can queue up and saturate the providing service with requests it cannot fulfill. This can cause cascading failures in a system. Wrapping such service calls in circuit breaker code is a way to automatically and temporarily disable part of a system, isolating it until it is repaired.
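A minimal circuit breaker can be sketched as follows. This is one common way to structure the pattern, not a description of how xMatters implements it; the class names, failure threshold, and reset timeout are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Stop calling a failing service until a cool-down period has passed."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on an unhealthy service.
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` and can fall back to a default response. This keeps them from hanging and stops requests from piling up on the struggling service.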
Getting these supporting services and patterns in place in a distributed system can make a huge difference: it increases the uptime of an environment, limits how often engineers get called in the middle of the night to recover it, and reduces how long they spend investigating issues before they can resolve them.
Before setting all of these up for every component of a system, keep in mind that running high-availability software has a cost. That cost should always be weighed against the amount of downtime that is tolerable.
When used correctly, high-availability software, chaos testing, and self-healing systems can give you a big advantage in customer satisfaction. At xMatters, our focus on resiliency and security is key to being a reliable service for our customers. You can achieve great results for your customers too by following best practices.
When incidents occur (and they will), xMatters uses integrations to get notifications and information into the systems and to the people that need it for fast resolution. See for yourself by using xMatters free.