This is part 1 of a blog series exploring common mistakes and best practices in testing. This week’s blog is about monitoring.
A major part of running a cloud service is seeing whether the system is healthy and performing as expected. Good monitoring should provide the necessary transparency across all aspects of your infrastructure. These might include operating systems, applications, build and deployment pipelines, web traffic, sales pipeline, and so on. Monitoring services allow teams to understand the health of all the components required to deliver your service to clients.
As a cloud service provider, you learn quickly that service downtime impacts your business in many ways. Without the right level of transparency across your technology stack, troubleshooting and investigation during an incident eat up valuable time and resources. That’s why it’s important to employ different techniques and use different levels of monitoring that will not only detect but also prevent issues and help you solve problems faster before they impact your clients.
Usually, monitoring is based on the premise that the application will detect when an error has occurred and generate a message that can be acted on. While this works on a basic level, you’re waiting until something goes wrong before you can act, at which point the error may have already impacted your clients. You need to consider different ways of testing your service in conjunction with the monitors you have in place. There are many forms of service testing, such as integration testing, component testing, black-box testing, white-box testing, and system testing. White-box testing, for example, can help you monitor the internal structures or architecture of your service.
Automation is king!
Many repeatable processes can be automated with the right tools. At xMatters, we prefer automation over manual-driven processes. Automation affords operators and support agents the ability to focus on higher-level tasks instead of running or coordinating groups of commands and ensuring that they worked as expected. Like all code, automations must be maintained and tested constantly to ensure they are reliable and correct when needed.
While automation is not a silver bullet, it can certainly change the way your teams do business when it matters most. The focus it provides is essential to an incident team when resolving incidents or restoring service levels to normal.
Understanding the variables
At xMatters, our testing must account for different environment variables for notification delivery. For instance, email notifications rely on internet connectivity and email relays, while SMS messages rely on the availability of mobile networks. Obviously, these are outside our private cloud infrastructure and out of our control.
Compounding these issues, testing in a non-production or staging environment is entirely different from testing in production. Even when the infrastructure is an exact mirror of production, we have found most clients cannot duplicate the traffic and transaction volume found in production. This makes each production environment unique, which affects the baseline benchmark for tests.
A simplified example would be testing notification delivery and user responses. In a quiet system, the response can be quick because there is little activity and more resources are available for processing. However, in a busy production system the response time may be longer depending on the levels of traffic. This is tricky for monitoring and testing since production systems typically have heavier traffic than testing or staging environments.
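One way to account for this difference is to derive the alert threshold from each environment's own response-time history rather than sharing a single fixed number. The following is a minimal sketch of that idea, not a description of how xMatters implements it. The function names, the sample values, the 95th percentile, and the 1.5x headroom factor are all assumptions chosen for illustration.

```python
from statistics import quantiles


def latency_threshold(baseline_samples, percentile=95, headroom=1.5):
    """Derive an alert threshold (seconds) from an environment's own history.

    percentile and headroom are illustrative defaults, not recommendations.
    """
    # quantiles(..., n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cut_points = quantiles(baseline_samples, n=100, method="inclusive")
    return cut_points[percentile - 1] * headroom


def is_slow(observed_seconds, baseline_samples):
    """Return True when a response time exceeds this environment's threshold."""
    return observed_seconds > latency_threshold(baseline_samples)


# Hypothetical response times (seconds) recorded in each environment.
staging_baseline = [0.8, 0.9, 1.0, 1.1, 1.2, 1.0, 0.9, 1.1]
production_baseline = [2.5, 3.0, 3.4, 2.8, 4.0, 3.6, 3.1, 2.9]

# The same 3-second response is judged against each environment's baseline.
print(is_slow(3.0, staging_baseline))     # True: unusual for a quiet system
print(is_slow(3.0, production_baseline))  # False: normal under production load
```

With this approach, the same observed response time can be an anomaly in staging and perfectly normal in production, which is the situation described above.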
|Testing and Monitoring Recommendations|
|Monitoring:|Employ different techniques to detect and prevent issues|
|Testing:|Test your service in conjunction with the monitors you have in place|
|Automation:|Automate repeatable processes|
|Delivery:|Account for environment variables for notification delivery|
|Service Health:|Exercise your services in different ways to gain a holistic view|
|Transparency:|Be transparent and honest with your customers|
Incorporating system testing
This is not groundbreaking: many other cloud services have provided insight into how they test their services, for example the Chaos Monkey at Netflix. At xMatters, we’ve learned over time that you need to exercise your services in different ways to have a holistic view of service health. We have incorporated many tools that exercise our service in different ways, so we know about issues before they have any business impact on our clients.
For SMS delivery, we’ve implemented a testing solution with a service vendor that provides a global network of real SMS devices. This allows us to test SMS delivery across various regions around the world, and even across different carriers within a single region. As part of this testing, we can measure when a message was sent to the vendor, how long it took to reach the device, and whether the content received matches the original message, among other things.
This information is fed back into our monitoring system, which is then configured to detect various failure conditions such as internal component failures, performance issues, or upstream carrier issues. Moreover, this kind of testing exercises not only our own cloud infrastructure and components, but also the communication networks required to reach end-user devices: full stack testing.
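To make the idea concrete, here is a minimal sketch of how the result of one SMS test could be turned into one of the failure conditions mentioned above. It is not the xMatters implementation or any vendor's API: the `SmsProbeResult` fields, the `classify` function, the category names, and the 60-second delivery limit are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmsProbeResult:
    """One test message sent to a real device (all fields are hypothetical)."""
    region: str
    carrier: str
    sent_text: str
    handed_off_at: Optional[float]  # epoch seconds; None if it never left our system
    received_at: Optional[float]    # epoch seconds; None if the device never got it
    received_text: Optional[str]


def classify(result: SmsProbeResult, max_delivery_seconds: float = 60.0) -> str:
    """Map a probe result to a coarse failure condition for the monitoring system."""
    if result.handed_off_at is None:
        # The message never left our own infrastructure.
        return "internal_failure"
    if result.received_at is None:
        # We handed it off, but it never reached the device.
        return "upstream_carrier_issue"
    if result.received_text != result.sent_text:
        # The message arrived, but the content was altered along the way.
        return "content_mismatch"
    if result.received_at - result.handed_off_at > max_delivery_seconds:
        # Delivered intact, but too slowly.
        return "slow_delivery"
    return "ok"


probe = SmsProbeResult(
    region="eu-west", carrier="ExampleTel", sent_text="Test 1234",
    handed_off_at=1000.0, received_at=1095.0, received_text="Test 1234",
)
print(classify(probe))  # slow_delivery (95 seconds exceeds the 60-second limit)
```

Grouping these classifications by region and carrier would then help separate a problem in one carrier's network from a problem in your own components.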
When things go sideways, it’s important to be transparent and honest. At xMatters, we strive to make sure that we provide the details that matter to our clients, that we are learning from these unfortunate events, and that we can demonstrate we are working towards improving our services for everyone. We understand that clients want to know the details and they deserve to know.
Being honest is much easier when you can demonstrate that you responded to an issue quickly and responsibly. Responding appropriately requires planning and processes. Demonstrating that you responded appropriately requires preserving issues and conversations for postmortems.
In a 2017 survey of more than 1,000 DevOps organizations, half of all respondents said they lack a consistent process for responding to a major incident. The greatest delay is the time a ticket sits in the queue before an engineer touches it. You want to make sure you resolve the issue before a customer reports it. This is the essence of proactive customer service.
Coming next: Privacy Compliance