Test in Production to Make Code Releases Safer
This is part 3 of a blog series exploring common mistakes and best practices in testing. This week’s blog is about how to test in production.
As a software release graduates from development to test, staging, and production environments, it undergoes various stages of testing. A release candidate from the development environment may undergo daily regression testing. In test, functionality and usability testing may be performed. But as software and its user interactions become more complicated and time sensitive, the rubber really meets the road in only one place: production.
There are many types of testing: feature verification (did we build what we said we would?), integration testing (did the automated tests pass?), usability testing, reliability testing, and of course performance testing. These tests can include taking servers offline and introducing errors and other anomalies to see how the software behaves. However, no matter how closely your testing environments mimic production, there is no greater test than doing it live.
Replicating real-world conditions
Tests in development environments do a great job of assessing the usability and general function of software, but they don’t do a great job of assessing performance in real-world conditions. Traffic and users alike can behave in unexpected ways. Finding out that your software doesn’t behave as expected where it normally lives, in production, is never fun. Capturing, sanitizing, and replaying production traffic is often a non-trivial affair, especially in complex systems with many interactions.
|Types of Testing|Description|
|Feature Verification|Did we build what we said we would?|
|Integration|Testing software modules as a group|
|Reliability|Repeating results to increase likelihood of success|
|Performance|Evaluating product quality|
Most organizations are comfortable talking about features and usability, but uptime and performance have traditionally been treated as another department’s concern. In a DevOps environment, that simply will not do. Uptime and performance are no longer the responsibility of Operations alone.
But how do we resolve this? Capturing traffic and replaying it in test environments is non-trivial, and sanitizing the data can often remove the exact insanity that you’re trying to introduce. This isn’t to suggest that you shouldn’t do these things; you absolutely should. The longer it takes to detect a problem, the more expensive it is to resolve.
By defining SLAs for your software and testing against them as part of the release process, you can catch these problems in your common scenarios and capture supporting data such as metrics, performance statistics, and error rates. Testing as part of the release process should be a challenge to break the software, not just to validate that it still behaves. Inject errors. Take systems offline. Use chaos testing to randomly shut off components, inject network latency, or introduce other unforeseen anomalies. Sooner or later, they’re going to happen in prod.
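As a rough illustration of the error and latency injection described above, here is a minimal Python sketch. Everything in it is hypothetical: `with_chaos`, `InjectedFault`, and the `fetch_price` dependency are made-up names for this example, not part of any particular chaos testing tool.

```python
import random
import time


class InjectedFault(Exception):
    """Raised on purpose, standing in for a real dependency failure."""


def with_chaos(func, error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so each call is delayed and fails with probability error_rate."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulated network latency
        if rng.random() < error_rate:
            raise InjectedFault(f"injected failure calling {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_price(sku):
    # Hypothetical dependency call; a real test would wrap a service client.
    return {"sku": sku, "price": 42}


def run_trial(calls=1000, error_rate=0.2, seed=7):
    """Call the wrapped dependency repeatedly and return the observed failure rate."""
    flaky = with_chaos(
        fetch_price,
        error_rate=error_rate,
        rng=random.Random(seed),   # seeded so a test run is repeatable
        sleep=lambda seconds: None,  # skip real sleeping inside the trial
    )
    failures = 0
    for i in range(calls):
        try:
            flaky(f"sku-{i}")
        except InjectedFault:
            failures += 1
    return failures / calls
```

In a release test, the code under test would call the wrapped function instead of the real one, and the assertions would check that retries, fallbacks, and timeouts behave as intended while faults are being injected.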
Verifying that your software meets its SLAs before release builds the confidence to go one step further and test in production.
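One way to picture an SLA check in the release process is a small gate that compares measured error rate and latency against agreed thresholds. The sketch below is only an assumption about what such a gate might look like: the `Sla` fields, the choice of 95th percentile latency, and the function names are illustrative, not a standard.

```python
import math
from dataclasses import dataclass


@dataclass
class Sla:
    max_error_rate: float       # allowed fraction of failed requests, e.g. 0.01
    max_p95_latency_ms: float   # allowed 95th percentile latency


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(latencies_ms, error_count, sla):
    """Return a list of violations; an empty list means the gate passes."""
    total = len(latencies_ms)
    if total == 0:
        return ["no requests recorded"]

    violations = []
    error_rate = error_count / total
    if error_rate > sla.max_error_rate:
        violations.append(
            f"error rate {error_rate:.3f} exceeds {sla.max_error_rate:.3f}"
        )
    p95 = percentile(latencies_ms, 95)
    if p95 > sla.max_p95_latency_ms:
        violations.append(
            f"p95 latency {p95} ms exceeds {sla.max_p95_latency_ms} ms"
        )
    return violations
```

A release pipeline could run a load or regression test, feed the recorded latencies and error count into `check_sla`, and block the release whenever the returned list is not empty.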
Production testing to increase safety
This doesn’t mean skipping testing and letting production find the bugs (known as unintentional testing), but rather using production to increase the safety of your release through proven strategies. Red/black deployments and slow rollouts (canary releases) reduce risk by allowing you to test with real users and real data. If you see an increase in errors, you can immediately roll back. Good monitoring and metrics are key. You can also let software age for a few days to see how it performs over time before exposing more users to it.
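The canary logic described above can be sketched as a simple decision function: compare the canary’s error rate with the baseline and either wait for more data, roll back, or widen the rollout. The thresholds, traffic steps, and function names below are assumptions chosen for illustration; real values depend on your traffic and your monitoring.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    min_requests=500, tolerance=0.005):
    """Return 'wait', 'rollback', or 'promote' for the current canary."""
    if canary_requests < min_requests:
        return "wait"  # not enough real traffic yet to judge the release

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Roll back when the canary is clearly worse than the existing release.
    if canary_rate > baseline_rate + tolerance:
        return "rollback"
    return "promote"


def next_traffic_share(current_share, decision, steps=(1, 5, 25, 50, 100)):
    """Return the percentage of users the new release should receive next."""
    if decision == "rollback":
        return 0
    if decision == "wait":
        return current_share
    for step in steps:
        if step > current_share:
            return step
    return current_share  # already at the final step
```

For example, a canary serving 5% of users that stays within tolerance of the baseline would move to 25%, while one showing a clear increase in errors would drop to 0%. Holding at a step for a few days before calling `next_traffic_share` again is how the "let it age" approach fits in.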
These strategies further validate the viability of a release in production, and are extremely important when making large architectural changes where the normal characteristics have changed, and ‘gut feel’ or other fuzzy acceptance measures are clearly not good enough.
A purposeful approach to testing in production reduces risk and instills the confidence to make changes, with the ultimate goal being to find problems before your customers do, no matter the circumstance or the change.