Let’s Talk Feature Toggles – A Product Manager’s Perspective
Humans love the satisfying simplicity of toggle switches in the physical world. It’s a bit more complicated in the digital realm, but xMatters product managers (PMs) depend on feature toggles to manage the availability of specific functionality in our online service. Here’s a look at why we use feature toggles, our toggling process, and what we’ve learned working with them for over four years.
In this age of digital transformation, we’re all being pushed to deliver at record speed. Our recent survey of IT professionals found that 77% of respondents reported an increased release rate, with 54% releasing at least weekly. Like our customers, we’ve followed a similar path, investing in release automation that has us deploying multiple times a day.
This increased deployment frequency is awesome for killing bugs fast, but it’s too fast for introducing changes to customers. Imagine seeing interface changes every day, or even multiple times a day!
Time to toggle
We implemented a basic feature toggle management system in mid-2015 with the goal of insulating customers from our internal operations. That system empowered our PMs to turn a feature on or off based on a set release schedule.
From that simple start, the system and its processes have evolved to support toggling hundreds of features for thousands of customers across cloud hosting providers. We developed multiple iterations of the original system, and then replaced it with a rewrite that incorporated everything we’d learned. I’ll take you through the knowledge we gained along the way, including some pleasant surprises up front.
Even better than we thought
We started out trying to shield customers from disruptive continuous deployment updates. Feature toggles allow us to decouple features from releases so we can enable customer-impacting features at a time of our choosing, rather than immediately with every deployment. That allows us to give customers a heads up and provide the necessary help, training, and support.
And, to our delight, we found that feature toggles solve many other problems:
- MVP: Large features composed of multiple smaller features can be worked on by multiple teams and delivered across multiple deployments. That makes toggles a boon for the Minimum Viable Product (MVP) process.
- Customer data: Features can be validated in production environments against customer data. It’s nearly impossible to artificially replicate the dynamic behavior of production content, so PMs love to walk customers through a beta feature with their own data!
- Rollback: New features that run into issues in production can be ‘rolled back’ (i.e., toggled off) without actually rolling back the underlying code. That makes it easier to diagnose whether issues are occurring only in production. Prior to feature toggles, a deployment rollback would often resolve the offending issue but also remove other features or bug fixes that PMs wanted released (and that makes PMs sad). As a bonus, our engineers get to talk about advanced processes like canary releases and blue-green deployments (I suspect they like to talk about those as much as PMs like to talk about MVPs).
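To make the decoupling idea concrete, here is a minimal sketch of a toggle check in Python. The toggle name `new_dashboard`, the in-memory dictionary, and both function names are assumptions made for illustration; they are not the xMatters implementation, which would read toggle state from a managed service.

```python
def is_enabled(toggles: dict, name: str) -> bool:
    """Return the toggle state; unknown toggles default to off."""
    return toggles.get(name, False)


def render_dashboard(toggles: dict) -> str:
    # Both code paths ship in the same deployment.
    # The toggle, not the deployment, decides which one customers see.
    if is_enabled(toggles, "new_dashboard"):
        return "new dashboard"
    return "classic dashboard"


# Hypothetical toggle state: the feature is deployed but not yet released.
toggles = {"new_dashboard": False}
before_release = render_dashboard(toggles)

# Releasing the feature is a state change, not a deployment.
toggles["new_dashboard"] = True
after_release = render_dashboard(toggles)

# A 'rollback' is the reverse state change; the code stays deployed.
toggles["new_dashboard"] = False
after_rollback = render_dashboard(toggles)
```

Defaulting unknown toggles to off is one possible design choice: a feature whose toggle has not been registered yet stays hidden rather than appearing by accident.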
The nuances of how to implement and use feature toggles will vary from organization to organization, so I’ll caveat this section with this: your mileage may vary.
Feature toggles ain’t free
It might not be obvious, but any branching of code logic has a cost. The cost of implementing and removing a feature toggle is small, but the associated testing costs are huge. Teams must test the process at every iteration, from the creation of a feature toggle with appropriate operation in both states to the removal of the feature toggle when the behavior becomes the new normal.
Maybe this doesn’t sound too bad, but are you prepared to test for all combinations of all available feature toggles? That represents a daunting number of test cases, so you need to balance the value versus cost based on your current levels of testing automation. As you’ll read in the next two sections, we decided to limit feature toggle use so the testing burden doesn’t become overwhelming.
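To put a number on that: each independent on/off toggle doubles the number of possible states, so n toggles produce 2^n combinations. The short Python sketch below enumerates them; the toggle names are made up for illustration.

```python
from itertools import product


def toggle_combinations(toggle_names):
    """Yield every on/off assignment for the given toggles."""
    for states in product([False, True], repeat=len(toggle_names)):
        yield dict(zip(toggle_names, states))


def combination_count(toggle_total: int) -> int:
    """Number of distinct states for independent boolean toggles."""
    return 2 ** toggle_total


# Three hypothetical toggles already mean eight states to test.
three_toggle_states = list(toggle_combinations(["search_v2", "new_nav", "dark_mode"]))

# Ten active toggles give 1,024 states; twenty give 1,048,576.
ten = combination_count(10)
twenty = combination_count(20)
```

In practice many toggles do not interact, so exhaustive testing is rarely the goal, but the growth rate shows why keeping the number of live toggles small pays off.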
Feature toggles should have a fixed lifetime
We found we were using toggles for two key purposes:
- Release toggles support decoupling features from deployments
- Longer-term configurations enable customers to choose between the two behaviors the feature toggle controlled
As noted above, feature toggles allow us to decouple features from releases so we can enable customer-impacting features at a time that makes sense.
Given the high cost of testing, we decided to remove ‘release’ feature toggles as part of our process of preparing for the next release. Our current release (not deployment) cadence is quarterly, which means that during each quarter we want the previous release’s feature toggles removed, along with any of the crufty code they’re bypassing. That helps keep our codebase cleaner and makes our feature toggle system simpler to operate, with fewer entries to manage and a reasonable number of test cases.
That covers the first purpose, but longer-term customer configuration is not something we actually planned to use toggles for. Our feature toggles aren’t meant to configure a customer instance… they’re meant to be release toggles. However, as we refined our process for delivering our digital service, it became clear that there will be cases where a new behavior just isn’t a proper replacement because both feature toggle ‘paths’ are required, or it’s still short of an MVP for some customers.
Either of these conditions triggers a PM process to move the feature toggle to an in-product configuration or to iterate on the feature until there are no customers requiring exceptions. This allows us to remove the feature toggle and ultimately achieves our goal of limiting feature toggles.
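The clean-up rule in this section can be sketched as a simple report. The example below assumes each toggle records its kind and the release that introduced it. The `Toggle` fields, the `"release"` and `"config"` labels, and the sequential release numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toggle:
    name: str
    kind: str            # "release" or "config" (hypothetical labels)
    introduced_in: int   # sequential release number (hypothetical)


def overdue_release_toggles(toggles, current_release: int):
    """Names of release toggles left over from an earlier release."""
    return [
        t.name
        for t in toggles
        if t.kind == "release" and t.introduced_in < current_release
    ]


toggles = [
    Toggle("new_dashboard", "release", introduced_in=7),
    Toggle("bulk_import", "release", introduced_in=8),
    # Promoted to a long-term configuration, so it is exempt from clean-up.
    Toggle("legacy_report_layout", "config", introduced_in=5),
]

# While working on release 8, release 7's toggles are due for removal.
overdue = overdue_release_toggles(toggles, current_release=8)
```

A report like this could run as part of release preparation, so leftover toggles are flagged rather than forgotten.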
Feature toggles aren’t required for everything
As our survey results show, we’re all being pushed to bring ever-faster changes to our digital services. But that’s at odds with our stated requirement of holding back features to coincide with a release. So how do we balance the ‘need for speed’ with the competing requirement for proper rollout training? While there’s no right answer, we established some guidelines to help support judgment calls…
Feature toggles are required for:
- Features that substantially change the user experience, allowing us to provide change notices and training materials
- Anything that changes the processing flow, allowing us to roll back processing changes without rolling back the actual code
- Any feature with a known risk of needing a rollback
- Any feature where the MVP will require multiple deployments
Feature toggles are not required for:
- Features in admin-type screens or our workflow design screens. These screens allow admins and developers to perform their own feature rollouts, so we want these to go out as soon as they’re ready. (Note that features that hit the ‘known risk’ or ‘MVP’ guidelines above are still behind feature toggles even if they’re in the admin or workflow screens.)
All other features are a judgment call based on the risk (is it something that might need rolling back?) and perceived impact on core application behavior (will users be confused because you moved their cheese?).
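One way to read these guidelines is as an ordered set of checks. The sketch below is an interpretation rather than an official rule set: the function name, parameter names, and return strings are all assumptions, and the order of the checks reflects the note that the ‘known risk’ and ‘MVP’ guidelines still apply in admin and workflow screens.

```python
def toggle_decision(
    *,
    known_rollback_risk: bool = False,
    multi_deployment_mvp: bool = False,
    admin_or_workflow_screen: bool = False,
    substantial_ux_change: bool = False,
    changes_processing_flow: bool = False,
) -> str:
    """Classify a feature as 'required', 'not required', or 'judgment call'."""
    # These two guidelines apply everywhere, including admin screens.
    if known_rollback_risk or multi_deployment_mvp:
        return "required"
    # Admin and workflow screens otherwise ship as soon as they are ready.
    if admin_or_workflow_screen:
        return "not required"
    if substantial_ux_change or changes_processing_flow:
        return "required"
    # Everything else is weighed on risk and perceived impact.
    return "judgment call"


redesigned_inbox = toggle_decision(substantial_ux_change=True)
admin_tweak = toggle_decision(admin_or_workflow_screen=True)
risky_admin_change = toggle_decision(
    admin_or_workflow_screen=True, known_rollback_risk=True
)
small_fix = toggle_decision()
```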
Feature toggles aren’t just for the web UI
Initially, our feature toggles were processed only by the web UI. Our mobile apps had technical nuances that required additional work. Ideally, that should be sorted out from the very beginning of feature toggle adoption so the web UI and the mobile clients can respect the same set of feature toggles. But that would have involved implementation delays, so we proceeded with the web UI first and added toggle support to each of the mobile clients as necessary.
Because that wasn’t ideal, our advice is to factor the development effort for every user interface, not just the web UI, into your plan from the start.
Feature toggle organization
Our toggle implementation is hierarchical and supports inheritance and overrides. Again, YMMV, but our hierarchy is global > region > instance type > customer. That supports toggling a feature as follows, starting at the top of the hierarchy:
- global: sets the default value that is inherited down the hierarchy
- region: overrides the default for instances hosted in a specified Google Cloud region. We’ve used this when a feature required specific regional cloud service enablement, or when there was a need to align with regional time zones
- instance type: overrides either of the previous levels by non-production, production, or Early Access Program (EAP) instance types. This level is key to being able to run our EAP, as well as providing our customers two weeks of non-production access to a new release prior to a production release
- customer: overrides all previous levels to target a single customer instance in any region, of any type
As you can see, without hierarchy and inheritance we’d need to manage thousands of toggles with every release!
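The inheritance and override behavior described above can be sketched as a lookup that checks the most specific level first and falls back toward the global default. The `ToggleConfig` class, its field names, and the customer names below are assumptions for illustration and do not reflect the actual xMatters data model.

```python
from dataclasses import dataclass, field


@dataclass
class ToggleConfig:
    default: bool                                  # global level
    region: dict = field(default_factory=dict)         # region name -> value
    instance_type: dict = field(default_factory=dict)  # instance type -> value
    customer: dict = field(default_factory=dict)       # customer name -> value


def resolve(cfg: ToggleConfig, region: str, instance_type: str, customer: str) -> bool:
    """Return the effective toggle value for one customer instance."""
    # Most specific level wins: customer, then instance type, then region.
    for overrides, key in (
        (cfg.customer, customer),
        (cfg.instance_type, instance_type),
        (cfg.region, region),
    ):
        if key in overrides:
            return overrides[key]
    # No override at any level, so inherit the global default.
    return cfg.default


# Off globally, on for all non-production instances, off again for one customer.
cfg = ToggleConfig(
    default=False,
    instance_type={"non-production": True},
    customer={"acme": False},
)

typical_non_prod = resolve(cfg, "us-east1", "non-production", "globex")
typical_prod = resolve(cfg, "us-east1", "production", "globex")
opted_out = resolve(cfg, "us-east1", "non-production", "acme")
```

In this sketch, two override entries cover every non-production instance except one, which is the kind of saving that hierarchy and inheritance provide.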
Security & avoiding sharp objects
If you’re not going to go with a commercial product, ensure the system you’re building can be accessed only by authorized users. It’s obvious that you’ll want to prevent hacking, but another key purpose is to authorize only individuals in your organization who’ve received proper training.
Training isn’t required because it’s difficult to make changes, but rather because authorized togglers need to understand the power and impact of their changes. Well-intentioned employees can accidentally break customer environments, give away licensed features, or cause system-level issues. (As Uncle Ben said: “With great power comes great responsibility.”)
Make it auditable
Along with authorization, auditing feature toggle state changes is important. Changes to a feature toggle state should be reportable down to the time of the change and the user who made it. This information can play a key part in in-progress incident diagnosis and post-incident root cause analysis, as improper change management is still a leading cause of incidents.
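Authorization and auditing fit together naturally: check who is making a change, then record what changed, when, and by whom. The sketch below is a minimal in-memory illustration. The class names, the allow-list approach, and the user names are assumptions; a real system would likely use its existing identity and logging infrastructure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    toggle: str
    old_value: bool
    new_value: bool
    user: str
    changed_at: datetime


class AuditedToggles:
    def __init__(self, authorized_users):
        self._state = {}
        self._authorized = set(authorized_users)
        self.log = []

    def set(self, toggle: str, value: bool, user: str) -> None:
        # Only trained, authorized users may change toggle state.
        if user not in self._authorized:
            raise PermissionError(f"{user} is not authorized to change toggles")
        old_value = self._state.get(toggle, False)
        self._state[toggle] = value
        # Record who changed what, and when (UTC), for later diagnosis.
        self.log.append(
            AuditEntry(toggle, old_value, value, user, datetime.now(timezone.utc))
        )

    def changes_between(self, start: datetime, end: datetime):
        """Audit entries in a time window, e.g. around an incident."""
        return [e for e in self.log if start <= e.changed_at <= end]


store = AuditedToggles(authorized_users={"pm_alice"})
store.set("new_dashboard", True, user="pm_alice")
```

During an incident, a query such as `changes_between` could answer "which toggles changed in the last hour, and who changed them?" without anyone relying on memory.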
Provide programmatic access
In addition to a slick UI that allows a product owner to set the toggles for our customer base with as few clicks as possible, programmatic access has also proven important. Our deployment automation needs to set feature toggles, so update capabilities must exist. And while hierarchical organization is great for manually manipulating data you know you need to adjust, it makes it difficult to find exceptions. So, you’ll need read capabilities that allow the creation of reports so you can find the instances that override specific toggles.
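As an illustration of the read side, the sketch below lists every override recorded for one toggle across the hierarchy. The nested dictionary layout and all names in it are assumptions chosen for brevity, not a description of a real API.

```python
def find_overrides(overrides: dict, toggle: str):
    """List (level, key, value) for every override of one toggle.

    `overrides` is assumed to be shaped as {toggle: {level: {key: value}}}.
    """
    return sorted(
        (level, key, value)
        for level, entries in overrides.get(toggle, {}).items()
        for key, value in entries.items()
    )


# Hypothetical override data pulled from the toggle system.
overrides = {
    "new_dashboard": {
        "region": {"europe-west1": True},
        "customer": {"acme": False, "globex": True},
    },
    "bulk_import": {
        "instance_type": {"non-production": True},
    },
}

report = find_overrides(overrides, "new_dashboard")
```

A report like this can surface customer-level exceptions that are otherwise hard to spot in a hierarchical UI, which helps when deciding whether a toggle can be removed.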
Get Toggling, Gain Control
While implementing feature toggling isn’t as intuitive as flicking a light switch, it’s worth the up-front investment to avoid the nightmare of feature release chaos. We encourage you to learn from our feature toggle journey and to start implementing a system that’s symbiotic with your organization’s internal processes and – especially – with your customers’ needs!