Unlock the power of data-driven feature flagging for A/B testing, gradual rollouts, and more. Learn how to implement randomization, conduct comparative analysis, and avoid common pitfalls.
Feature flags separate a code deploy from a feature release. At their simplest, they're a switch that turns a code path ON or OFF, which helps software teams test new features for things like performance issues. But the use cases don't end there. Feature flags are also used to try new features across population subsets. They're even effective in multivariate experiments that test several feature iterations at once.
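To make the deploy/release split concrete, here's a minimal sketch. The flag name and lookup helper are invented for illustration, not any particular SDK's API:

```python
# A minimal, hypothetical flag store. A real feature management SDK
# would evaluate targeting rules instead of a hard-coded dict.
FLAGS = {"new-checkout-flow": False}  # deployed OFF: new code stays dark

def is_enabled(flag_name: str) -> bool:
    return FLAGS.get(flag_name, False)

def checkout(cart_total: float) -> str:
    if is_enabled("new-checkout-flow"):
        return f"new checkout flow, total ${cart_total:.2f}"  # new feature
    return f"legacy checkout, total ${cart_total:.2f}"        # existing code

print(checkout(42.50))              # legacy path while the flag is OFF
FLAGS["new-checkout-flow"] = True   # the "release" is a config change,
print(checkout(42.50))              # not a redeploy
```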
In a typical feature flag rollout, first you deploy a flag that points to the existing code in the OFF position. The goal here is risk mitigation. Essentially, you want to find bugs before turning features ON and exposing them to your customers.
If you pass that test, you can slowly ramp things up. But you're never fully free from risk. As you ramp your features up to more and more customers, new issues may arise with the increase in traffic. Truly ensuring your code is making a difference at every stage of a progressive rollout requires data. Lots of data! And when that data is paired with feature flags—that's where the magic happens.
Here are a few advanced feature flagging tips. Let's take a look.
As you begin to embrace gradual rollout strategies, you'll notice that metrics aren't as obvious at small percentages. For example: if you're rolling out to just five percent of traffic, it's easy to miss an issue without the right lens.
Let's say you do miss something: By the time you ramp that feature up to one hundred percent, any minor issue is now major. Suddenly, there's a serious spike in latency with hardly enough time to save the customer experience. This isn't ideal for any engineering team.
That's why the best feature management platforms track metrics related to all flag changes at the feature level. Everything is captured in a graph that's quick and easy to read at a glance. Alerts can fire if metrics degrade. Any performance change (significant or minuscule) is identified, and as a benefit, you can truly optimize your gradual rollouts.
Feature-level awareness like this is a game changer. At the very least, it might save you a few hours of triage.
Studies show that ice cream sales and shark attacks both peak in the summer, but they're not related to each other. Correlation and causation are not the same thing.
I can recall a customer undergoing a distributed denial of service attack, which in turn hit our infrastructure pretty hard and made for sad-looking graphs. We wasted a ton of time trying to figure out if something was wrong with our code. As it turns out, nothing was wrong on our end; we just didn't have flag-aware monitoring in place to see that easily. Without the ability to tell whether new or old code was the culprit, we were drowning in shark-infested waters.
Luckily, science gives us a smarter way to look at these common conundrums through randomized, controlled trials.
If you randomize a population across a treatment and a control, you'll evenly distribute the effects of outside factors. If there are differences between your samples, pay close attention: those differences are what you should carefully measure. You'll notice patterns emerge, and you want to follow them. Like a data shark.
If data is distributed equally across all randomized samples, that's the ideal. In that case, there's no cause for concern. However, if your readings show a difference between samples, statistical analysis can tell you whether that difference is real, and thanks to randomization, your new code is the likely cause.
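One common way to put a difference between two randomized samples under that statistical lens is a two-proportion z-test. The counts below are made up purely for illustration:

```python
import math

# Hypothetical counts from a randomized rollout: conversions out of users
control_conv, control_n = 480, 10_000   # control (existing code)
treat_conv, treat_n = 560, 10_000       # treatment (new code)

p1, p2 = control_conv / control_n, treat_conv / treat_n
p_pool = (control_conv + treat_conv) / (control_n + treat_n)  # pooled rate
se = math.sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treat_n))
z = (p2 - p1) / se  # standardized difference between the two samples

# |z| > 1.96 corresponds to p < 0.05 on a two-sided test
print(f"control={p1:.1%} treatment={p2:.1%} z={z:.2f}")
print("likely a real difference" if abs(z) > 1.96 else "could be noise")
```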
Put it under a microscope, and solve the problem. Then, treat yourself to an ice cream cone.
The most modern software teams have significant telemetry flowing in from their systems to dashboards, and they have the ability to tag it. When you tag data as it passes through different experiences, you can segment it for comparative analysis. This is a powerful strategy.
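Here's a rough sketch of what that tagging can look like. The track helper and field names are invented for illustration; in practice, these records would ship to your telemetry backend:

```python
import json
import time

def track(event: str, user_id: str, treatment: str, **props) -> None:
    """Emit a telemetry event tagged with the experience the user saw."""
    record = {
        "event": event,
        "user_id": user_id,
        "treatment": treatment,  # the tag that enables segmentation later
        "ts": time.time(),
        **props,
    }
    print(json.dumps(record))  # stand-in for a real telemetry pipeline

# Each event carries its experience tag through the pipeline
track("page_load", "user-123", treatment="on", latency_ms=182)
track("page_load", "user-456", treatment="off", latency_ms=143)
```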
Some teams even have in-house analysis, whether it be BI or a similar system. If you can pull data into these systems, you can do ad hoc queries that perform pretty sophisticated tasks. The issue is that these are not always statistically rigorous. This is why more and more teams are moving toward experimentation platforms that account for every piece of granular data at the feature level.
Years ago, experimentation was an emerging Silicon Valley trend, common only at companies like LinkedIn and Facebook. Today, experimentation is backed by reliable, easily obtained tools. Now you can pair feature flags with the right data and parse it intelligently through feature management platforms, whether you call them that or not.
It's not just about feature flags. It's not just about data. It's about the two of them working together as a unit.
If you want a controlled study, randomization is central.
Take whatever population you're choosing and distribute the incoming data into two or more cohorts at random. But to do things right, there needs to be an element of stickiness for those who engage further with your experiment.
There's a reliable way to make this happen: hashing.
Basically, take an arbitrary piece of data, like the user ID. Then, run it through a hashing algorithm with a seed. The same data going in will come out as the same hash value each time, provided you give it the same seed. You want to use a different seed for each feature, so each feature rollout or experiment is independent from the others. What you end up with is a deterministic and repeatable way to show each participant a consistent experience, and to minimize interaction effects across experiments.
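A minimal sketch of that idea in Python (SHA-256 is used here for simplicity; production systems often prefer faster non-cryptographic hashes like MurmurHash, but the determinism argument is the same):

```python
import hashlib

def bucket(user_id: str, seed: str, buckets: int = 100) -> int:
    """Deterministically map a user ID to a bucket for a given seed."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

# Same user + same seed -> same bucket, every time
assert bucket("user-123", "checkout-exp") == bucket("user-123", "checkout-exp")

# A different seed re-randomizes, so experiments stay independent
print(bucket("user-123", "checkout-exp"), bucket("user-123", "search-exp"))
```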
Hashing is an effective strategy: it distributes people evenly across the output range. Suddenly, you've got randomization that's consistent, no matter how many times a person enters that experiment.
Let's say you assign people an arbitrary bucket number from zero to 99. As you ramp a feature up or down to a particular percentage of the population, people with their stable bucket numbers will be brought into or out of the experiment. But they won't get scrambled around in some sort of re-randomization unless you change that seed.
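Continuing the sketch above, ramping then becomes just a comparison against the stable bucket number:

```python
import hashlib

def bucket(user_id: str, seed: str) -> int:
    # Same bucketing helper as sketched above: stable per user, per seed
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, seed: str, percent: int) -> bool:
    # Ramping from 5% to 20% widens the range from buckets 0-4 to 0-19,
    # so everyone already in the rollout stays in; nobody is re-scrambled.
    return bucket(user_id, seed) < percent

for pct in (5, 20, 50, 100):
    print(pct, in_rollout("user-123", "checkout-exp", pct))
```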
In the attribution process, it's really important that your telemetry data includes the same key for each user (the one that was used to assign them an experience in the first place). That unique key can represent one user, one account, or one device, depending on your goal. It's also how the event data and the experience assignment data will be joined to perform attribution in the statistical engine.
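In miniature, that join looks something like this (the record shapes and values are invented for illustration; a real statistical engine performs the same join at much larger scale):

```python
# Toy assignment and event records keyed by the same user_id
assignments = [
    {"user_id": "u1", "treatment": "on"},
    {"user_id": "u2", "treatment": "off"},
]
events = [
    {"user_id": "u1", "event": "purchase", "value": 30.0},
    {"user_id": "u2", "event": "purchase", "value": 25.0},
    {"user_id": "u1", "event": "purchase", "value": 12.0},
]

treatment_of = {a["user_id"]: a["treatment"] for a in assignments}

# Attribute each event to the experience its user was assigned
totals: dict[str, float] = {}
for e in events:
    t = treatment_of.get(e["user_id"])
    if t is not None:           # skip events we can't attribute
        totals[t] = totals.get(t, 0.0) + e["value"]

print(totals)  # {'on': 42.0, 'off': 25.0}
```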
During calculation and analysis, the "who got what" is now combined with "what happened as a result." Suddenly, you're coloring in the big picture and bringing your experiments into plain view.
Management consoles are where your team can manage rollouts and review metric results. Most people's first feature flagging tool is powered by a static configuration file, an entry in a database, or even a headless service. The problem with these setups is that only the people with access to the rollout mechanism can view the results, and it takes deeper technical knowledge just to understand what's going on.
Eventually, you want to build or buy some kind of a front end that provides a consistent way to access data and management control. Walmart, LinkedIn and many other big companies have built these in-house. You don't need to.
Other teams rely heavily on alerts. If your goal is to accelerate software development, monitoring isn't something you should do ad hoc by "having a look now and then." Instead, look for a feature management tool with proactive alerting built in.
Proactive alerting at the feature level helps in two ways. First, it detects problems sooner, before complaints roll in. Second, it identifies which feature flag is the root cause, freeing up all the other teams that would otherwise have been pulled into a "war room" to triage.
Split makes an off-the-shelf experimentation platform for engineering teams to release software.
It's not so much geared toward marketers testing experiences. Sure, it can do that, but where Split truly shines is in simultaneously empowering software engineers with the control of feature management plus the measurement and learning benefits of causal analysis. With Split's Feature Data Platform™, the Mobius Loop of continuous delivery finally has a built-in way to receive the fast feedback that frequent releases need. This lets you build software more efficiently and achieve success along the way. Oh, and it's a lot more fun than working in the dark.
Split Arcade includes product explainer videos, clickable product tutorials, manipulatable code examples, and interactive challenges.
Split gives product development teams the confidence to release features that matter faster. It’s the only feature management and experimentation platform that automatically attributes data-driven insight to every feature that’s released—all while enabling astoundingly easy deployment, profound risk reduction, and better visibility across teams. Split offers more than a platform: It offers partnership. By sticking with customers every step of the way, Split illuminates the path toward continuous improvement and timely innovation. Switch on a free account today, schedule a demo to learn more, or contact us for further questions and support.