Continuing on our theme of things people don’t often think about when using feature flags, we want to take a look at another place where feature flags will make a huge impact for your organization, but which you may not have considered yet: incident response. It’s true – feature flags can help you gain faster incident resolution. Let’s dig into how.
We’ll use a hypothetical scenario that maybe isn’t so hypothetical.
It’s 3pm in California and suddenly, tickets start rolling in that users can’t access a key section of the application. It appears to be a frontend bug blocking the clickable area.
The engineering team on the west coast begins to investigate, but it’s not in a service most of them work on. Most of them are backend anyway. For this service, the frontend team is largely based in Madrid where it’s midnight, or London where it’s 11pm. They’re not reachable and while they can be paged, no one likes doing that.
After some digging, it appears that a PR merged by the Europe team near the end of their day is the likely root cause. Rolling back seems like the best way to fix it. However, the ops team is based in India, where it’s currently 4:30am, so they won’t be online for a few more hours.
Here’s where we enter into two different worlds.
Without feature flags, you have three options:
- You can page your engineers in India to do a rollback, which will also revert good changes in the release (and make the corresponding docs updates incorrect for a few hours).
- You can page the team in Europe to push out a fix. But unlike the team in India, they’re not close to the start of their day, so local laws on working hours may come into play once you page them.
- You can simply live with the customer disruption for a few hours until India wakes up to do the revert, and then later, Europe can apply the fix. This means the customer will see an issue for a few hours, and your docs and communications around the release will take about a full day to get back in alignment. That’s if the fix is a quick one.
With feature flags, the team in California can disable the flag for that specific UI update, the rest of the release stays live, and the incident is over from the customer’s perspective. Getting from the first report to resolution takes maybe ten minutes, no one is paged, no rollback is needed, and all the good parts of your release stay active.
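To make "disable the flag" a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `FlagStore` class, the flag name `new-section-layout`, and the `render_section` function are stand-ins we made up for illustration, and a real setup would query a feature flag service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; stands in for a real flag service."""
    flags: dict[str, bool] = field(default_factory=dict)

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so a missing flag means "old behavior".
        return self.flags.get(name, default)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def render_section(store: FlagStore) -> str:
    # The UI change from the release is only reachable through its flag.
    if store.is_enabled("new-section-layout"):
        return "new layout"
    return "old layout"


store = FlagStore({"new-section-layout": True})
before = render_section(store)  # release is live, the new UI is served

# Incident response: flip one flag. No deploy, no rollback, no page.
store.set("new-section-layout", False)
after = render_section(store)   # users are back on the known-good UI
```

The point of the sketch is that the old code path is still in the build, so turning the flag off is a data change, not a deployment.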
Expanding a Bit More…
What we see in this scenario is that an incident that would otherwise disrupt multiple teams, require pages, rollbacks, or both, and leave customers affected for anywhere between a few hours and a full day is over in ten minutes.
This is not limited to frontend changes. As we explored previously, feature flags can and should be utilized for backend changes, API changes, and more. We encourage teams to think beyond “feature flags are used to release new features” and instead think about feature flags as a critical part of change management.
Any change to your application is a candidate for release behind a feature flag. The flag, in this case, serves not just as a tool for more effective testing and releasing, but also as a way to make sure that no change to your application ever takes more than a single toggle to disable in prod.
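The same idea applies to a backend change. The sketch below wraps a new code path in a small `guarded` helper so the old path stays available behind a single toggle. The `FLAGS` dictionary, the flag name `use-new-rate-limiter`, and both rate limiter functions are invented for this example. Falling back to the old path when the flag lookup fails is our own assumption about a sensible default, not a rule from any particular flag tool.

```python
from typing import Callable

# Hypothetical flag state; in practice this would come from your flag service.
FLAGS = {"use-new-rate-limiter": True}


def flag_enabled(name: str) -> bool:
    # Missing flags count as off, so the old code path is the default.
    return FLAGS.get(name, False)


def guarded(name: str, new_path: Callable[[], str], old_path: Callable[[], str]) -> str:
    """Run new_path only while the flag is on; otherwise keep the old behavior."""
    try:
        enabled = flag_enabled(name)
    except Exception:
        # Assumption: if the flag lookup itself fails, use the known-good path.
        enabled = False
    return new_path() if enabled else old_path()


def new_rate_limiter() -> str:
    return "token bucket"


def old_rate_limiter() -> str:
    return "fixed window"


with_flag_on = guarded("use-new-rate-limiter", new_rate_limiter, old_rate_limiter)

FLAGS["use-new-rate-limiter"] = False  # the single toggle
with_flag_off = guarded("use-new-rate-limiter", new_rate_limiter, old_rate_limiter)
```

Keeping the old path callable until the new one has proven itself is what makes the toggle meaningful; once the change is stable, the flag and the old path can be removed together.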
The more you dig into feature flags, the more impact you will find – and not just with faster incident resolution.
Try to think about feature flags not only as part of your release strategy, but also as part of how you design your apps to be resilient and to keep your MTTR (mean time to recovery) low. That will improve your customer experience and your support satisfaction, and it will keep your employees focused on the work they most need to do, without the constant disruption that more complicated incidents bring.
Hope this article was useful for you! If you’d like to do some further reading on feature flags, how about reading up on 5 Common Challenges When Using Feature Flags?