Adding chaos experiments to Harness CD pipelines is easy; it takes only a few clicks. All you need to do is add a chaos step to a deployment stage and select the expected resilience score. In this article, we discuss in detail the why (the benefits of adding chaos experiments to the business's pipelines), the what (the types of chaos experiments you should consider injecting), and the how (end-to-end technical details) of running chaos experiments in Harness CD pipelines.
The following are the reasons organizations run chaos experiments in CD pipelines.
Enterprises have invested in CD pipeline technology to achieve higher agility with lower developer toil. Is this ROI being offset by a potential loss of reliability? Instead of leaving it to chance, verify the resilience of the code you ship against the chaos scenarios it may face, so you realize the maximum ROI on your CD investment.
Are your developers staying current with ongoing design and architecture changes? A successful functional test in a pipeline does not guarantee that design, architecture, or dependency changes have been validated, and not all developers understand these changes deeply. Well-written chaos experiments bring gaps in these areas to developers' immediate attention when they break the pipeline. Developers then address design or implementation gaps early, in the pipelines, rather than in production at a much higher cost.
Just like tech debt, resilience debt can keep building up on your production services. Alerts and incidents get registered in your production environment and end up in the resolved or to-be-watched queue. Sometimes they result in a hot fix, a code patch, or a config change; many times, developers simply add a workaround such as increasing memory, adding CPU, or adding more nodes. In either case, the product teams can take this feedback and act on it by adding the relevant chaos tests as verification steps in the pipelines. Policies can be configured and enforced so that the relevant chaos experiments are added before pipelines are approved for deployment. For example, a policy could state that every incident or alert caused by a misbehaving component, network loss, network slowness, external APIs not responding as expected, higher load, and so on must have a corresponding chaos experiment validated in a pipeline within 60 days of that incident or alert. This brings in a discipline of checking the resilience debt being built up. Developers and QA teams are pushed to focus on what needs to be fixed in the production code rather than continuing to push new capabilities into production and accruing more resilience debt.
CD pipelines run when there are new code changes to deploy. Functional and integration tests are performed to ensure that the recent changes do not break existing functionality and that new features work as expected. In addition to verifying the new code, pipeline runs provide an opportunity to verify the following resilience scenarios.
Resilience coverage in a pipeline always starts low, and you can only increase it once the chaos experiments are well automated. Every change deployed through the pipeline is then tested against the already-known resilience conditions. This is about ensuring the resilience score stays stable in the target environment.
While the code changes being deployed are one dimension, the new resilience tests or chaos experiments added to the pipeline increase the significance of each pipeline run. This is about increasing resilience coverage. If increasing the coverage reduces the resilience score, you have found a potential weakness that has not yet caused an outage in production but is worth looking at now (i.e., an Avoidable Incident). Developers then look into the failed chaos experiment and decide either to stop the pipeline or to let it proceed and take action in parallel, such as documenting a recovery scenario, writing a note to the SREs, or suggesting a config change.
A change in the underlying platform, such as the Kubernetes version, brings a lot of attention to the testing that happens in the pipeline. The resilience score may drop, indicating new potential weaknesses under the upgraded platform. This scenario is especially important to platform owners and to the applications that deploy onto these platforms. Enterprises often fall behind on Kubernetes upgrades and then go through multiple version updates at a time. Individually, new versions rarely introduce many changes, but a platform that is several versions behind can pick up latent issues that will impact an application team in the future. By introducing automated chaos tests into the CD pipeline for these applications, you get ahead of issues that would otherwise be revealed only later, during an incident.
This use case requires coordination between the SRE/Ops teams and the pipeline builders to add resilience tests related to recently observed incidents or alerts. Each incident or alert can potentially result in a new chaos experiment if one does not already exist.
If the new changes are configuration changes within the software or in the target infrastructure, the resilience tests can surface new potential weaknesses. This is another scenario in which resilience tests that passed earlier start failing because the target environment changed through a higher or lower configuration.
Developers create chaos experiments in the chaos project. The chaos experiments are then pulled into the pipeline as steps. Each run of a chaos step results in either meeting the expected resilience score or failing to meet it, at which point the configured failure strategy of the pipeline stage can be invoked.
Create a chaos experiment and run it to make sure it runs to completion. Add the relevant probes so that the resilience score does not produce false positives or false negatives.
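As an illustration, the sketch below shows an HTTP probe in the LitmusChaos-style schema that Harness Chaos Engineering builds on. The service URL, timings, and criteria are hypothetical, and exact field names can vary by version.

```yaml
# Illustrative probe definition (LitmusChaos-style schema used by Harness CE).
# The URL, timings, and criteria below are hypothetical examples.
probe:
  - name: check-cart-service-availability
    type: httpProbe              # validate an HTTP endpoint while the fault runs
    mode: Continuous             # keep checking throughout the chaos duration
    httpProbe/inputs:
      url: http://cart.myapp.svc.cluster.local:8080/health
      method:
        get:
          criteria: ==           # pass only if the response code matches
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 1
```

A probe like this ties the resilience score to an observable service behavior, which is what keeps the score from reporting a false pass or false fail.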
Add this chaos experiment to the pipeline as a chaos step.
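In pipeline YAML, the chaos step references the experiment and carries the expected resilience score. A minimal sketch is shown below; the identifiers and the experiment reference are placeholders, and field names may differ slightly across Harness versions.

```yaml
# Minimal sketch of a chaos step inside a CD deployment stage.
# experimentRef and the identifiers are placeholders for your own values.
- step:
    type: Chaos
    name: pod-delete-experiment
    identifier: pod_delete_experiment
    spec:
      experimentRef: <chaos-experiment-id>   # experiment created in the chaos project
      expectedResilienceScore: 100           # step fails if the run scores below this
```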
Choose a failure strategy. The failure strategy is planned at the CD stage level. It can be configured against each chaos step individually or through a shell script step that runs after all the chaos experiments have executed.
Example 1: Configuring failure strategy when only one chaos experiment is in the pipeline.
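A sketch of this setup is shown below, assuming the standard Harness failure-strategy block attached to the chaos step; the rollback action is just one possible choice.

```yaml
# Sketch: a single chaos step with its own failure strategy.
# If the expected resilience score is not met, the step errors out and the
# configured action runs. StageRollback is one option; Abort or
# ManualIntervention are common alternatives.
- step:
    type: Chaos
    name: pod-delete-experiment
    identifier: pod_delete_experiment
    spec:
      experimentRef: <chaos-experiment-id>
      expectedResilienceScore: 100
    failureStrategies:
      - onFailure:
          errors:
            - AllErrors
          action:
            type: StageRollback
```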
Example 2: When multiple chaos experiments are added to the pipeline.
In this case, you add the expected resilience score for each chaos step but don't configure a failure strategy for each experiment. Instead, a separate conditional failure step (a shell script-based step) is added after all the chaos steps have executed.
The shell script can access the step results (resilience scores) of all the chaos experiments, and you can write custom logic for your situation. A non-zero return value from this conditional failure step then kicks off the failure strategy of the stage.
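A minimal sketch of such a conditional failure step follows. It assumes the resilience score of each chaos step can be read through a Harness expression; the expression path shown here (output.resilienceScore) is an assumption and should be checked against the output variables your chaos steps actually expose.

```yaml
# Sketch: conditional failure step evaluated after all chaos steps.
# The expression paths used to read the resilience scores are assumptions;
# verify the actual output variable names exposed by your chaos steps.
- step:
    type: ShellScript
    name: evaluate-resilience-scores
    identifier: evaluate_resilience_scores
    spec:
      shell: Bash
      onDelegate: true
      source:
        type: Inline
        spec:
          script: |
            # Hypothetical expressions resolving to each chaos step's resilience score
            SCORE_POD_DELETE=<+execution.steps.pod_delete_experiment.output.resilienceScore>
            SCORE_NETWORK_LOSS=<+execution.steps.network_loss_experiment.output.resilienceScore>

            # Custom logic: fail the step (non-zero exit) if any score is below 80
            for score in "$SCORE_POD_DELETE" "$SCORE_NETWORK_LOSS"; do
              score_int=${score%.*}   # drop any decimal part before comparing
              if [ "$score_int" -lt 80 ]; then
                echo "Resilience score $score is below the threshold of 80"
                exit 1                # non-zero exit triggers the stage failure strategy
              fi
            done
            echo "All resilience scores meet the threshold"
```

The threshold and the aggregation logic are entirely up to you; for example, you could compare each score against its own expected value or fail only when the average drops.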
Introducing chaos experiments into the pipelines brings significant benefits to the reliability of business-critical services, improves developer productivity, and greatly mitigates the risk of unhandled unknowns. Use Harness CD and Harness CE together to seamlessly introduce resilience tests into the deployment stages of your pipelines.
Harness Chaos Engineering is built with the building blocks described above, which are needed to roll out the Continuous Resilience™ approach to chaos engineering. It comes with many out-of-the-box faults, security governance, chaos hubs, the ability to integrate with CD pipelines and Feature Flags, and more.