Chaos engineering is the science of injecting faults into a system and verifying its steady state. In this article, we will not delve into the concepts of chaos engineering but will look at a more modern implementation approach called continuous resilience. To understand the basics of chaos engineering, refer to the article here. Traditionally, chaos engineering has been known for verifying the resilience of critical systems and services in production. More recently, it has been used to ensure resilience across the entire SDLC. Ensuring resilience at every stage of the SDLC is the most efficient way to deliver maximum availability of business-critical services to end users.
Chaos engineering as a concept is well understood by most of the audience. The challenges today lie in implementing and scaling the practice across an organization, measuring the success of chaos experimentation efforts, and knowing what it takes, in time and effort, to reach the final milestone of resilience. The reason for these challenges is that chaos engineering has traditionally been taught as an exercise in carefully planning and running a set of experiments in production using the GameDay approach. The success of a GameDay rests in the hands of the few individuals responsible for designing and orchestrating its execution from time to time, and GameDays are not automated like regular quality or performance tests.
In modern chaos engineering practice, developers and QA teams share in chaos experiment development. The experiments are automated in all environments and are run by all personas: developers, QA teams, and SREs. The focus on resilience is built into every gate of the SDLC, which leads us to the term Continuous Resilience. In the continuous resilience approach, we expect most chaos experiment runs to happen in the pipelines. However, continuous resilience is NOT just running chaos in pipelines; it is about automating chaos experiment runs in all environments - Dev, QA, Pre-Prod, and Prod - though with varying degrees of rigor.
GameDays can still be used alongside automated chaos experimentation, especially on critical systems where resilience needs to be verified on an as-needed basis. GameDays also provide a means to validate documentation and recovery procedures, and to train engineers on incident response best practices.
When compared to the GameDay approach of chaos engineering, the continuous resilience approach is built around the following tenets:
Well-written chaos experiments are the building blocks of a successful chaos engineering practice. These experiments must be continuously updated to keep up with changes to the software or infrastructure configuration. This is why chaos experimentation is called chaos “engineering.” The best practice is to manage the lifecycle of chaos code just like regular software code: the chaos experiment code is developed and maintained in a source code repository, tested for false positives and negatives, validated in the dev environment, and then promoted for use in larger environments.
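As an illustration, an experiment kept in a source repository might be expressed as code roughly as follows. This is a tool-agnostic sketch; the class and its fields are assumptions for this article, not the schema of any particular chaos framework.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChaosExperiment:
    """A chaos experiment expressed as code so it can be versioned, reviewed, and promoted."""
    name: str
    steady_state_checks: List[Callable[[], bool]]  # probes that must hold before and after the fault
    inject_fault: Callable[[], None]               # e.g., kill a pod or add network latency
    rollback: Callable[[], None]                   # restore the system after the fault

    def run(self) -> bool:
        # Verify the steady state, inject the fault, verify again, then roll back.
        if not all(check() for check in self.steady_state_checks):
            return False  # the system is not healthy to begin with; abort
        try:
            self.inject_fault()
            return all(check() for check in self.steady_state_checks)
        finally:
            self.rollback()
```

Because the experiment is plain code under version control, changes to it can go through the same review, testing, and promotion workflow as the application code it targets.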
Chaos experiments should be easy to tune, import, export, and share across members of a team or across teams in an organization. Chaos Hubs - centralized repositories of chaos experiments with these capabilities - should be used for chaos experiment development and maintenance.
In this method, QA team members and developers are actively involved in chaos experiment development; it is not limited to just the SREs, as in the traditional GameDay approach.
A rollout of a chaos practice will usually be associated with business goals such as “% decrease in production incidents” or “% decrease in recovery times.” These metrics are largely tied to production systems and are owned by the Ops teams or SREs; for example, you cannot measure uptime or recovery time in QA test beds or pipelines. In the continuous resilience approach, where all personas are involved in chaos experimentation, the metrics should be relevant to developers, QA teams, and QA pipelines as well as to pre-production, production, and SREs. For this need, two new metrics are suggested:
- Resilience Scores
- Resilience Coverage
Resilience Score: An experiment may contain one or more faults injected against one or more services; the resilience score measures how well the system’s steady state held up while those faults were injected. The resilience score can be tied to a chaos experiment or to a service.
Resilience scores can be used in pipelines, QA systems, and production by all the personas.
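One simple way to compute such a score is as the weighted percentage of steady-state probes that pass while the faults are injected; the weighting scheme below is an illustrative assumption, not a product-defined formula.

```python
# Illustrative resilience score: the weighted percentage of steady-state probes
# that passed across all faults in an experiment. Weights are an assumption made
# for this sketch.
from typing import List, Tuple

def resilience_score(probe_results: List[Tuple[int, bool]]) -> float:
    """probe_results: (weight, passed) for each steady-state probe evaluated
    while its fault was injected."""
    total_weight = sum(weight for weight, _ in probe_results)
    if total_weight == 0:
        return 0.0
    passed_weight = sum(weight for weight, passed in probe_results if passed)
    return 100.0 * passed_weight / total_weight

# Three probes with weights 2, 1, and 1; one fails -> a 75% resilience score.
print(resilience_score([(2, True), (1, True), (1, False)]))  # 75.0
```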
Resilience Coverage: While the resilience score measures the actual resilience of a service or an experiment, resilience coverage shows how many more chaos experiments are needed before the entire system can be declared checked for resilience. It is similar to code coverage in software testing. Together with the resilience score, resilience coverage gives complete control over resilience measurement. Resilience coverage applies to a service or a system.
Because the number of possible chaos experiments that can be created against a service can be overwhelming, the resilience coverage metric should be used to scope a practical number of experiments.
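As a rough illustration, resilience coverage can be computed as the fraction of identified, relevant experiments that have actually been automated and run; the experiment names below are hypothetical.

```python
# Illustrative resilience coverage: the share of identified chaos experiments for a
# service (or system) that have actually been automated and executed, analogous to
# code coverage. The experiment catalog here is hypothetical.
def resilience_coverage(identified: set, executed: set) -> float:
    if not identified:
        return 0.0
    return 100.0 * len(identified & executed) / len(identified)

identified = {"pod-kill", "network-latency", "cpu-hog", "dns-error", "disk-fill"}
executed = {"pod-kill", "network-latency"}
print(resilience_coverage(identified, executed))  # 40.0 -> 40% resilience coverage
```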
Automation of chaos experiments is critical for achieving continuous resilience. Resilience coverage always starts at a low number, typically <5%, with experiments that are either very safe to run or targeted at critical services. In both cases, assurance of resilience is essential whenever new code is introduced via the pipelines, so these experiments must be inserted into the regular pipelines that verify the sanity or quality of the rollout process. As chaos experiments are automated with every development cycle, resilience coverage increases in steps, eventually reaching 50% or more within a few months or quarters, depending on the time the dev/QA teams spend on it.
If a pipeline reaches 80% resilience coverage with >80% resilience scores, the risk of a resilience issue or outage occurring in production from known sources is mitigated to a large extent, leading to improved reliability metrics such as MTTF (Mean Time To Failure).
Pipeline Policies can be set up to mandate the insertion of chaos experiments into the pipeline.
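A policy gate of this kind can be expressed very simply. The sketch below assumes the chaos tooling reports a coverage and a score percentage, and the 80% thresholds mirror the figures above rather than any mandated standard.

```python
# Sketch of a pipeline policy gate: fail the stage when resilience coverage or the
# resilience score reported by the chaos tooling falls below a threshold. The
# thresholds and function name are illustrative assumptions.
def enforce_resilience_policy(coverage_pct: float, score_pct: float,
                              min_coverage: float = 80.0, min_score: float = 80.0) -> None:
    if coverage_pct < min_coverage:
        raise SystemExit(f"Policy violation: resilience coverage {coverage_pct:.0f}% "
                         f"is below the required {min_coverage:.0f}%")
    if score_pct < min_score:
        raise SystemExit(f"Policy violation: resilience score {score_pct:.0f}% "
                         f"is below the required {min_score:.0f}%")

# A CI step might call this with the values reported for the current build:
enforce_resilience_policy(coverage_pct=85.0, score_pct=92.0)  # passes silently
```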
Automating the pipelines and measuring coverage through policies effectively mandates the development of chaos experiments by the team members who know the product best, i.e., developers and QA team members. SREs can then reuse these experiments in production-grade environments. This is the organic process by which an organization's chaos engineering practice matures from Level 1 to Level 4.
Chaos integration into the pipelines can be done in many ways; one possible pattern is sketched below.
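For instance, a post-deployment pipeline step could trigger a chaos experiment, wait for it to finish, and fail the stage if the resulting resilience score is too low. The endpoint, response fields, and experiment ID below are hypothetical assumptions, not a real product API; only the overall pattern is the point.

```python
# Hypothetical sketch of a pipeline step that runs a chaos experiment after a
# deployment and gates the rollout on the resilience score. The chaos service
# endpoint and JSON fields are assumptions for illustration only.
import json
import time
import urllib.request

CHAOS_API = "https://chaos.example.internal/api"  # hypothetical endpoint

def run_experiment_and_gate(experiment_id: str, min_score: float = 80.0) -> None:
    # Trigger the experiment run.
    req = urllib.request.Request(f"{CHAOS_API}/experiments/{experiment_id}/run", method="POST")
    run_id = json.load(urllib.request.urlopen(req))["runId"]

    # Poll until the run completes.
    while True:
        status = json.load(urllib.request.urlopen(f"{CHAOS_API}/runs/{run_id}"))
        if status["phase"] in ("Completed", "Failed"):
            break
        time.sleep(10)

    # Fail the pipeline stage when the resilience score falls below the threshold.
    score = status.get("resilienceScore", 0.0)
    if score < min_score:
        raise SystemExit(f"Resilience score {score} is below the required {min_score}; failing the stage")
```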
Chaos experimentation is often perceived as disruptive because it can cause unwanted delays if services are brought down unexpectedly. For example, bringing down all the nodes of a Kubernetes cluster in a QA environment through a chaos experiment does not help find a weakness, but it can cause enormous interruptions to the QA process and its timelines. For this reason, chaos experimentation should be bounded by the required guard rails or security governance policies.
Examples of security governance around chaos experiments include controls on who can run experiments, which faults can be injected into which environments or infrastructure, and during which time windows experiments are allowed to run.
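For illustration, a guard rail might restrict which faults can run in which environments, during which hours, and by whom; the rules below are hypothetical assumptions rather than any product's actual governance configuration.

```python
# Minimal sketch of a chaos guard rail: allow a fault to run only if the user, the
# target environment, and the current time window are all permitted by policy.
# The policy contents are illustrative assumptions.
from datetime import datetime

POLICY = {
    "qa":   {"allowed_faults": {"pod-kill", "network-latency"}, "allowed_hours": range(0, 24)},
    "prod": {"allowed_faults": {"network-latency"},             "allowed_hours": range(22, 24)},
}
ALLOWED_USERS = {"sre-team", "qa-team"}

def is_experiment_allowed(user: str, environment: str, fault: str, now: datetime) -> bool:
    rules = POLICY.get(environment)
    return (
        rules is not None
        and user in ALLOWED_USERS
        and fault in rules["allowed_faults"]
        and now.hour in rules["allowed_hours"]
    )

print(is_experiment_allowed("qa-team", "prod", "pod-kill", datetime.now()))  # False: fault not allowed in prod
```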
With security governance policies like those above, chaos experimentation becomes more practical at scale, and automation can increase resilience coverage.
Harness Chaos Engineering provides the building blocks described above for rolling out the Continuous Resilience™ approach to chaos engineering. It comes with many out-of-the-box faults, security governance, chaos hubs, the ability to integrate with CD pipelines and Feature Flags, and more.