March 14, 2023

How Chaos Engineering Strengthens Your Disaster Recovery Plan

“Hope is not a strategy.” This quote embodies the core philosophy of chaos engineering. We can’t just sit around and hope that our business never experiences a costly service disruption. It’s essential to act now and prepare for the worst by adding chaos engineering to your disaster recovery (DR) testing.

In the cloud-native world, chaos engineering is a necessity: it prepares companies for increasingly common disruptive events caused by natural disasters, cyberattacks, and unexpected technology failures. Whether you’ve already got a disaster recovery plan in place or are just getting started, chaos engineering gives your team an extra layer of proactive preparation.

The Cost of Incidents

Service disruptions happen in every industry. While most outages are relatively brief, major ones can significantly damage an organization’s bottom line and reputation. Here are a few recent examples that resulted in some seriously undesirable headlines and business impact:

  • In January of 2023, the Federal Aviation Administration (FAA) experienced an issue with its NOTAM (Notice to Air Missions) system. The issue, likely caused by a damaged database file, delayed roughly 32,000 flights within, into, or out of the United States. The cost to U.S. airlines and affected travelers was incalculable, and the damage to the FAA’s reputation was also significant, with some members of Congress calling for investigations into the FAA’s operations.
  • In December of 2022, Southwest Airlines experienced an outage that resulted in the cancellation of 16,700 flights from December 21 to 31 – peak holiday travel season. Numerous factors contributed to the Southwest outage, but an outdated computer system definitely played a major role. Southwest’s stock plummeted immediately following the incident, and at least one affected passenger has sued the airline. According to a company filing, the airline lost as much as $825 million due to the outage.
  • In December of 2021, Amazon Web Services (AWS) experienced multiple outages that impacted numerous businesses in the U.S., including Netflix, Slack, Amazon's Ring, and DoorDash. The first of these outages was caused by a glitch in automated software whose unexpected behavior overwhelmed AWS networking devices and impacted systems on the East Coast. The impact was far-reaching, also affecting companies such as Google, Disney Plus, and Venmo.

Don’t Be the Next Headline: Create a Disaster Recovery Plan

Avoiding massive unplanned outages means taking a proactive approach to disaster recovery right now. One of the first steps is to create a disaster recovery plan (DRP). The DRP implementation is an overarching exercise encompassing technology, people, and processes. Together, these result in a playbook to achieve efficient recovery.

An important part of creating an effective DRP involves engineering teams collaborating with business leaders to list resources in order of criticality, along with the potential failure points associated with each (also known as a business impact analysis). Then, you need to simulate those failures and verify the theoretical recovery paths (whether automated or manual).

At Harness, our best practice is to build a service map that captures the criticality of each component, its incident history, its code or binary artifacts, and the underlying infrastructure with its associated dependencies. In its simplest form, the service map outlines the tech stack, including databases, caches, message brokers, and dependencies. This enables the team to understand the architecture of the system and decide which chaos experiments to run.
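
To make that concrete, here is a minimal sketch of what such a service map might look like in code. All service names, criticality tiers, and suggested experiments below are hypothetical examples for illustration, not output from any Harness tool.

```python
# A hypothetical service map; names, tiers, and mappings are illustrative only.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    criticality: int                    # 1 = most critical
    stack: list[str]                    # databases, caches, message brokers, ...
    depends_on: list[str] = field(default_factory=list)
    incident_count: int = 0             # rough incident history


def experiments_for(service: Service) -> list[str]:
    """Map stack components to the chaos experiments worth running first."""
    suggestions = []
    if "postgres" in service.stack:
        suggestions.append("database connection loss")
    if "redis" in service.stack:
        suggestions.append("cache pod kill / eviction")
    if "kafka" in service.stack:
        suggestions.append("broker network latency / partition")
    if service.depends_on:
        suggestions.append("downstream dependency unavailability")
    return suggestions


service_map = [
    Service("checkout", criticality=1, stack=["postgres", "redis"],
            depends_on=["payments"], incident_count=3),
    Service("payments", criticality=1, stack=["postgres", "kafka"]),
    Service("recommendations", criticality=3, stack=["redis"]),
]

# Test the most critical, most incident-prone services first.
for svc in sorted(service_map, key=lambda s: (s.criticality, -s.incident_count)):
    print(f"{svc.name}: {experiments_for(svc)}")
```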

Gaining Insights from Chaos Testing Metrics

One of the benefits of chaos testing is gaining an accurate understanding of key reliability metrics.

DR in the cloud-native world comprises both the traditional active-passive model and the now widely adopted active-active model, with application deployment topologies featuring cross-zone or cross-region replicas. The takeaways, or metrics, from chaos tests differ between these two DR models (a minimal sketch of the calculations follows the list below):

  • Uptime/availability and performance (QPS, latency) are of primary concern in the active-active model. These help in assessing whether the error budget burn is within limits.
  • Google’s four Golden Signals – latency, traffic, errors, and saturation – provide leading indicators, often expressed as SLIs and SLOs, for the same error budget assessment.
  • MTTR (mean time to recovery) or TTM (time to mitigate) is of prime concern in the active-passive model. These help in gauging whether the recovery time objective (RTO) and recovery point objective (RPO) are being met.
  • There are some common attributes of interest too, such as mean time to detect failures (MTTD) and other expectations around observability (the effectiveness of alerts, logs, and other debugging aids within the platform).
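
As a rough illustration of those two sets of metrics, the sketch below computes an error budget burn for the active-active case and a recovery time versus RTO check for the active-passive case. The SLO target, request counts, timestamps, and RTO are hypothetical numbers chosen purely for the example.

```python
# Hypothetical numbers throughout; only the arithmetic is the point.
from datetime import datetime, timedelta

# Active-active: error budget burn observed during the chaos experiment window.
slo_target = 0.999                       # 99.9% availability SLO
total_requests = 120_000
failed_requests = 90
observed_availability = 1 - failed_requests / total_requests
error_budget = 1 - slo_target            # allowed failure fraction (0.1%)
budget_burned = (failed_requests / total_requests) / error_budget
print(f"availability={observed_availability:.4%}, "
      f"error budget burned={budget_burned:.0%}")

# Active-passive: recovery time versus the RTO stated in the DR plan.
failure_injected = datetime(2023, 3, 1, 10, 0, 0)
failure_detected = datetime(2023, 3, 1, 10, 4, 0)    # drives MTTD
service_restored = datetime(2023, 3, 1, 10, 21, 0)   # failover complete
mttd = failure_detected - failure_injected
recovery_time = service_restored - failure_injected
rto = timedelta(minutes=30)
print(f"MTTD={mttd}, recovery time={recovery_time}, RTO met: {recovery_time <= rto}")
```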

Chaos experimentation as part of a DRP is often conducted as gamedays (also called fire drills) with multiple stakeholders participating. The chaos scenarios implemented within these gamedays are expected to grow in blast radius as the recovery paths are solidified, as in the hypothetical progression sketched below.
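
The stages below are a hypothetical example of such a progression, not a prescribed Harness workflow; the scenarios and their ordering would depend on your own service map and risk tolerance.

```python
# A hypothetical blast-radius progression for successive gamedays.
gameday_stages = [
    ("kill one replica of a stateless service", "single pod"),
    ("inject latency into the primary database", "single service"),
    ("take down an availability zone", "one zone"),
    ("fail over an entire region", "one region"),
]

for stage, (scenario, blast_radius) in enumerate(gameday_stages, start=1):
    print(f"Stage {stage}: {scenario} (blast radius: {blast_radius})")
```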

Start Evolving Your IT Disaster Recovery to Continuous Resilience with Harness Chaos Engineering

Chaos engineering helps organizations minimize financial and reputational impact associated with unplanned downtime. It also enables developers to focus on software delivery rather than fire-fighting production incidents. 

Chaos experiments go beyond traditional unit, integration, and system tests, and more closely represent what random failures in a real-world, production environment look like. This realism provides insight into how systems behave, equipping teams to identify the weak links in their applications and infrastructure and to proactively build resilience that helps prevent costly downtime.

The Harness Chaos Engineering (CE) module helps engineering and reliability teams navigate the risks of unplanned downtime by purposely creating failure scenarios (i.e., chaos) to identify system weaknesses and improve reliability.

Getting started with chaos engineering has never been so simple. If you are ready to see how your organization can adopt this practice and start improving reliability, request a Harness CE demo or start your SaaS trial today!
