November 5, 2024

Use Chaos Engineering to Implement Cost Savings Strategies on AWS

Chaos Engineering provides a safe way to validate AWS cost-saving measures, ensuring they don’t compromise system resilience or performance. Harness Chaos Engineering includes experiments for testing auto-scaling, Spot Instance compatibility, and critical resource dependency, supporting informed cloud optimization. By simulating failures, teams gain confidence that cost-saving changes won’t disrupt customer experience, ultimately maximizing AWS investment.

In this blog, we will discuss the top five FinOps strategies organizations use on AWS to optimize their cloud spend, and how chaos engineering can help you validate your systems' resilience when you downsize or remove resources.

Top 5 FinOps Strategies Used On AWS to Optimize Cloud Spend

  1. Identifying and terminating unused resources.
  2. Choosing optimized instance purchasing options (On-Demand vs. Spot vs. Reserved Instances).
  3. Rightsizing compute resources.
  4. Releasing idle Elastic IP addresses.
  5. Deleting unused EBS snapshots.

Cost optimization is so critical that it is one of the six pillars of the AWS Well-Architected Framework. Before right-sizing or terminating unused resources, it is important to understand the impact the change will have on your applications. This is where chaos engineering helps: it lets you test the resilience of your system when you downsize compute resources or remove some resources entirely.

Chaos Engineering is the discipline of performing experiments on software systems to build confidence in their ability to withstand turbulent and unexpected conditions. In other words, failures are intentionally injected into applications to strengthen resilience. By proactively introducing controlled chaos, you can confidently validate your optimization strategies to reduce your cloud spending without impacting your end customers.

We will look into three areas where Chaos engineering can help reduce your cloud spend.

  1. Rightsizing the compute resources for your instance
  2. Switching between On-demand vs Spot Instances
  3. Identifying unused resources and terminating them

1: Rightsizing the Compute Resources

When it comes to compute, AWS offers many EC2 instance types with different CPU architectures, vCPU counts, memory, and storage options. It is up to your teams to identify the instance types that serve the needs of your workloads and applications. Picking the right type is not always easy, and teams often default to larger instances than necessary, which wastes compute resources and drives up cost.

One of the most effective strategies is to configure Auto Scaling for your EC2 fleet so it grows to meet increased traffic demand and cuts back during low-traffic periods.
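To make the scale-out/scale-in behavior concrete, here is a minimal sketch of the decision a CPU-based scaling policy effectively makes. It is illustrative only (real Auto Scaling is driven by CloudWatch alarms, policy types, and cooldowns), and the thresholds and group sizes are assumptions:

```python
def desired_capacity(current: int, cpu_pct: float,
                     minimum: int = 1, maximum: int = 4,
                     high: float = 80.0, low: float = 20.0) -> int:
    """Simplified scaling decision: add an instance when average CPU
    exceeds `high`, remove one when it drops below `low`, and always
    stay within the group's minimum/maximum bounds."""
    if cpu_pct > high:
        return min(current + 1, maximum)   # scale out, capped at maximum
    if cpu_pct < low:
        return max(current - 1, minimum)   # scale in, floored at minimum
    return current                         # within the comfort band
```

With small instances and this kind of policy, the group grows only while load demands it, rather than paying for a large instance around the clock.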

Test Auto Scaling With Chaos Engineering

Let’s assume that you have configured autoscaling for your EC2 instances and want to simulate how your application behaves once the CPU, Memory, and I/O resources are under stress.

To simulate these scenarios, Harness Chaos Engineering provides three out-of-the-box experiments to test autoscaling behavior.

Experiments

EC2 CPU hog | Harness Developer Hub

EC2 memory hog | Harness Developer Hub

EC2 IO stress | Harness Developer Hub

Once you run the above experiments, you can verify whether your instances autoscale when resource usage exceeds a certain threshold. For example, when CPU utilization rises above 80%, the group scales out to meet the increased load and then returns to its minimum size afterward. This demonstrates that you don't need to go for large instances; smaller instances with autoscaling can scale your workload up and down as necessary.
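One way to turn this verification into a pass/fail check after a CPU-hog experiment is to sample CPU utilization and group capacity over time, then assert that the group both scaled out under load and settled back to its minimum. A simple sketch, where the sampling format and the 80% threshold are assumptions:

```python
def autoscaling_validated(samples, cpu_threshold=80.0, min_size=1):
    """`samples` is a list of (cpu_pct, group_capacity) tuples taken
    during and after the chaos experiment. The check passes if capacity
    grew while CPU exceeded the threshold, and the group ended back at
    its minimum size once the stress was removed."""
    scaled_out = any(cpu > cpu_threshold and cap > min_size
                     for cpu, cap in samples)
    scaled_in = samples[-1][1] == min_size
    return scaled_out and scaled_in
```

In practice the samples would come from CloudWatch metrics and the Auto Scaling group description, but the acceptance logic stays this simple.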

2: Switching Between On-demand vs Spot Instances

On AWS, new EC2 instances are launched as On-Demand Instances by default, meaning you are charged a fixed hourly rate for as long as the instance runs. On-Demand Instances are ideal for stateful workloads and require no long-term commitment. However, Spot Instances offer a more economical option for stateless workloads or those that can handle intermittent availability.

Spot Instances utilize unused EC2 capacity at a significantly lower cost (up to 90% cheaper than On-Demand Instances), making them a budget-friendly choice even compared to Reserved Instances (RIs). Unlike RIs, which require a significant upfront commitment, Spot Instances use the same pay-as-you-go model as On-Demand Instances.
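As a quick illustration of the math (the hourly rate and discount below are hypothetical, not actual AWS prices):

```python
def monthly_savings(on_demand_hourly: float, spot_discount_pct: float,
                    hours: int = 730) -> float:
    """Savings from running one instance on Spot instead of On-Demand
    for a month (~730 hours), given an assumed Spot discount."""
    spot_hourly = on_demand_hourly * (1 - spot_discount_pct / 100)
    return (on_demand_hourly - spot_hourly) * hours

# Hypothetical: a $0.096/hr instance at a 70% Spot discount
print(round(monthly_savings(0.096, 70), 2))  # ≈ $49.06 saved per instance per month
```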

The trade-off is that AWS can reclaim a Spot Instance at any time when it needs the capacity back. AWS provides a two-minute warning before interrupting the instance, giving you a brief window to safely stop workloads and transfer data. Additionally, Spot Instance pricing fluctuates with demand, so costs may vary over time.
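That two-minute warning is exposed through the instance metadata service at `/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled. A small sketch of interpreting that response (the HTTP fetch itself is left out, so the function stays self-contained):

```python
import json

def parse_interruption_notice(status_code: int, body: str):
    """Interpret a response from the EC2 instance metadata endpoint
    /latest/meta-data/spot/instance-action. A 404 means no interruption
    is scheduled; otherwise the JSON body carries the planned action
    ("stop" or "terminate") and the time it will happen."""
    if status_code == 404:
        return None
    notice = json.loads(body)
    return notice["action"], notice["time"]
```

A shutdown hook that polls this endpoint every few seconds has roughly two minutes to drain connections and checkpoint state once the notice appears.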

Validate Spot Instances Compatibility With Chaos Engineering

Suppose a distributed, replicated application runs on EC2 On-Demand Instances.

As you know, AWS can terminate Spot Instances at any time. You can use Harness Chaos Engineering to run a similar scenario by restarting an On-Demand EC2 instance and validating that your application still meets your threshold criteria. If your application withstands the loss of an instance because it is distributed and follows well-architected principles, you can convert those EC2 instances to Spot Instances and reduce your cloud spending without compromising application performance. Harness Chaos Engineering provides an out-of-the-box experiment for this scenario, listed below.
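The reason a replicated application survives this experiment is that clients can fail over to another replica when one node disappears. A minimal sketch of that failover logic (the endpoint names and the `fetch` callable are illustrative, not part of any real API):

```python
def first_available(replicas, fetch):
    """Try each replica endpoint in turn, tolerating individual node
    loss. `fetch` is a callable that raises ConnectionError when a node
    is down (e.g. mid chaos experiment) and returns a response otherwise."""
    for endpoint in replicas:
        try:
            return fetch(endpoint)
        except ConnectionError:
            continue  # this replica is down; try the next one
    raise RuntimeError("all replicas unavailable")
```

If the experiment shows requests still succeed while one instance is restarting, the workload is a reasonable candidate for Spot.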

Experiments

EC2 process kill | Harness Developer Hub

3: Identifying Unused Resources and Terminating Them

We have often observed that organizations accumulate unused infrastructure over time, leading to unnecessary costs. Instances created for testing or migrations may be forgotten, or artifacts from configuration scripts may linger without documentation.

However, some of these instances may be critical, and shutting them down could disrupt services. To avoid outages, Chaos Engineering can be used to test their importance by throttling or blocking network traffic to these instances and observing any impact on applications.

Testing Essential and Utilized Instances with Chaos Engineering

If an instance is truly critical to your application, throttling or blocking its network traffic will immediately affect performance and functionality. Harness Chaos Engineering provides two out-of-the-box experiments for these scenarios, listed below.

Experiments

EC2 network latency | Harness Developer Hub

EC2 network loss | Harness Developer Hub

After running the above experiments, if the application performs as expected, the instance subjected to network latency and network loss is likely not essential to the application and can be terminated, reducing your cloud spend.
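A simple way to turn "performs as expected" into a pass/fail signal is to compare request latency before and during the experiment against your SLO. A sketch with illustrative thresholds (the 500 ms SLO and p95 measure are assumptions):

```python
def instance_is_critical(baseline_ms, during_chaos_ms, slo_ms=500.0):
    """Compare p95 latency samples (in ms) from before and during a
    network-latency/loss experiment on the suspect instance. If latency
    was within the SLO at baseline but breaches it under chaos, the
    instance is likely critical and should not be terminated."""
    def p95(samples):
        ordered = sorted(samples)
        return ordered[int(0.95 * (len(ordered) - 1))]
    return p95(during_chaos_ms) > slo_ms and p95(baseline_ms) <= slo_ms
```

If this returns False, the degraded instance had no measurable effect on user-facing latency, strengthening the case for removing it.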

Conclusion

For more detailed use cases and technical documentation, please refer to the Chaos Engineering Docs.

Sign up for FREE to experience the ease of resilience verification using chaos experiments. The free plan lets you run a limited number of chaos experiments at no charge, with no time limit, bringing chaos engineering to the community.

Harness Chaos Engineering ROI Calculator helps estimate business losses from outages and evaluates the ROI of chaos engineering practices. Simulating failures and optimizing recovery improves system reliability and reduces downtime, providing a clear financial benefit to organizations.
