January 26, 2023

Harness Chaos Engineering (CE) Key Capabilities

Harness Chaos Engineering provides robust features like CI/CD integration, unified experimentation across platforms, automated steady state measurement, and advanced observability. These capabilities help organizations improve system reliability, streamline chaos testing, and minimize downtime, ensuring a resilient infrastructure.

Chaos engineering helps organizations minimize the financial and reputational impact of unplanned downtime. It also lets developers focus on software delivery rather than fire-fighting production incidents. Chaos experiments go beyond traditional unit, integration, and system tests and more closely represent random failures in a real-world production environment. This realistic environment provides insight into how systems behave, equipping teams to understand weaknesses in their applications and infrastructure and to proactively build resilience that helps prevent costly downtime. This blog looks closely at the key capabilities of Harness CE to see how it helps teams solve these challenges.

Harness CE provides: 

  • Chaos orchestration in CI/CD pipelines for Continuous Resilience™
  • Unified experimentation across cloud providers and self-hosted platforms
  • Steady state measurement for baselining and improving reliability
  • Observability and ecosystem integration for visibility
  • Robust experiment control methods for safe testing and automatic recovery rollbacks
  • GameDay portal to proactively train your on-call team for incident response
  • Enterprise dashboards, analytics, logs, and reports for clear communication
  • Enterprise-grade audit trail and role-based access control (RBAC) for security
  • Enterprise support to help businesses scale the practice quickly

Let’s dive deeper into the capabilities that teams can leverage to increase reliability.

[Figure: Harness Chaos Engineering workflow]

Chaos Orchestration in CI/CD Pipelines

Achieve Continuous Resilience™ through the native platform integration between Harness CE and Continuous Delivery (CD). Powered by the CNCF project LitmusChaos, this integration makes it easier for developers and SREs to test the resilience of applications in software delivery pipelines, improving overall reliability and minimizing the risk of unplanned downtime.

[Figure: Chaos orchestration in CI/CD pipelines]

Unified Experimentation Platform

Implement chaos engineering using our SaaS, self-hosted, on-premises, or air-gapped deployments to align with your business and security requirements. Harness supports injecting faults into multiple platforms and environments. The Enterprise ChaosHub is a catalog of advanced experiments with coverage across VMware, AWS, GCP, Azure, and serverless platforms, plus a full range of Kubernetes chaos experiments. Users can manage, edit, schedule, and run experiments within the UI for improved collaboration. Harness provides the largest and most diverse catalog of chaos experiments available today, with more added monthly.

Chaos Orchestration and Reliability Management

Chaos orchestration enables users to build a CE practice quickly by letting the Harness solution fill the gaps in the organization's knowledge, processes, and tools. Use Harness CE to train new and existing employees and level everyone up on software reliability.

Roll out chaos engineering to the entire enterprise from a Git repository instead of waiting years to adopt the CE practice team by team. Starting the whole enterprise on the practice at once scales software reliability to every application. GitOps and CI/CD integrations automate much of the complexity, and declarative YAML files for chaos experiments meet developers where they are and improve the developer experience.

The GitOps feature enables you to configure a single source of truth for your chaos experiments and execute them directly from Git, allowing a vast scope of automation in CI/CD pipelines.

A team can manage reliability through the resilience score: define, measure, and tune each experiment, track resiliency over time, and automate the evaluation of experiment results.
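As a rough illustration of how a resilience score could be tracked, the sketch below assumes the score is a weighted average of probe success percentages across the faults in a run. This is modeled on how the open-source LitmusChaos project describes its resilience score and may not match the exact Harness CE calculation. The `FaultResult` structure, the fault names, and the weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FaultResult:
    """Outcome of one fault in an experiment run (hypothetical structure)."""
    name: str
    weight: int                # relative importance assigned to the fault
    probe_success_pct: float   # share of probes that passed, from 0 to 100


def resilience_score(results: Sequence[FaultResult]) -> float:
    """Weighted average of probe success percentages (assumed formula)."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        raise ValueError("at least one fault must have a non-zero weight")
    weighted = sum(r.weight * r.probe_success_pct for r in results)
    return weighted / total_weight


# Illustrative run: the heavier fault passed all probes, the lighter one half.
run = [
    FaultResult("pod-delete", weight=10, probe_success_pct=100.0),
    FaultResult("network-latency", weight=5, probe_success_pct=50.0),
]
score = resilience_score(run)
print(f"resilience score: {score:.1f}")  # (10*100 + 5*50) / 15 = 83.3
```

Recording this number for every run gives a single trend line per experiment, which is what makes it possible to compare reliability over time.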

Steady State Measurement

Rather than have developers manually watch monitoring dashboards and keep “eyes on glass” with multiple browser tabs open, Harness CE provides probes that automate the measurement of an experiment. Probes are editable checks you can define for any chaos experiment to measure its success and failure conditions. Examples of chaos probes include simple queries against application health checks and system steady-state metrics.
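To make the idea of a probe concrete, here is a minimal Python sketch of a health-check probe with retries. It is not the Harness CE probe API: `run_probe`, `http_health_check`, and their parameters are hypothetical names chosen for illustration, and the example run uses a simulated check so it needs no network access.

```python
import time
import urllib.request
from typing import Callable


def http_health_check(url: str, timeout_s: float = 2.0) -> Callable[[], bool]:
    """Build a check that passes when `url` answers with HTTP 200."""
    def check() -> bool:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    return check


def run_probe(check: Callable[[], bool], attempts: int = 3,
              interval_s: float = 0.0) -> bool:
    """Return True as soon as `check` passes, trying at most `attempts` times."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # an error (timeout, refused connection) counts as a failure
        if attempt < attempts - 1:
            time.sleep(interval_s)
    return False


# Simulated service that recovers on the third check.
responses = iter([False, False, True])
passed = run_probe(lambda: next(responses), attempts=3)
print("steady state held:", passed)  # True, on the third attempt
```

Against a real service, the same probe would be run as `run_probe(http_health_check(url), attempts=5, interval_s=2.0)`, where `url` is the application's health endpoint.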

[Figure: Harness Chaos Engineering Steady State Measurement]

GameDay Portal

A GameDay is a series of experiments that serves a purpose, such as:

  • New engineer training for on-call rotation
  • Exploring unknown failure modes in a system
  • Migrating a system to a new technology, as a way to educate the team on that technology
  • Incident re-creation to validate code fixes
  • Validation of Disaster Recovery exercises

The Harness Chaos Engineering platform’s GameDay feature lets you assemble a set of experiments to run with a team. Defining a GameDay as a template makes it repeatable. The feature enables a user to start, stop, and re-run experiments within one UI, allowing a team to test in small increments of failure. The team can also take notes, record observations, and create a checklist of follow-up tasks, which can be added to a ticketing system.

Experiment Control Methods

Harness provides declarative chaos experiments, so you can define their configuration in a code repository, version it, and edit it through automation. This declarative approach empowers developers to build and automate reliability in their code.

Harness Chaos Engineering enables you to run faults in parallel (for example, a CPU fault together with a memory fault) to mimic real-world events. You can also run entire chaos experiments in parallel to model complex IT outages, which often stem from multiple failure modes.

Run different experiments on different targets to simulate cascading failure across larger sets of services. For example, you can cause a network disruption in one cloud provider’s availability zone while simultaneously running a resource exhaustion experiment, simulating traffic moving over to the redundant system.
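The parallel pattern described above can be sketched in a few lines of Python. This is only an illustration of running two faults concurrently against different targets: `inject_fault` is a hypothetical stand-in that sleeps instead of injecting anything, and the fault and target names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def inject_fault(name: str, target: str, duration_s: float) -> dict:
    """Stand-in for a real fault injection; it only sleeps for the duration."""
    started = time.monotonic()
    time.sleep(duration_s)
    return {
        "fault": name,
        "target": target,
        "elapsed_s": time.monotonic() - started,
    }


# One fault per target, started together rather than one after another.
faults = [
    ("network-loss", "zone-a", 0.2),
    ("cpu-stress", "zone-b", 0.2),
]

with ThreadPoolExecutor(max_workers=len(faults)) as pool:
    futures = [pool.submit(inject_fault, *fault) for fault in faults]
    results = [future.result() for future in futures]

for result in results:
    print(f"{result['fault']} on {result['target']}: {result['elapsed_s']:.2f}s")
```

Because both faults overlap in time, the system under test sees the combined failure rather than two isolated ones, which is the point of modeling multi-cause outages.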

Lastly, you can abort an in-flight experiment that causes an impact beyond the desired test expectation. Aborts can be triggered manually, or automatically through abort conditions built from probes on the health metrics of the system under test, and recovery scripts can be automated.
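A minimal sketch of an automatic abort condition follows. It assumes the health metric is an error rate sampled during the experiment and that recovery is a single callable. The function name `run_with_abort`, the threshold, and the sample values are hypothetical and not taken from Harness CE.

```python
from typing import Callable, Iterable, List


def run_with_abort(health_samples: Iterable[float], threshold: float,
                   rollback: Callable[[], None]) -> str:
    """Watch health samples during an experiment; abort on the first breach."""
    for sample in health_samples:
        if sample > threshold:
            rollback()  # run the recovery step before reporting the abort
            return "aborted"
    return "completed"


recovery_log: List[str] = []

# Error rates observed while the fault is active; the third exceeds 5%.
observed_error_rates = [0.01, 0.02, 0.09, 0.01]
outcome = run_with_abort(
    observed_error_rates,
    threshold=0.05,
    rollback=lambda: recovery_log.append("fault removed, replicas restored"),
)
print(outcome, recovery_log)  # aborted ['fault removed, replicas restored']
```

Stopping at the first breach keeps the blast radius small: the remaining samples are never consumed, and the recovery step runs exactly once.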

Observability and Ecosystem Integrations

Harness CE can send chaos metrics to popular observability and application performance monitoring (APM) solutions, so developers can fit it into their existing reliability ecosystem. This reduces developer toil because Harness CE plugs into the systems they already use. Supported integrations include Prometheus, Grafana, Dynatrace, Keptn, and more. Besides observability and monitoring integrations, you can integrate with load-testing tools or run your own tests with a custom script.

Enterprise Dashboards, Analytics, and Reports 

Different roles require different views of dashboards and reports. Executives might want a high-level risk assessment on a single dashboard. An engineering manager might want to see the reliability status of all services. In either case, Harness CE has the experiment data, analytics, and reporting capabilities needed to be the centralized source for reliability.

Enterprise-Grade Audit Trails and RBAC

Harness has built a reputation in the CI/CD industry for having detailed audit trails and fine-grained RBAC. These audit trails make it quick and easy for engineering teams to pass audits, often turning what would be days of effort into just a few hours. Our fine-grained RBAC model means that you can implement a permissions system that meets your organization's needs, no matter how complex.

Enterprise Support 

Harness recognizes that enterprises need to move fast and scale quickly to meet the demands of their business, so we’re equipped to offer enterprise support to ensure your chaos engineering practice can begin as quickly and safely as possible. Harness CE was built by the same team of experts that created the CNCF open-source project, LitmusChaos. This team is ready to support SaaS, on-premises, self-hosted, or air-gapped installations and provide onboarding assistance, feature enhancements, chaos best practices, and custom tooling integration for CI/CD and observability platforms.

Start Improving Software Reliability Today with Harness Chaos Engineering

Getting started with chaos engineering has never been so simple. If you are ready to see how your organization can adopt this practice and improve reliability, request a demo and sign up for the SaaS trial today!
