Chaos Engineering Features

Chaos Platform

Centralized Chaos Portal

A unified UI to manage all chaos engineering activities across all infrastructures and environments. Orchestrate reliability tests across multiple targets, report across services, and safely control any impacts observed from chaos experiments. 

SaaS

Chaos-as-a-Service enables you to deploy chaos experiments to your environment quickly, safely, and securely while Harness manages the chaos infrastructure for you. 

Self-hosted

Quickly set up a self-hosted, on-premises deployment in your own data center or on a cloud provider-managed service.

Air-gapped

You can leverage the air-gapped deployment capability if your organization requires complete isolation from the Internet. Teams can run experiments without connecting to external resources or dependencies.

Experiments and ChaosHubs

Experiments

A chaos experiment is a set of chaos faults coupled together to achieve a desired impact on the system. While an experiment runs, users can observe real-time data and experiment status, along with valuable information such as pod logs and chaos results.

Enterprise ChaosHubs

Harness Chaos Engineering’s Enterprise ChaosHub is a catalog of advanced experiments across multiple cloud providers and platforms, including VMware, AWS, GCP, Azure, and full coverage of Kubernetes chaos experiments. Users can manage, edit, schedule, and run experiments within the UI. Users can also create experiments, contribute to their organization’s ChaosHub, and share with other teams.

Create Your Own ChaosHubs

Users can create a ChaosHub to orchestrate experiments from an alternate public or private source. With a ChaosHub, you can construct your chaos experiments by selecting, tuning, and sequencing faults. You can make changes in your repositories and sync them with the Chaos Portal. Leveraging a public ChaosHub enables you to contribute experiments and ideas to the LitmusChaos open-source community. Private ChaosHubs keep your source of experiments specific to your organization.

Observability

Metrics

Harness Chaos Engineering facilitates real-time monitoring of events and metrics using its native chaos exporter to Prometheus. Developers can overlay the events and metrics exported to time-series databases on application performance graphs to correlate chaos impact with system performance, and can build additional visualizations for chaos testing statistics and reporting.
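
As a rough sketch of how these metrics can be collected, a Prometheus scrape job can point at the chaos exporter's metrics endpoint. The job name, service address, and port below are assumptions; use whatever address your exporter actually exposes.

```yaml
# prometheus.yml fragment -- illustrative only.
# The target address and port are placeholders for wherever the chaos
# exporter's /metrics endpoint is reachable in your environment.
scrape_configs:
  - job_name: chaos-exporter                  # hypothetical job name
    scrape_interval: 15s
    static_configs:
      - targets:
          - chaos-exporter.litmus.svc:8080    # placeholder service:port
```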

Steady State Measurement

Probes

Probes measure the steady state of your system during chaos experiments and help you understand an experiment's impact on the target system and its dependencies. They give you feedback from the experiment, so you don’t have to manually monitor or observe impacts on the tested systems.

In addition, you can leverage probes to abort the experiment when test conditions have been met and execute a script to kick off a load test or recover the system. Probes can continuously perform requests for verification.

Types of Probes

HTTP 

The HTTP probe queries the health response from a service through a URI (e.g., HTTP 200, 500) to measure the impact of the chaos experiment.
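
A minimal sketch of what an HTTP probe definition can look like, following the LitmusChaos-style probe schema that Harness Chaos Engineering builds on; the probe name, URL, and expected response code are placeholders, and exact field names and units can vary by schema version.

```yaml
# Illustrative httpProbe fragment (LitmusChaos-style schema).
probe:
  - name: check-frontend-health                    # hypothetical probe name
    type: httpProbe
    mode: Continuous                               # poll throughout the experiment
    httpProbe/inputs:
      url: http://frontend.app.svc:8080/healthz    # placeholder health endpoint
      method:
        get:
          criteria: ==                             # compare the observed response code
          responseCode: "200"                      # expected healthy response
    runProperties:
      probeTimeout: 5s                             # run-property names/units vary by version
      interval: 2s
```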

CMD

The CMD probe executes health check functions implemented as a shell command and enables custom, experiment-specific validations. Users can leverage this probe to start another process or run a script that recovers the system if it fails.
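
A sketch of a command probe under the same assumptions; the script path and threshold are placeholders for whatever custom check or recovery command you run.

```yaml
# Illustrative cmdProbe fragment (LitmusChaos-style schema).
probe:
  - name: verify-queue-depth                 # hypothetical probe name
    type: cmdProbe
    mode: Edge                               # run before and after the fault
    cmdProbe/inputs:
      command: ./check-queue-depth.sh        # placeholder script or shell command
      comparator:
        type: int
        criteria: "<="
        value: "100"                         # tolerated queue depth
    runProperties:
      probeTimeout: 10s
      interval: 5s
```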

Kubernetes

The Kubernetes probe executes CRUD operations against native and custom Kubernetes resources. Users can read and write resource changes to Kubernetes based on the experiment's results, such as configuring a YAML file to tune the system's reliability.
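
A sketch of a Kubernetes probe that checks for the presence of pods matching a label; the namespace and label selector are placeholders.

```yaml
# Illustrative k8sProbe fragment (LitmusChaos-style schema).
probe:
  - name: verify-replicas-present      # hypothetical probe name
    type: k8sProbe
    mode: EOT                          # post-chaos check
    k8sProbe/inputs:
      group: ""                        # core API group
      version: v1
      resource: pods
      namespace: app                   # placeholder target namespace
      labelSelector: app=frontend      # placeholder selector
      operation: present               # create | delete | present | absent
    runProperties:
      probeTimeout: 10s
      interval: 2s
```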

Prom

The Prom probe executes a Prometheus query (PromQL) and matches the returned metrics against specified criteria. This probe enables a user to report on native metrics captured during the experiment.
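
A sketch of a Prometheus probe; the endpoint, query, and threshold are placeholders for your own SLO check.

```yaml
# Illustrative promProbe fragment (LitmusChaos-style schema).
probe:
  - name: error-rate-within-slo        # hypothetical probe name
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc:9090         # placeholder endpoint
      query: sum(rate(http_requests_total{code=~"5.."}[1m]))  # placeholder PromQL
      comparator:
        criteria: "<"
        value: "5"                     # tolerated 5xx rate during chaos
    runProperties:
      probeTimeout: 5s
      interval: 10s
```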

Modes of Probes

Start of Test (SOT) 

Executed as a pre-chaos fault check at the start of an experiment. This mode enables you to get one measurement at the beginning of the test, which you can use for a status to continue or abort the experiment.

End of Test (EOT)

Executed as a post-chaos fault check at the end of an experiment. Get one measurement at the end of the test to help determine whether the experiment succeeded or failed.

Edge

Executed before and after the chaos fault during an experiment to get a snapshot of system health to understand changes in your environment during the test.

Continuous

Executed continuously, with a specified polling interval during the experiment. This mode enables continuous feedback from your system during an experiment.

OnChaos

Executed continuously, with a specified polling interval strictly for the duration of the chaos fault. This mode enables you to get feedback during the actual impact of the experiment.

Experiment Control Methods

Declarative Chaos Experiments

CRUD- and YAML-based experiment execution enables users to define experiment configuration in a code repository and to version and edit it through automation. This declarative approach empowers developers to build and automate reliability in their code.
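
As a rough illustration of the declarative format, the LitmusChaos-style ChaosEngine manifest below (the CRD Harness Chaos Engineering builds on) targets a deployment by label and runs a pod-delete fault; the names, namespaces, labels, and service account are placeholders.

```yaml
# Illustrative ChaosEngine manifest (LitmusChaos-style CRD).
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: frontend-chaos                 # hypothetical engine name
  namespace: app
spec:
  engineState: active
  appinfo:
    appns: app                         # target namespace (placeholder)
    applabel: app=frontend             # target label (placeholder)
    appkind: deployment
  chaosServiceAccount: litmus-admin    # placeholder service account
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"              # run the fault for 60 seconds
```

Because the definition is plain YAML, it can be stored in a Git repository, versioned, reviewed, and applied through automation alongside the rest of your delivery code.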

Run Chaos Faults in Parallel

Complex IT outages often stem from multiple failure modes. A simple network disruption turns into a rolling failure across numerous services. Harness Chaos Engineering enables you to run faults in parallel (CPU fault + Memory fault) to mimic real-world events.
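
As a rough sketch, and assuming an Argo-Workflows-style step layout for the underlying chaos workflow (an assumption about the format, not a confirmed Harness schema), placing two faults in the same step group starts them together:

```yaml
# Illustrative workflow fragment. Entries within one step group run in parallel.
templates:
  - name: chaos-in-parallel
    steps:
      - - name: cpu-hog                # first member of the parallel group
          template: pod-cpu-hog        # placeholder fault template
        - name: memory-hog             # starts at the same time as cpu-hog
          template: pod-memory-hog     # placeholder fault template
```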

Running Chaos Experiments in Parallel

Run various experiments on different targets to simulate cascading failure across more extensive sets of services. This ability enables you to cause a network disruption on one cloud provider’s availability zone and simultaneously run a resource exhaustion experiment, simulating traffic moving over to the redundant system.

Chaos Experiments for Multiple Kubernetes Clusters

Experiments can run on multiple Kubernetes clusters without creating various sets of experiments. This flexibility allows for precise control of large services that span multiple clusters.

Ability to Abort an Inflight Chaos Fault and Experiment

An essential safety feature is the ability to abort a fault or experiment whose impact goes beyond the desired test expectation. Users can abort manually, or automatically by setting up abort conditions with probes defined on the tested system’s health metrics, and can automate recovery scripts.

Chaos Orchestration and Reliability Management

Event-Driven Chaos Injection

GitOps enables you to configure a single source of truth for chaos experiments. Any changes to artifacts stored in the configured git repository allow you to create and execute chaos experiments directly through git automation in CI/CD pipelines.

GitOps also supports event-driven chaos injection, in which target resources can be configured to automatically trigger chaos experiments on any change to the resource spec. Currently, the supported events for chaos injection include resource image changes, configuration changes, changes in replica count, and more. Event-driven chaos injection allows Harness to integrate with a traditional GitOps flow that involves automated deployment of applications or workloads, and chaos experiments can be triggered automatically based on your organization’s needs.

Schedule Chaos Experiments Directly from a ChaosHub

Users can automate a chaos experiment through the Harness Chaos Engineering scheduler, which leverages a cron job based on the frequency, duration, and impact set up in the experiment. For example, a daily experiment can validate that your application withstands the common failure mode of an application restart.
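
As an illustration of the cron-based frequency (the surrounding field layout is a sketch, not the exact Harness scheduler schema), the expression below runs an experiment every day at 08:00:

```yaml
# Illustrative schedule fragment -- field names are placeholders.
schedule:
  cron: "0 8 * * *"                    # minute hour day-of-month month day-of-week
  experiment: app-restart-resilience   # hypothetical experiment name
```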

Reliability Management with a Resiliency Score

A resiliency score measures your system’s resilience, taking into account all the chaos experiments and their results. Harness Chaos Engineering enables teams to define, measure, tune, and customize each experiment to track resiliency over time and automate experiment results. Each experiment can be assigned a different weight to signify a low-, medium-, or high-priority test. Pairing a defined score with a consistently executed experiment gives you trends in health metrics that show how the system behaves during failure events, and signals can be sent to users when scores deviate from their typical values.
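
As an illustration of how weighting can play out (a common weighted-average reading of the score, not the exact Harness formula), consider three experiments with different priorities:

```yaml
# Illustrative weighting only -- not the exact Harness formula.
experiments:
  - name: pod-delete          # passed
    weight: 10                # high priority
  - name: pod-cpu-hog         # passed
    weight: 5                 # medium priority
  - name: network-latency     # failed
    weight: 2                 # low priority
# Crediting passed weights against the total:
#   (10 + 5) / (10 + 5 + 2) * 100 ≈ 88
# so a failure in a low-priority test lowers the score far less than a
# failure in a high-priority test would.
```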

GameDay Portal

A GameDay is a series of experiments that serves a purpose, such as:

  • New engineer training for on-call rotation
  • Exploring unknown failure modes in a system
  • Learning a new technology as part of migrating a system to it
  • Incident re-creation to validate code fixes
  • Validation of Disaster Recovery exercises

The Harness Chaos Engineering platform’s GameDay feature constructs experiments to test with a team. You can make a GameDay repeatable by defining it as a template. The feature enables a user to start, stop, and re-run experiments within one UI, allowing a team to test in small increments of failure. The team can also record notes and observations and create a checklist of tasks to complete, which can be added to a ticketing system.

Chaos Integration

Integrated with Harness Continuous Delivery (CD)

Harness Chaos Engineering tightly integrates with the CD workflow, allowing you to see, govern and build automation that spans from pre-production to production deployments. Experiments can be automatically triggered to verify changes without impacting system reliability. This integration also provides end-to-end analytics and the creation of custom dashboard views spanning the deployment to release stages.

Integrated with Harness Continuous Verification (CV)

Harness CE, CD, and CV automate the release testing process through artificial intelligence (AI) and machine learning (ML). This verification automatically validates deployment quality to reduce risk and provide a safety net when code deploys to production.

A user can configure a CE stage within a CD pipeline to validate that specific reliability metrics aren’t impacted, and leverage CV to identify normal application behavior. The pipeline will identify and flag anomalies in future deployments and perform automatic rollbacks when reliability metrics are affected.

This integration allows engineering teams to automate reliability testing, verify risk and perform actions before moving to the next stage in a deployment.

Chaos Faults

Kubernetes

Harness CE has the most robust experiments for Kubernetes, enabling a developer, QA engineer, SRE, or platform engineer to ensure the proper operation of their application. Examples include:

Pod Delete

Disrupts the state of Kubernetes resources. Experiments can inject random pod delete failures against targeted applications.

  • Causes (forced/graceful) pod failure of random replicas of an application deployment
  • Tests deployment resiliency (replica availability) and recovery workflows of the application pod (typical tunables are sketched below)
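
A hedged sketch of the tunables such an experiment typically exposes, using LitmusChaos-style environment variable names; the values are placeholders.

```yaml
# Illustrative pod-delete tunables (LitmusChaos-style env vars).
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "30"                # run the fault for 30 seconds
          - name: CHAOS_INTERVAL
            value: "10"                # delete a pod every 10 seconds
          - name: FORCE
            value: "false"             # graceful (false) or forced (true) deletion
          - name: PODS_AFFECTED_PERC
            value: "50"                # percentage of replicas to target
```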

Pod CPU Hog

Consumes CPU resources of specified containers in Kubernetes pods.

  • Tests the application's resilience to slowness and unavailability of some replicas due to high CPU load
  • Verifies that the application pod is healthy once the fault stops and continues to serve requests despite the fault (typical tunables are sketched below)
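
A similar sketch for the CPU hog fault's typical tunables, again with LitmusChaos-style names and placeholder values.

```yaml
# Illustrative pod-cpu-hog tunables (LitmusChaos-style env vars).
experiments:
  - name: pod-cpu-hog
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION
            value: "60"                # stress CPU for 60 seconds
          - name: CPU_CORES
            value: "1"                 # cores to consume in each target pod
          - name: PODS_AFFECTED_PERC
            value: "50"                # percentage of replicas to target
```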

Cloud Providers

Harness has out-of-the-box AWS, GCP, and Azure experiments that enable a team to disrupt cloud infrastructure to validate a system’s resilience against network, resource, disk, and server failures. This ability allows users to build a chaos engineering program across multiple cloud providers for a holistic reliability practice. Examples include:

AWS ELB Availability Zone Down

The fault detaches the target availability zone(s) from an ELB, which can degrade or interrupt the service associated with that ELB in the affected zone for a specified duration.

  • Helps validate that the system auto-recovers from the loss of an availability zone, or that the on-call engineer is notified of the failure

GCP VM Disk Loss

Causes loss of a non-boot persistent storage disk from a GCP VM instance for a specified duration before attaching it back.

  • Ensures your application can handle disk loss, for example by validating that a message broker queues up messages that can be replayed once the disk is restored

Azure Instance Memory Hog

Disrupts the state of an Azure instance by hogging its memory. Experiments can inject memory stress on the target Azure instances.

  • Validates the application's performance on the Azure instance when it is subjected to increased memory consumption

VMware

Harness Chaos Engineering has a complete set of VMware-specific experiments that enable a user to validate a system’s resilience against network, resource, disk, and server failures. Examples include:

VMware Service Stop

Contains chaos that stops services in the target VMware virtual machine (VM) for a specific duration, which allows the user to verify the resiliency of the process or application running on the targeted VMs.

VMware Network Latency

Contains chaos that disrupts the network connectivity of the target VM(s), causing flaky access to applications and services by injecting network delay.

Chaos Experiment SDKs

The SDK provides a simple way to bootstrap your experiment and helps create the experiment artifacts in the appropriate directory based on an attributes file provided as input by the developer. The scaffolded file templates consist of placeholders that can be completed.

Supported SDK libraries include:

Administration

Comprehensive APIs

Harness provides well-documented REST APIs for automation across our entire platform, including onboarding new teams and projects at scale. Our APIs enable you to automate and manage your end-to-end workflow and integrate Harness into your existing tooling.

Built-in User Management and Authentication

Harness provides built-in access control features, including authentication, authorization, and auditing. It also allows you to enforce password policies, such as password strength requirements, periodic password expiration, and mandatory two-factor authentication.

Provisioning Users with Okta (SCIM)

Harness makes it easy to provision users with Okta. Using Okta as your identity provider, you can efficiently provision and manage users in your Harness Account, Org and Project. Harness' SCIM integration enables Okta to serve as a single identity manager for adding and removing users and provisioning User Groups. This feature is especially efficient for managing many users.

Provision Azure AD Users and Groups (SCIM)

By using Azure AD as your identity provider, you can efficiently provision and manage users in your Harness Account, Org and Project. Harness' SCIM integration enables Azure AD to serve as a single identity manager for adding and removing users and provisioning User Groups. This integration makes it more efficient when managing large numbers of users.

Provision Users and Groups with OneLogin (SCIM)

You can use OneLogin to provision users and groups in Harness. Harness' SCIM integration enables OneLogin to serve as a single identity manager for adding and removing users. This feature is especially efficient for managing large numbers of users.

Multiple Projects and Organizations

Manage numerous projects for business units or divisions efficiently in Harness. Harness Organizations (Orgs) allow you to group projects that share the same goal. A Harness Project is a group of Harness modules and their Pipelines. You can add an unlimited number of Harness Projects to an Org. All Projects in the Org can use the organization's resources.

Security

Single Sign-On (SSO) with OAuth 2.0

Harness supports Single Sign-On (SSO) with OAuth 2.0 identity providers, such as GitHub, Bitbucket, GitLab, LinkedIn, Google, and Azure. These integrations allow you to use an OAuth 2.0 provider to authenticate your Harness Users. Once OAuth 2.0 SSO is enabled, Harness Users can simply log into Harness using their GitHub, Google, or other provider's email address.

Single Sign-On (SSO) with SAML

Harness supports Single Sign-On (SSO) with SAML, integrating with your SAML SSO provider so you can log your users into Harness as part of your SSO infrastructure. 

Single Sign-On (SSO) with LDAP

Harness supports SSO with LDAP implementations, including Active Directory and OpenLDAP. Integrating Harness with your LDAP directory enables you to log your LDAP users into Harness as part of Harness' SSO infrastructure. Once you integrate your Harness account with LDAP, you can create a Harness User Group and sync it with your LDAP directory users and groups. The users in your LDAP directory can then log into Harness using their LDAP emails and passwords.

Two-Factor Authentication (2FA)

Harness provides support for 2FA throughout the Harness Software Delivery Platform, with enforcement at both the individual user level and the account-wide level. 2FA setup with Harness is easy: a smartphone-based process with QR codes handles initial setup, and username/password is used for all subsequent logins once configured.

Audit Trail (Two-Years Data Retention)

Harness Audit Trails provide the visibility needed to meet organizational governance needs and prepare for external audits. With Harness Audit Trails, you can view and track changes to your Harness resources within your Harness account, with data retained for up to two years. Without this data, developers are forced to compile information for audits manually.

Governance 

Policy-Based Governance (OPA)

Harness Policy as Code is a centralized policy management and rules service that leverages the Open Policy Agent (OPA) to meet compliance requirements across software delivery and enforce governance policies. Policies are written as declarative code, so they are easy to understand and modify, giving teams autonomy over their processes while oversight and guardrails prevent them from straying from standards. Teams can use Policy as Code to implement global governance policies across all releases and combine it with pipeline governance to implement policies per release.

RBAC (Role-based Access Control) - Built-in Roles and Custom Roles

Harness provides fine-grained RBAC to enforce the separation of duties and control which user groups are granted access to specific resources based on assigned roles. RBAC allows businesses to protect their data and critical processes through rules and roles. Built-in roles are available by default to quickly grant the desired permissions at the account, organization, and project levels within Harness, and custom roles can be created for additional flexibility when business needs fall outside the scope of the default roles.

Pipeline Governance

Pipeline Governance measures how compliant your feature release pipelines are compared to your regulatory and operations standards. As a deployment pipeline is triggered within Harness, the deployment can require approval before releasing to production based on a “score” that indicates how compliant a given pipeline is. This score is made up of individual weighted tags that together determine the level of compliance.

Integrated Secrets Management

Harness Chaos Engineering utilizes secrets for chaos infrastructure and can leverage the platform’s secrets management feature, so users don’t have to store that information on their laptops. This feature can also integrate with a third-party secrets management tool.

IP Address Allowlist Management

Harness Chaos Engineering can be configured with an IP address allowlist to ensure that experiments run only against specific targets. This functionality enables a team to explicitly allow intended targets and deny those that are not meant to have chaos experiments run on them.

Hosting

Harness provides a flexible hosting model that allows for full SaaS implementations, full on-premises implementations, and hybrid implementations. These models allow companies with a variety of security requirements to use Harness Chaos Engineering.

Hosting includes:

  • SaaS or on-premises deployment options available
  • Automatic backups and disaster recovery 
  • SLA guarantee

Support

Community

Those using open source and source-available products from Harness can access community.harness.io. Both Harness staff and users contribute to our knowledge base.

Standard

Harness standard support for all Harness customers on a paid contract includes coverage from 9 AM to 5 PM PT Monday through Friday, with response times indicated in the table below based on the severity of the need. Support entitlements are provided for two named admins for each customer. 

Premier 

Harness premier support for all Harness customers on a paid contract includes coverage 24 hours a day, seven days a week, with response times indicated in the table below based on the severity of the need. Also included at the Premier support level are Zoom-based communication and post-incident reports. Support entitlements are provided to all customer staff.