Production Incident Management

What is a Production Incident?

A production incident, commonly known as an "incident," is an unexpected event or problem that arises within our live production environments, resulting in a complete or partial service disruption. A partial incident renders one or more functions of a module nonfunctional or inaccessible.

All production incidents with a high blast radius or impact are posted on our status page (https://status.harness.io), and users can subscribe to its feeds to be notified.

These major incidents follow an escalated, all-hands-on-deck process with shorter timeframes and higher urgency to accelerate resolution.

Prod Incident Criteria:

  • P0 Incident means Harness is down or unusable for 5+ customers in a specific cluster
  • P1 Incident means a major Harness feature/function is not available to any of our users, including regressions (new releases breaking existing behavior)
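As an illustration only, the criteria above could be sketched as a small classification function. The function name, parameters, and the idea of returning None for non-incidents are assumptions for this sketch, not an official Harness API.

```python
# Hypothetical sketch of the P0/P1 criteria above; not an official Harness API.
from typing import Optional

def classify_incident(harness_unusable: bool,
                      affected_customers_in_cluster: int,
                      major_feature_unavailable: bool) -> Optional[str]:
    """Classify an issue per the criteria above; None means not a prod incident."""
    # P0: Harness is down or unusable for 5+ customers in a specific cluster
    if harness_unusable and affected_customers_in_cluster >= 5:
        return "P0"
    # P1: a major Harness feature/function is unavailable to users,
    # including regressions
    if major_feature_unavailable:
        return "P1"
    return None
```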

Here is a general workflow for how an Incident is managed at Harness.

How are incidents identified and triaged?

Incidents can be detected either through our internal alerting or by an external user reporting to CX.

Internal Alerts:

All Harness services have FireHydrant configured to create P0/P1 incidents based on the following criteria:

  • ART exceeds its threshold
  • SLI breaches
  • Synthetic job failures
  • GCP alerts
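A minimal sketch of how such alert criteria might be combined before opening an incident. The threshold value, metric field names, and function name are all illustrative assumptions; FireHydrant's real configuration works differently.

```python
# Hypothetical sketch of evaluating the internal alert criteria above before
# opening a FireHydrant incident; thresholds and field names are illustrative.

ART_THRESHOLD_MS = 2000  # assumed average-response-time threshold

def should_open_incident(metrics: dict) -> bool:
    """Return True if any of the alerting criteria above fires."""
    return (
        metrics.get("art_ms", 0) > ART_THRESHOLD_MS       # ART above threshold
        or metrics.get("sli_breached", False)             # SLI breach
        or metrics.get("synthetic_job_failed", False)     # synthetic job failure
        or metrics.get("gcp_alert_firing", False)         # GCP alert
    )
```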

Services may use other alerting systems such as:

  • Grafana/Prometheus

Once a Prod incident is detected, a War Room is created immediately, and on-call Service Owners and Engineering Leaders collaborate to triage it and drive resolution as a priority.

External Reporting

Any Harness customer is encouraged to report an issue (by opening a P0 or P1 support ticket and/or escalating it through cx-escalation@harness.io) if they believe it qualifies as a P0 or P1 production issue.

Harness CX Engineers initiate incidents for urgent issues reported by customers. (Team leads are responsible for ensuring the on-call rotation is suitable for their team's time zones, workloads, and expertise).

A customer MUST submit a self-detected incident as an URGENT issue in app.harness.io by selecting “Harness is completely down and unusable (Priority: Urgent)”.

Once submitted, our on-call rotation will be paged, and an initial response is guaranteed per the SLA outlined in Support Tiers and Definitions.

If initial triage by Harness Engineering finds that the reported issue does not constitute a P0 or P1, we will proceed with the standard support triaging workflow after discussing it with, and obtaining approval from, the user who reported the issue.

When reporting manually, customers are encouraged to share as much explicit detail as possible to accelerate initial triage efforts, for example:

  • Relevant logs (if not already sent to Harness)
  • Pipeline execution URL
  • Any changes that may have been made
  • Business impact and urgency
  • Steps to reproduce

Major Incident and Platform Availability

Anyone can view the current availability status and historical data of the Harness application, as well as subscribe to receive availability alerts through status.harness.io.

Harness tracks the status of multiple services on StatusPage across our multiple Prod clusters.

Harness' uptime SLA calculation methods can be found here: Computing uptime for Harness Modules | Harness Developer Hub.

Status Page Workflow

Harness may learn about an incident from either an external customer report or internal alerting.

Once the report is confirmed to be a major incident with a high blast radius, a message is posted to the Status Page with status Investigating. It transitions to Identified once the root cause is known and mitigation is in progress, and to Monitoring once the issue is remediated.

Once the incident has been in Monitoring status for at least an hour, it moves to Resolved. The Harness downtime SLA clock stops once the incident is patched/fixed and the status is set to Monitoring.
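The status transitions above can be sketched as a simple state machine. This is a hypothetical illustration of the lifecycle described in this section, not StatusPage's actual API; the one-hour Monitoring window comes from the text, everything else is an assumption.

```python
# Hypothetical sketch of the Status Page incident lifecycle described above.

MONITORING_WINDOW_SECONDS = 3600  # must sit in Monitoring for >= 1 hour

# Allowed transitions: Investigating -> Identified -> Monitoring -> Resolved
TRANSITIONS = {
    "Investigating": "Identified",  # root cause known, mitigation in progress
    "Identified": "Monitoring",     # issue remediated; downtime SLA clock stops
    "Monitoring": "Resolved",       # after at least an hour in Monitoring
}

def next_status(current: str, seconds_in_status: float = 0.0) -> str:
    """Advance the incident one step, enforcing the Monitoring window."""
    if current == "Monitoring" and seconds_in_status < MONITORING_WINDOW_SECONDS:
        return current  # not yet eligible to move to Resolved
    return TRANSITIONS.get(current, current)
```

For example, an incident in Monitoring for only 30 minutes stays in Monitoring, while one there for over an hour advances to Resolved.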

Postmortem

Harness will publish a Root Cause Analysis (RCA) on the status page's postmortem section within 48 hours after resolving an incident.