Introducing Harness Service Reliability Management

Today, we are announcing a new module in the Harness Software Delivery Platform that helps developers maintain high velocity while continuously improving the reliability of application services. Harness Service Reliability Management (SRM) is designed to improve collaboration and governance between engineering and reliability teams so that they can adopt a modern Site Reliability Engineering (SRE) program using Service Level Objectives (SLOs), as outlined by Google in the SRE Handbook.

Harness Service Reliability Management is for teams that want a better way to balance the velocity of feature releases and bug fixes with the stability and reliability needs of a production environment. With Harness SRM, you no longer need to choose between velocity and confidence. It helps you ensure that your best developers continue to deliver highly reliable software at high velocity while putting guardrails in place for other developers or sensitive projects.

Adopting SLO-Driven Software Delivery

Making the SRE model successful in your organization requires both a cultural shift and new knowledge and skills. Harness SRM was designed to help companies of all sizes rapidly adopt and implement an SRE model while avoiding these common challenges:

Wasted time manually tracking SLOs and error budgets.
Conflict between engineering and reliability teams due to a lack of collaboration when defining governance.
Surprise work stoppages for engineering teams that lack visibility into SLOs and error budgets.
Work stoppage negotiations between engineering and reliability teams on a service-by-service basis.
Trouble scaling site reliability engineering practices.
Problems maintaining high feature delivery velocity while also ensuring high reliability.

Achieving Excellence in SLO-Driven Software Delivery

Harness SRM is a solution for both engineering and reliability teams. Within SRM, teams collaborate to define Service Level Indicators (SLIs), SLOs, and error budgets. SRM users also create reliability guardrails within their CI/CD pipelines. These guardrails determine whether a pipeline is allowed to proceed to the next stage, and their behavior is driven by SLO and error budget data. If SLOs are violated too often, the error budget is depleted, which causes the guardrails to stop pipeline execution. Once execution is stopped, explicit approval is required before the pipeline can proceed. All of this is tracked in the SRM audit log for compliance purposes.
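To make the relationship between SLOs, error budgets, and guardrails concrete, here is a minimal sketch in Python. It is not the Harness SRM API. The names SloWindow, error_budget_remaining, and guardrail_allows are hypothetical, and the sketch assumes a simple request-based SLO, where the error budget is the number of bad requests the SLO target tolerates in a window.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO compliance window (hypothetical shape)."""
    slo_target: float     # e.g. 0.99 means 99% of requests must be good
    total_requests: int
    bad_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    # The budget is the number of bad requests the SLO target tolerates.
    allowed_bad = (1.0 - window.slo_target) * window.total_requests
    if window.bad_requests == 0:
        return 1.0  # nothing has been spent yet
    if allowed_bad <= 0:
        return 0.0  # no budget to spend, and failures have occurred
    return 1.0 - window.bad_requests / allowed_bad


def guardrail_allows(window: SloWindow,
                     min_remaining: float = 0.0,
                     approved: bool = False) -> bool:
    """Decide whether a pipeline may proceed to its next stage.

    Assumption: an explicit approval overrides a depleted budget,
    mirroring the approval step described above.
    """
    if approved:
        return True
    return error_budget_remaining(window) > min_remaining


healthy = SloWindow(slo_target=0.99, total_requests=10_000, bad_requests=50)
depleted = SloWindow(slo_target=0.99, total_requests=10_000, bad_requests=150)

print(guardrail_allows(healthy))                  # budget left, pipeline proceeds
print(guardrail_allows(depleted))                 # budget spent, pipeline stops
print(guardrail_allows(depleted, approved=True))  # explicit approval overrides
```

In this sketch, raising min_remaining makes the guardrail stricter, for example to stop deployments for a sensitive project while some budget still remains.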

To promote better production reliability, Service Reliability Checks are performed across all stages of the software delivery lifecycle. Some of these reliability checks, like native error tracking from Harness, require an agent to be added to the application service. All other reliability checks are performed via integrations to external tools (APM, log analytics, testing, etc.). The goal of these checks is to identify as many reliability issues as possible before production. If done properly, production reliability will continually improve.

Conclusion - Frenemies No More!

Reliability and engineering teams don’t want to be in conflict with each other, and now with Harness SRM, they don’t need to be. They can evolve to a new collaborative relationship where both teams work in harmony to deliver software faster with the confidence that it will be reliable.

Interested in learning more or getting started with Harness Service Reliability Management? Click here for more information.
