Delivering Reliability Through SRE Practices

Implementing Site Reliability Engineering (SRE) practices enhances continuous delivery by ensuring software is both innovative and reliable, through on-call playbooks, canary deployments, and monitoring key health metrics like mean time to restore and change failure rate.

When we talk about continuous delivery, the part that gets left out is how we deliver sustainably and repeatedly. Reliability is everyone’s responsibility, but sometimes this statement can feel like it is at odds with innovating and delivering features quickly. Site Reliability Engineering (SRE) has been a hot topic this year as more guides around the role were shared, and more people stepped into this role than ever before.

When discussing software delivery, it’s crucial to discuss SRE and to introduce software design, implementation, and maintenance practices that let innovation and reliability work succeed together. This blog shares some popular SRE practices and how to apply them to enable continuous delivery.

Being Available Before, During, and After an Incident

Site Reliability Engineers are responsible for being available during an incident. That means responding to, explaining, and retrospecting on different aspects of an incident that occurs within an organization. This can involve reviewing production workflows, alert criteria or triggers, and human processes surrounding a deployment.

One way SREs can better respond and sustain software delivery is by following an on-call playbook. An on-call playbook is a guide on how to respond to an event. It’s often a template that automatically generates a ticket with information regarding the severity of the trigger alert, debugging suggestions and actions to mitigate the impact of an incident.
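The ticket-generating template described above can be sketched in a few lines of Python. This is a minimal illustration, not a real incident-management integration: the `PLAYBOOK` entries, the `Ticket` fields, the `build_ticket` function, and the alert and service names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical playbook entries keyed by alert name. A real playbook would
# typically live in a wiki or an incident-management tool.
PLAYBOOK = {
    "high_error_rate": {
        "debugging": [
            "Check the most recent deployment",
            "Inspect error logs for the service",
        ],
        "mitigation": [
            "Roll back the latest release",
            "Shift traffic to a healthy region",
        ],
    },
}


@dataclass
class Ticket:
    title: str
    severity: str
    debugging: list = field(default_factory=list)
    mitigation: list = field(default_factory=list)


def build_ticket(alert_name: str, service: str, severity: str) -> Ticket:
    """Fill the ticket template from the playbook entry for this alert."""
    entry = PLAYBOOK.get(alert_name, {})
    return Ticket(
        title=f"[{severity}] {alert_name} on {service}",
        severity=severity,
        # Fall back to an escalation hint when no playbook entry exists.
        debugging=entry.get(
            "debugging", ["No playbook entry; escalate to the service owner"]
        ),
        mitigation=entry.get("mitigation", []),
    )


ticket = build_ticket("high_error_rate", "checkout", "SEV2")
print(ticket.title)
for step in ticket.mitigation:
    print("-", step)
```

Keeping the playbook as data rather than code means responders can update debugging and mitigation steps after a post-mortem without touching the ticketing logic.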

Another common practice is to ensure that post-mortems promote continuous improvement, influence product management, and increase visibility into action items captured during a retrospective. It’s common for SREs to follow up on and publish post-mortem improvements after an incident.

Being on call is not easy. It comes down to thinking about the people, process, and technology involved in our delivery process. When reviewing incidents, common areas to focus on include monitoring and metrics, toil and pager load, and the service application itself (whether the incident was caused by a new or an existing bug).

Defining How Code Gets into Production

The release engineering process defines how code gets into production. For SREs, this can mean defining processes, reviewing artifacts, and owning CI/CD pipelines. Release engineering is about minimizing risk, improving tempo, and automating manual processes that prevent software delivery from being repeatable.

One practice to consider in release engineering is introducing canary deployments or other forms of progressive delivery. Progressive delivery shifts a subset of user traffic from an existing service to a newly deployed one, so that problems surface before every user is affected. Canary deployments are especially useful for critical services where an incident could spell the end of a business. Many solutions exist that help integrate canary deployment capabilities into the release engineering process.
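As a rough sketch of the traffic-shifting idea, the Python below routes a fixed percentage of users to the new version and decides whether to widen or roll back the rollout. The function names, the rollout steps (5, 25, 50, 100 percent), and the 1% error threshold are assumptions chosen for illustration. In practice this logic usually lives in a load balancer, service mesh, or deployment tool rather than in application code.

```python
import hashlib


def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically assign a user to the canary based on a hash bucket."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < canary_percent


def next_canary_percent(
    current: int,
    canary_error_rate: float,
    max_error_rate: float = 0.01,   # assumed threshold for this sketch
    steps=(5, 25, 50, 100),         # assumed rollout schedule
) -> int:
    """Return the next traffic percentage, or 0 to signal a rollback."""
    if canary_error_rate > max_error_rate:
        return 0  # canary is unhealthy: send all traffic back to the old version
    for step in steps:
        if step > current:
            return step
    return 100  # already fully rolled out


percent = next_canary_percent(current=5, canary_error_rate=0.002)
print("next canary percentage:", percent)
print("user-42 on canary:", routes_to_canary("user-42", percent))
```

Hashing the user ID keeps each user on the same version for the duration of a rollout step, which makes errors easier to attribute to the canary.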

Managing Reliability

The third area that site reliability engineering focuses on involves error budgets and ownership of an application’s reliability. Practices include setting SLAs, measuring latency or performance, and improving the monitoring of an application. It’s common for an SRE to block production releases if an application exceeds a specific error threshold.
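The release-blocking rule above can be expressed as a simple error budget check. The sketch below assumes a request-based availability target. The function names, the 99.9% target, and the request counts are hypothetical, and real teams often compute this over a rolling time window.

```python
def error_budget_remaining(
    target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1 - target) * total_requests
    if allowed_failures == 0:
        # No failures are permitted, so any failure exhausts the budget.
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures


def release_allowed(target: float, total_requests: int, failed_requests: int) -> bool:
    """Block releases once the error budget has been used up."""
    return error_budget_remaining(target, total_requests, failed_requests) > 0


# With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures.
print(error_budget_remaining(0.999, 1_000_000, 250))
print(release_allowed(0.999, 1_000_000, 250))
print(release_allowed(0.999, 1_000_000, 1_200))
```

Framing the threshold as a budget gives application teams a shared number to plan around: feature releases continue while budget remains, and reliability work takes priority once it is spent.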

There can be many indicators of poor reliability, including low availability and poor delivery health metrics. Common delivery health metrics include mean time to restore, change failure rate, lead time to production, and deployment frequency. When considering metrics, look at change over time, the impact of technology solutions, and differences across teams and services to track and validate specific outcomes.
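To make the four delivery health metrics concrete, here is a minimal Python sketch that computes them from a list of deployment records. The `Deployment` record shape and the sample data are assumptions for illustration. Real pipelines would pull these timestamps from CI/CD and incident tooling, and teams differ on exactly where each measurement starts and stops.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                 # when the change was committed
    deployed_at: datetime                  # when it reached production
    failed: bool = False                   # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored


def delivery_metrics(deployments, period_days: int) -> dict:
    """Compute the four delivery health metrics over a reporting period."""
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    restored = [d for d in failures if d.restored_at is not None]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    restore_times = [d.restored_at - d.deployed_at for d in restored]
    return {
        "deployment_frequency_per_day": count / period_days,
        "change_failure_rate": len(failures) / count if count else 0.0,
        "mean_lead_time": sum(lead_times, timedelta()) / count if count else None,
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restored) if restored else None
        ),
    }


start = datetime(2024, 1, 1, 12, 0)
sample = [
    Deployment(start, start + timedelta(hours=2)),
    Deployment(
        start,
        start + timedelta(hours=4),
        failed=True,
        restored_at=start + timedelta(hours=4, minutes=30),
    ),
    Deployment(start, start + timedelta(hours=6)),
    Deployment(start, start + timedelta(hours=8)),
]
for name, value in delivery_metrics(sample, period_days=2).items():
    print(name, "=", value)
```

Running the same calculation per team or per service, and comparing results across reporting periods, supports the kind of change-over-time comparison described above.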

"Continuous Delivery is the ability to get changes of all types into production, or into the hands of users, safely and quickly in a sustainable way." -- Jez Humble

Successful software is ready, stable, agile, and valuable for users. It’s fairly common for an SRE to flag behaviors or outputs that fall short in any of these four focus areas, which can mean blocking a production release when an application exceeds a specific error threshold. In doing so, SREs give app teams an opportunity to review what is needed to improve an application’s performance and reliability.

Site Reliability Engineering as the Catalyst for Better Software Development and Operations

Site reliability engineers and SRE practices have an important role to play in the continuous delivery lifecycle. Whether it’s discovering incidents, preventing them, or resolving them, there are many ways to support your delivery lifecycle sustainably. This blog post shares some of the practices that your team or organization can employ to better support software delivery. If you’d like to learn more about how to better support your software delivery, try Harness for free.