One of the main responsibilities of the site reliability engineering (SRE) team is to ensure application reliability, meaning that applications are available and functioning as expected for their end users. One way to measure application reliability is through a service level objective (SLO).
In the world of software delivery, there is a never-ending struggle between delivering new software features quickly and providing application services that customers can depend on when they need them. Now, it's up to teams in both IT and development to support software delivery best practices with minimal disruption to customers. SLO management is intended to bring balance to software velocity and reliability needs. In this article, we’ll discuss SLO management and how using service level objectives is a key component of site reliability management.
SLO management has its own set of terms and acronyms that are important to know.
A service level objective (SLO) is a target level of service that an organization aims to provide to its customers or users. Effective SLOs ensure that an organization is delivering a high level of service to its customers and meeting business objectives. Specifically, these are goals within an organization related to the reliability of an application service. An internal SLO is not communicated outside of the organization and is not legally binding.
Service Level Objective (SLO) management involves setting and maintaining the target levels of service, as well as monitoring and measuring the actual system performance to ensure that it meets target levels. Management can involve setting targets for uptime, response time, error rates, and other performance metrics. SRE teams then monitor and analyze these metrics to identify and address any issues that may arise.
A service level indicator (SLI) provides insight into the health of a service. It is the core metric used to determine whether a specific service level objective is being met.
Service level agreements (SLAs) are legal agreements communicated to customers by business and legal teams that explain the implications if an expected service fails to meet the promised targets. For example, for system availability, if uptime is below the promised level, the service provider may be subject to paying fines or penalties to paying customers.
An error budget is a tool that helps the SRE and development teams work in tandem to control release velocity by ensuring that reliability targets are achieved. The error budget is an allowance for SLO violations that can accumulate over a certain timeframe for your service before your customers are impacted. Failures are inevitable when you constantly change your systems. Therefore, normalizing failure as a part of the process helps teams balance innovation with the risk of SLA violation.
There does not have to be a direct relationship between SLAs and SLOs, but there often is. Many reliability teams set their SLOs based on existing SLAs as a quick and easy starting point. For example, if the SLA for a service is 99.9% availability, then a good SLO for the same service could be 99.95%. The SLO is more restrictive than the SLA because the SLO should be strict enough to preemptively help the team avoid violating the SLA.
Reliability targets must take into account the business needs. SLI metrics will be provided by your monitoring or observability solutions. SLIs should reflect the customer experience of the core application or a particular service, not necessarily every individual service. Too many metrics will only hinder your team’s ability to focus.
Common SLIs include latency, availability, throughput, and error rate, which are closely related to the four golden signals of monitoring (latency, traffic, errors, and saturation). You should consider how to implement each of these. For example, latency (also known as response time) can be measured for all transactions flowing through an application or for a subset of the most important transactions (e.g., login, submit payment, add to cart).
You’ll need to pick a metric that provides a meaningful representation of your customers’ experiences and also define a threshold for that metric.
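As a minimal sketch of that choice, the snippet below measures latency only for a subset of important transactions and counts how many fall under a threshold. The `Request` type, the transaction names, and the 100 ms threshold are illustrative assumptions, not values prescribed by any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    transaction: str   # hypothetical transaction name, e.g. "login"
    latency_ms: float  # observed response time in milliseconds


# Assumed for illustration: which transactions matter and what "fast enough" means.
KEY_TRANSACTIONS = {"login", "submit_payment", "add_to_cart"}
LATENCY_THRESHOLD_MS = 100.0


def count_good_and_total(requests):
    """Return (good, total) for key transactions only."""
    relevant = [r for r in requests if r.transaction in KEY_TRANSACTIONS]
    good = sum(1 for r in relevant if r.latency_ms < LATENCY_THRESHOLD_MS)
    return good, len(relevant)


sample = [
    Request("login", 42.0),
    Request("login", 180.0),          # too slow, counts against the SLI
    Request("submit_payment", 95.0),
    Request("view_banner", 900.0),    # not a key transaction, ignored
]
good, total = count_good_and_total(sample)
print(good, total)  # 2 3
```

Restricting the SLI to a handful of customer-facing transactions keeps the metric tied to customer experience rather than to every internal call.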
SLO definition is a collaborative process driven by the reliability team. SLOs act as the principal driver of decision-making, which enables you to discover the right balance between velocity and reliability. Breaching the SLO can potentially initiate activities that ultimately put pressure on engineering to stabilize the service before releasing new features.
Each of your SLIs will have an associated SLO (a single SLO may draw on more than one SLI). The SLO you define is the percentage of requests that should comply with the threshold you defined when you set up the SLI.
Calculated SLO = (# of requests that meet the defined threshold / total number of requests) * 100
Example:
2454 login requests took less than 100 ms during a 1-minute period
2522 total login requests during that 1-minute period
Calculated SLO = (2454 / 2522) * 100
Calculated SLO = 97.3% for that 1-minute period
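The same calculation can be sketched in a few lines of Python. The function names and the 99% target below are hypothetical, chosen only to show how the calculated value would be compared with an objective.

```python
def slo_attainment(good_requests: int, total_requests: int) -> float:
    """Percentage of requests that met the SLI threshold."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement window")
    return good_requests / total_requests * 100


def meets_target(attained_percent: float, target_percent: float) -> bool:
    """True when the measured percentage is at or above the objective."""
    return attained_percent >= target_percent


# Numbers from the login example above.
attained = slo_attainment(2454, 2522)
print(f"{attained:.1f}%")            # 97.3%
print(meets_target(attained, 99.0))  # False (99.0 is an assumed target)
```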
An error budget is typically measured in minutes over a defined time period. For example, if your SLO is set to 99.5%, the associated error budget (room for failure) per week is about 50 minutes: a week has 10,080 minutes, and 0.5% of that is 50.4 minutes. Once the error budget is exhausted, teams should cease deploying new features and focus on service quality and reliability.
Overall, an error budget is a useful tool for ensuring that a system is reliable and available to users as much as possible, while also allowing for the flexibility to make necessary changes and improvements.
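To make the arithmetic concrete, here is a small sketch that derives the budget in minutes from an SLO and a time window, then checks whether it has been used up. The function names and the 30 minutes of consumed budget are assumptions made for illustration.

```python
from datetime import timedelta


def error_budget_minutes(slo_percent: float, window: timedelta) -> float:
    """Minutes of allowed SLO violation within the window."""
    window_minutes = window.total_seconds() / 60
    return window_minutes * (100 - slo_percent) / 100


def budget_exhausted(budget_minutes: float, consumed_minutes: float) -> bool:
    """True when violations have used up the whole budget."""
    return consumed_minutes >= budget_minutes


budget = error_budget_minutes(99.5, timedelta(weeks=1))
print(f"{budget:.1f} minutes")         # 50.4 minutes
print(budget_exhausted(budget, 30.0))  # False: about 20 minutes remain
```

When `budget_exhausted` returns `True`, that is the point at which the team would pause feature releases and focus on reliability.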
If there are reliability problems brewing in your environment, you want to know about them. Carefully consider what you want to be alerted on. Too many alerts or too few alerts might mean your team is missing some important reliability issues.
What’s most important is to know when reliability targets are not being met, indicating poor customer experience. You probably don’t want to get an alert on a one-minute violation once a month, because it’s not meaningful to the end user if it only happened during that narrow timeframe.
When SLO violations occur too frequently, the error budget burn rate goes up and the remaining error budget shrinks faster than planned. A rapidly shrinking error budget is the right trigger for a notification via Slack, email, or another channel.
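One common way to express burn rate, used here as an assumption rather than as the definition any specific product applies, is the share of budget consumed divided by the share of the window that has elapsed. A value of 1.0 means the budget would run out exactly at the end of the window. The alert threshold of 2.0 below is hypothetical.

```python
def burn_rate(consumed_minutes: float, budget_minutes: float,
              elapsed_days: float, window_days: float) -> float:
    """Budget consumption relative to an even spend across the window."""
    if budget_minutes <= 0 or elapsed_days <= 0 or window_days <= 0:
        raise ValueError("budget and time spans must be positive")
    consumed_fraction = consumed_minutes / budget_minutes
    elapsed_fraction = elapsed_days / window_days
    return consumed_fraction / elapsed_fraction


ALERT_BURN_RATE = 2.0  # assumed threshold; tune to your own tolerance


def should_notify(rate: float) -> bool:
    """True when the budget is burning fast enough to warrant a notification."""
    return rate >= ALERT_BURN_RATE


# Half of a 50.4-minute weekly budget gone after one day: notify.
fast = burn_rate(25.2, 50.4, elapsed_days=1, window_days=7)
print(f"{fast:.1f}", should_notify(fast))  # 3.5 True

# A single one-minute violation across the whole week: stay quiet.
slow = burn_rate(1.0, 50.4, elapsed_days=7, window_days=7)
print(should_notify(slow))                 # False
```

Alerting on burn rate rather than on each individual violation avoids paging the team for brief blips that customers would not notice.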
You might be tempted to set 100% reliability as an objective, but perfection is impossible. It’d simply mean that you choose not to make any changes in production, which is definitely not a wise business decision. Setting measurable and concrete reliability targets that allow for an appropriate rate of new feature delivery will result in happy customers. Finding this balance is central to creating the caliber of software experiences that your business needs to compete.
Establishing SLOs and creating error budgets can be a long journey, but the results are well worth the investment. Ready to learn more about SLO management? Request a demo of the Harness SRE solution, Service Reliability Management.