Harness Service Reliability Management Goes GA: Experience the Power of SLO Management Without the Pain
Harness Service Reliability Management (SRM) has officially reached General Availability (GA) status as of June 21, 2022.
Service Level Objective (SLO) management proactively alerts support teams to application reliability issues before Service Level Agreement (SLA) violations occur, helping companies avoid penalties and reputational harm. Harness Service Reliability Management (SRM) was created to help companies adopt, scale, and automate SLO management processes. Today we are announcing that Harness Service Reliability Management is now generally available.
During the public preview, we gained valuable feedback from our customers about what they need to overcome certain challenges with SLO management, along with the benefits they stand to gain from successfully adopting the practice.
Advanced, a business software and services provider and Harness customer, looks forward to the benefits Harness SRM can provide them. “We were really excited to see how Harness SRM can facilitate a new level of collaboration between our development and reliability teams,” said Martin Reynolds, Head of DevOps and IaaS at Advanced. “Problems that used to take multiple weeks to repair can be identified and fixed quickly, improving the overall reliability of our systems. Harness' SLO and Error Budget capabilities will help us avoid SLA violations and penalties, alerting service owners to potential issues, so they can take action.”
Let’s take a closer look at the ways SLO management helps companies improve the reliability of their application services and how that translates to business value.
Common Pain Points of SLO Management
Since the early days of IT, companies have searched for ways to proactively identify problems with their application services to avoid impact on customers. This led to the proliferation of observability, monitoring, and logging tools. These tools are good at detecting issues as they occur, but on their own they cannot tell teams how to adjust the velocity of software delivery so that SLA violations don’t occur.
Here are some of the most common challenges our customers asked for help solving around SLO management:
Building an SLO management practice takes time - It takes the right people with the right knowledge to build an SLO management practice. This often requires hiring new employees and giving them time to change the processes, tools, and culture within the organization.
The reliability and engineering teams are not working from the same data - The reliability team keeps an eye on SLOs while the engineering team is heads-down building software. When reliability issues occur, the engineering team is surprised when they are asked to slow down.
Reliability teams are manually managing, tracking, and taking actions on SLOs - Not only does this cause extra work for reliability engineers, but it leads to errors and inconsistent governance of software delivery pipelines.
Determining what changes have impacted SLOs is difficult and slow - When SLOs are breached, it’s crucial to identify the root cause and remediate it before SLAs are violated and penalties are incurred. With frequent changes, this can be a difficult and time-consuming task.
Scaling SLO management beyond a handful of application services is cumbersome - With so much manual process, it becomes difficult to adopt SLO management across all services that need it.
Overall reliability of applications improves, but too slowly - Even with manual processes, reliability should improve over time. Without tooling designed to accelerate this process, improvements will be slow.
Verifying the quality and reliability of individual deployments is a manual process - After each software deployment, engineers look at logs and metrics for hours to determine the quality of the software. Reliability engineers look at similar dashboards for days or even weeks to determine the reliability of each deployment.
Business Benefits of SLO Management
The SLO management practice was created to balance software reliability goals with velocity of software delivery. While some businesses may choose to focus less on reliability, others may prioritize reliability over the speed of innovation. Reliability teams, development teams, and business leaders must all agree on which approach they want to take for each application; then they set SLOs and Error Budgets appropriately to achieve these goals.
When implemented effectively, the SLO and Error Budget data can be used to provide automated guardrails throughout the software delivery lifecycle. These guardrails can control whether new software deployments are allowed or blocked. As a result, the business can achieve the desired balance between reliability and innovation, and it can do so repeatedly, at scale. Ultimately, the business benefits by keeping its customers happy with both the reliability of the applications and the pace at which new features are delivered.
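The guardrail idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Harness SRM implements it: the `SLO` class, the function names, and the 10% `min_budget` threshold are all hypothetical and chosen only for the example. The sketch computes how much of a request-based Error Budget remains and blocks a deployment when too little is left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A request-based SLO (hypothetical type for this sketch)."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good


def error_budget_remaining(slo: SLO, good: int, total: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowed_bad = (1.0 - slo.target) * total  # bad requests the SLO tolerates
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget: any failure overspends it.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def deployment_allowed(slo: SLO, good: int, total: int,
                       min_budget: float = 0.1) -> bool:
    """Allow a deployment only if at least `min_budget` of the budget is left.

    The 10% default is an assumed policy value, not a recommendation.
    """
    return error_budget_remaining(slo, good, total) >= min_budget
```

With a 99.9% target and 1,000,000 requests in the window, the budget is about 1,000 failed requests. If 500 requests have failed, roughly half the budget remains and the gate lets the deployment through; at 950 failures only about 5% remains, and the gate blocks it.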
When companies actively manage the balance between reliability and velocity, they will realize a variety of business benefits, including:
Protection against revenue loss - ITIC’s 2022 Global Server Hardware Security survey indicates that the Hourly Cost of Downtime exceeds $300,000 for 91% of SME and large enterprises. “Overall, 44% of mid-sized and large enterprise survey respondents reported that a single hour of downtime, can potentially cost their businesses over one million ($1 million).”
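Improving customer retention - A study conducted by Profitwell, a BI solutions provider, found that the cost of customer acquisition grew by 60% between 2014 and 2019. As acquiring new customers becomes more expensive, keeping existing customers by delivering reliable services becomes more valuable.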
Avoiding SLA penalties - Companies that provide software services typically publish a list of penalties associated with SLA breaches. The penalties usually increase as the severity of the SLA breach increases.
Increased deployment velocity - As SLO management leads to improved reliability, engineering teams gain the confidence they need to increase the speed of software delivery without fearing revenue-impacting events, like SLA violations or customer churn.
Adopting, Scaling, and Automating SLO Management with Harness SRM
Harness SRM helps companies adopt, scale, and automate SLO management, so they can reap the benefits of improved reliability without suffering the pain points discussed above. With SRM, reliability and engineering teams can more easily collaborate on SLO management with a unified workspace, ensuring that each team works from the same data.
SRM leverages Harness Policy as Code, which lets DevOps teams write policies that provide guardrails within pipelines. This powerful feature also equips reliability engineers to automatically control the software delivery process in the event that SLOs are breached and SLAs are in danger of being violated. All of this automation makes it possible to scale SLO management practices across the largest enterprise organizations.
Here are some key capabilities available in Harness SRM today:
SRE and Developer Self-Service - Intuitive user interface with fine-grained role-based access controls (RBAC) shared by both development and reliability personnel.
Data-Driven SLO Management - Create and manage Service Level Indicators (SLIs), SLOs, and Error Budgets using metrics from almost all solutions across the SDLC, including APM, infrastructure monitoring, and logging.
Change Impact Analysis - SRM structures, correlates, and enriches data to deliver a single unified view with a Service Health score by visualizing which changes are contributing to reliability degradation.
Continuous Reliability Improvement - Automatically resolve uncaught and swallowed exceptions in Java and .NET applications to identify the impact to service health.
Active SLO Management Governance - Add reliability checks and pipeline governance policies based on SLOs and error budgets.
Continuous Verification - Track faulty deployments and other changes that lead to SLO violations with the ability to automatically roll back.
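To make the verification and rollback idea in the list above concrete, here is a minimal Python sketch of a post-deployment check. It is an assumption-laden illustration, not the method Harness SRM uses: the function name, the `max_ratio`, `min_requests`, and `floor` parameters, and their default values are all hypothetical. The check compares the error rate of a new deployment against a baseline and recommends a rollback when the new rate is clearly worse.

```python
def should_roll_back(baseline_errors: int, baseline_total: int,
                     new_errors: int, new_total: int,
                     max_ratio: float = 2.0,
                     min_requests: int = 100,
                     floor: float = 0.001) -> bool:
    """Recommend a rollback when the new error rate is clearly worse.

    All thresholds are assumed example values:
      max_ratio    - how many times the baseline error rate is tolerated
      min_requests - minimum traffic needed before making a decision
      floor        - error rate always tolerated, so a near-zero baseline
                     does not trigger a rollback on a single failure
    """
    if new_total < min_requests:
        return False  # not enough data to judge the new deployment yet
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    new_rate = new_errors / new_total
    threshold = max(baseline_rate * max_ratio, floor)
    return new_rate > threshold
```

A real verification step would likely look at more signals than a single error rate, such as latency and log patterns, but the shape of the decision is the same: gather enough data, compare against a baseline, and act automatically when the threshold is crossed.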