June 19, 2023

Continuous Verification - Machine Learning to Safeguard Your Deployments

The only constant in technology is change. As technical professionals, we innovate by iterating, yet every change carries risk: any time you modify a system, something can break. This is the paradox of build versus operate, the push and pull between making changes and limiting them in the name of stability. Thanks to advancements in Site Reliability Engineering [SRE] practices, we can keep moving forward, because treating reliability as a science makes it safer to iterate more often.

Your deployment is the culmination of all the changes that have been made, so finding a regression can be a needle-in-a-haystack problem. Harness can watch for regressions during your deployments by leveraging machine learning (ML). Continuous Verification validates your deployments against the health sources you wire in and looks for regressions as they begin to trend, which lets you safeguard your deployments systematically. Let’s look at the ways Harness validates your deployments, doing the math on your behalf regardless of your deployment strategy.

Incremental Deployments - Understanding Normalcy

Incremental deployment strategies are designed for safety and to limit the blast radius in case of a regression. Deployment/release strategies such as a canary release have become very popular. However, the more increments you have, the more deployments are needed to support them, so the judgment call, or checkout phase, of validating the canary occurs more frequently. If the canary phase ends in some sort of absolute failure, the decision to stop the canary and roll back [or roll forward, depending on your organization] is easy.

Judgment calls are harder to make when, at any point in time, the data from the stable and canary versions looks almost the same. Understanding normalcy can be difficult, and determining what is regression versus what is normal sometimes needs a lot of justification. As with any change in a system, the judgment call is about how much regression, if any, has occurred. These calls have to be made more regularly with incremental deployment strategies.

Consider This Scenario

Take a look at the graph below of average response times [ARTs] comparing version one [stable] and version two [new, a canary for example] of a service. At a cursory glance, the two are pretty similar.

Average Response Time Graph

In the above, there is no smoking-gun event or type of failure that would put the newest version, V2, into an absolute failure, e.g., the blue line going off the chart. Determining normalcy or a baseline can also be challenging, as usage or traffic can spike up or down. Key to Harness’s analysis, which is done systematically on your behalf, is taking into account factors such as the distance between the two graphs and whether that distance is trending towards or away from regression, before a regression actually occurs.
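To make the idea of a trending gap concrete, here is a minimal sketch in Python. It is not Harness's actual algorithm; it simply fits a least-squares line to the per-interval gap between the canary and stable response times. The function name `gap_slope` and the sample values are hypothetical and used only for illustration.

```python
def gap_slope(stable, canary):
    """Least-squares slope of the (canary - stable) gap over time.

    A positive slope means the canary is drifting away from the stable
    version; a slope near zero means the gap is holding steady.
    """
    if len(stable) != len(canary) or len(stable) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    gaps = [c - s for s, c in zip(stable, canary)]
    n = len(gaps)
    x_mean = (n - 1) / 2            # time steps are 0, 1, ..., n-1
    y_mean = sum(gaps) / n
    num = sum((x - x_mean) * (g - y_mean) for x, g in enumerate(gaps))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den


# Hypothetical average response times in milliseconds, one per interval.
stable = [100, 100, 100, 100, 100]
canary = [100, 101, 102, 103, 104]
print(gap_slope(stable, canary))  # gap grows by 1 ms per interval
```

A slope like this says nothing about whether the gap is already a problem; it only indicates the direction the canary is heading, which is the early signal the paragraph above describes.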

Harness Verification Analysis

When validating a system, two of the pillars of observability are metrics and logs. Metrics, such as the ART above, can be plotted on a graph. Logs present a different set of challenges. Harness can ingest metric and log data from a myriad of providers, then correlate and analyze deployment events for regression.

Metric Analysis

Core to metric-based analysis is finding the deviation between the before and the after; in mathematical terms, this is the deviation between two graphs over time.

Analyzing Deployment Metrics

Harness scores the distances between pieces of time series data. A human looking at two graphs can easily spot a wide deviation. If the graphs are similar, as in the scenario above, calculating the distance takes time, and accounting for differences in standard deviation adds to the complexity of the calculation. While a deployment is in progress, a system can make these calculations and take in multiple data points, usually faster than a human, which is exactly what Continuous Verification does. This is especially relevant for log-level analysis.
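As a rough illustration of scoring distance between two time series, the sketch below expresses each point-by-point gap in units of the control series' standard deviation. This is an assumption about how such a score could be built, not a description of Harness's implementation; `deviation_scores` and `mean_deviation` are hypothetical names.

```python
from statistics import mean, stdev


def deviation_scores(control, test):
    """For each time step, how many control standard deviations
    the test value sits away from the control value."""
    if len(control) != len(test):
        raise ValueError("series must be the same length")
    sigma = stdev(control) or 1e-9  # guard against a perfectly flat control
    return [(t - c) / sigma for c, t in zip(control, test)]


def mean_deviation(control, test):
    """Average absolute deviation across the whole window."""
    return mean(abs(score) for score in deviation_scores(control, test))


# Hypothetical response times in milliseconds.
control = [100, 102, 98, 101, 99]
test = [101, 103, 100, 101, 100]
print(round(mean_deviation(control, test), 2))
```

Normalizing by the standard deviation matters because a 2 ms gap is noise for a service whose response time naturally swings by 10 ms, but meaningful for one that normally varies by 1 ms.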

Log Event Analysis

Log data, when compared to time series data, is unstructured. As an engineer, you are typically looking for the presence of events such as “fatal” or the absence of events such as “success” in the log streams. Even seasoned engineers might not be able to explain every logging statement included in the logs.

Analyzing Deployment Log Events

Since logs can be seen as a system of record for the state of the system, they can be very verbose. Harness clusters similar events and measures the distance between these events to find the ones that are anomalous. As part of its processing steps, Harness analyzes the content of log messages and flags an anomaly if there is an increase in how frequently those messages appear.
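The sketch below shows one simplified way to group similar log events and compare their frequency between a baseline and a new deployment. It stands in for the clustering described above by masking numbers so that messages differing only in IDs or durations collapse into the same template; Harness's real clustering is not documented here and is likely more sophisticated. The function names, the `ratio` threshold, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter


def template(message):
    """Mask digits so messages that differ only in numbers group together."""
    return re.sub(r"\d+", "<num>", message.lower()).strip()


def log_anomalies(control_logs, test_logs, ratio=2.0):
    """Flag templates that are new in the test logs, or that appear
    at least `ratio` times as often as in the control logs."""
    control = Counter(template(m) for m in control_logs)
    test = Counter(template(m) for m in test_logs)
    anomalies = {}
    for tpl, count in test.items():
        baseline = control.get(tpl, 0)
        if baseline == 0:
            anomalies[tpl] = "new"
        elif count / baseline >= ratio:
            anomalies[tpl] = "more frequent"
    return anomalies


control_logs = [
    "Request 12 served in 30 ms",
    "Request 13 served in 28 ms",
    "Timeout calling db after 500 ms",
]
test_logs = [
    "Request 40 served in 31 ms",
    "Timeout calling db after 500 ms",
    "Timeout calling db after 750 ms",
    "Fatal: out of memory",
]
for tpl, reason in log_anomalies(control_logs, test_logs).items():
    print(f"{reason}: {tpl}")
```

Comparing raw counts assumes both log windows cover a similar amount of traffic; with uneven windows you would compare rates instead.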

Processing Steps

At a high level, these are the steps Harness takes to find an anomaly.

  • Query/fetch data based upon your inputs. 
  • Split the data between control [baseline, previous] and test [new] buckets for comparisons.
  • Cluster related events. 
  • Calculate the distance between the data points to calculate deviation. 
  • Calculate risk score for determination of an anomaly.

The sensitivity you select relates to the number of standard deviations that are considered acceptable. The best way to see Continuous Verification in action is to take CV for a spin yourself.
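The steps above can be sketched end to end for a single metric. This is a simplified illustration under stated assumptions, not Harness's implementation: the data is already fetched, samples are tagged with a version label, the risk score is the shift in the test mean measured in control standard deviations, and the `SENSITIVITY_TO_SIGMAS` mapping is invented for the example.

```python
from statistics import mean, stdev

# Hypothetical mapping from a sensitivity setting to the number of
# standard deviations tolerated before a deviation counts as an anomaly.
SENSITIVITY_TO_SIGMAS = {"high": 1.0, "medium": 2.0, "low": 3.0}


def split_samples(samples, control_version):
    """Split (version, value) samples into control and test buckets."""
    control = [v for version, v in samples if version == control_version]
    test = [v for version, v in samples if version != control_version]
    return control, test


def risk_score(control, test):
    """Shift of the test mean, in units of control standard deviation."""
    sigma = stdev(control) or 1e-9  # guard against a flat baseline
    return abs(mean(test) - mean(control)) / sigma


def verify(samples, control_version, sensitivity="medium"):
    control, test = split_samples(samples, control_version)
    score = risk_score(control, test)
    return {
        "risk_score": score,
        "anomaly": score > SENSITIVITY_TO_SIGMAS[sensitivity],
    }


# Hypothetical response times (ms) tagged by version.
samples = [("v1", 100), ("v1", 102), ("v1", 98), ("v1", 101), ("v1", 99),
           ("v2", 110), ("v2", 111), ("v2", 109)]
print(verify(samples, control_version="v1"))
```

In this sketch a higher sensitivity tolerates fewer standard deviations, so the same deployment can pass at "low" and fail at "high".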

Taking The Next Step with Continuous Verification

To get started with Harness Continuous Verification, check out the tutorial on Harness Developer Hub, which walks through verifying a Kubernetes deployment using Prometheus. Since “reliability is everyone’s responsibility”, adding a Verification Step to your Harness Pipeline is a prudent step, and it is included as part of your Harness CD subscription (Sign Up Here). Take a look at the tutorial and get further on the reliability journey today.
