Harness 24/7 Service Guard Empowers Developers with Total Operational Control

Continuous Delivery requires developers to see the impact of their production deployments. With 24/7 Service Guard, they now get total operational visibility.

By Steve Burton
December 13, 2018

When Harness came out of stealth, we entered the Continuous Delivery market with several unique capabilities. Our Smart Automation helped customers build deployment pipelines in minutes, and our Continuous Verification helped developers automate the verification and rollback of their deployments. That’s basically how we’ve helped our first 35 customers move fast without breaking things.

Today we’re announcing 24/7 Service Guard, which is basically Continuous Verification on steroids.

Harness 24/7 Service Guard is like developers having a dedicated bodyguard watching their production apps 24/7. If something bad happens, it automatically rolls back code changes and protects them.

Why 24/7 Service Guard?

Our initial Continuous Verification was focused on deployments or canary phases, analyzing the performance/quality of new code during the first 15-20 minutes of its life. Customers could customize this verification duration but it was always finite in scope.

This capability was great at catching two-thirds of performance anomalies and quality regressions, because most failures appear within minutes of a deployment.

24/7 Service Guard was created to catch the anomalies and regressions that surface many hours after a new deployment. Deployments are sometimes done out of hours, when little traffic is hitting the app, or specific functionality in the app might not be accessed or stressed by users right away.

At the same time, our customers were struggling with monitoring tool fatigue. They had one of everything to monitor different aspects of their application. In a microservices world, a customer could have tens of microservices with tens of different monitoring tools, logs, and instrumentation. Unifying these data sets is a huge challenge for developers and teams.

A recent Gartner poll of 220 customers highlights this problem:

[Image: Tool fatigue]

Catching post-deployment issues and reducing tool fatigue are why we created 24/7 Service Guard. We want to give developers total operational visibility of their production apps across all tools, and protect them when they aren’t looking.

Powered By Unsupervised Machine Learning

Like Continuous Verification, our 24/7 Service Guard sits on top of all your APM, monitoring and log tools. However, we’ve modified our unsupervised machine learning significantly to scale for 24/7 data streams.

We’re still using the core algorithms such as Symbolic Aggregate approXimation (SAX) and Hidden Markov Models, but we’re also applying entropy and several new neural nets so we can continuously learn and detect the unknown unknowns as well as reduce false positives. Watch this webinar if you want a tech deep dive on our AI/ML.
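For readers who haven't met SAX before, here is a minimal sketch of the textbook algorithm in Python. It is not Harness's implementation; the function name `sax`, the four-letter alphabet, and the requirement that the series length divide evenly into segments are all assumptions made for illustration. The idea is to turn a noisy metric stream into a short string, so that unusual behavior shows up as unusual words.

```python
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word (one letter per segment).

    Illustrative sketch only: assumes len(series) is divisible by n_segments.
    """
    # 1. Z-normalize so the breakpoints below apply to any metric scale.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(x - mu) / sigma for x in series]

    # 2. Piecewise Aggregate Approximation: average each equal-width segment.
    width = len(series) // n_segments
    paa = [mean(normalized[i * width:(i + 1) * width]) for i in range(n_segments)]

    # 3. Breakpoints that cut the standard normal into equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # 4. Map each segment mean to the letter of the region it falls in.
    word = ""
    for value in paa:
        index = sum(1 for b in breakpoints if value > b)
        word += alphabet[index]
    return word


# Response times that jump from ~1 to ~10 halfway through the window.
print(sax([1, 1, 1, 1, 10, 10, 10, 10], n_segments=2))  # ad
```

A steady metric produces the same few words over and over, so a word the model has rarely seen is a cheap signal that something changed.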

Harness uses Harness for Continuous Delivery, so we’ve been battle-testing 24/7 Service Guard for some time and refining its accuracy for several weeks.

Unifying APM, Log and Observability Data

Simply add one or more of your monitoring tools to Harness in minutes by registering each tool’s URL, API/webhook, and login credentials.

Next, for each application in Harness add the verifications you want for each environment (dev, QA, staging, production) and Harness will figure out the rest.

Once set up, click on the top Continuous Verification navigation tab and you’ll see something like this:

[Image: service_guard_full2]

We can see above that 24/7 Service Guard is protecting the Web Online Application and is observing 4 monitoring sources (AppDynamics, Datadog, Splunk and Prometheus) for the production environment. You will see the same view for every application and environment you enable 24/7 Service Guard for.

At a glance, developers can now observe the health of any service in any environment for any monitoring tool in seconds.

Excuse me for a second, but that’s pretty badass (and I haven’t even got to the best bits yet).

Users can select from several time resolutions: 12 hours, 1 day, 7 days and 30 days.

Based on the data that 24/7 Service Guard is observing from each monitoring tool, it will paint a heatmap of service health for each time slice square. It will also show and correlate any deployments so users get full operational visibility.
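To make the heatmap idea concrete, here is a rough sketch of how anomaly scores could be bucketed into time slices and colored. Everything in it is hypothetical: the function name `health_heatmap`, the 0-to-1 score scale, and the `warn`/`critical` thresholds are assumptions for illustration, not how 24/7 Service Guard computes its squares.

```python
from collections import defaultdict


def health_heatmap(events, slice_seconds, warn=0.5, critical=0.8):
    """Group (timestamp, score) events into time slices and color each slice.

    events: iterable of (unix_timestamp, anomaly_score), score in [0, 1].
    Returns a list of (slice_start, color) tuples sorted by time.
    Thresholds are illustrative assumptions.
    """
    # Keep the worst (highest) score observed in each slice.
    worst = defaultdict(float)
    for timestamp, score in events:
        slice_start = int(timestamp // slice_seconds) * slice_seconds
        worst[slice_start] = max(worst[slice_start], score)

    def color(score):
        if score >= critical:
            return "red"
        if score >= warn:
            return "yellow"
        return "green"

    return [(start, color(worst[start])) for start in sorted(worst)]


# One-hour slices over a handful of scored observations.
events = [(0, 0.1), (100, 0.6), (3600, 0.9), (7300, 0.2)]
print(health_heatmap(events, slice_seconds=3600))
# [(0, 'yellow'), (3600, 'red'), (7200, 'green')]
```

Taking the worst score per slice, instead of the average, is a deliberate choice in this sketch: a short spike still turns the square red and doesn't get averaged away.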

Understanding Service & Business Impact

Traffic lights are an easy way to understand service health at a high level.

With 24/7 Service Guard, developers can drill down beyond traffic lights into the business impact of a service in one click.

For example, if we click on the red time-slice highlighted below, we immediately see the business transactions in AppDynamics that are impacted along with related anomalies/regressions highlighted in red.

We can see below that the transaction /online/Payment is experiencing high response times:

[Image: service_guard_impact2]

The insight this drill-down provides depends on the type of monitoring tool behind the data. For example, if Datadog were showing a service impact, 24/7 Service Guard would show the cloud infrastructure resource metrics that are anomalous. If Splunk were showing an impact, it would show the errors or exceptions that are causing the regressions, and so on.

Think of the above as an easy way for a developer to immediately understand what is going on across their monitoring data sets.

Drill-Down To Root Cause With Context

24/7 Service Guard doesn’t stop there; it gets better 🙂

Harness also provides contextual drill-down that takes the developer from Harness into their monitoring tools in the context of the metric or event they are troubleshooting.

In the above example, a developer can click on the red /online/Payment transaction and it will take them directly into the AppDynamics UI for the specific transaction and time period that was anomalous.

In one click you can go from this:

[Image: trx]

to this:

[Image: call_stack]

24/7 Service Guard will do the same for all your favorite APM and Log tools.

This capability gives developers a unified view of their monitoring tools/data, and allows them to take a shortcut to the root cause in just a few clicks. Sounds simple, but it’s extremely powerful.

Automatically Roll Back Code Changes

The final part of 24/7 Service Guard is its ability to automatically roll back code changes (if needed) when the developer isn’t looking.

For example, let’s imagine a developer performed a production deployment using Harness at 3 p.m. on a Friday. After 10-15 minutes of verification, everything from a performance and quality perspective looks good in Harness. Shortly after, the developer hits the bar and proceeds to drink 5 pints of Guinness. As the 5th pint of Guinness goes down, the application starts to grind to a halt. 24/7 Service Guard detects this performance anomaly, and as a precaution automatically rolls back the application to its last working state.

The developer wakes up the next day with a hangover and notices an alert from Harness from the night before. In one click, the developer launches Harness in the context of the performance anomaly and identifies which transactions were responsible, along with links to the root cause of the performance issue inside their APM tool.
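The Friday-night scenario above boils down to a simple decision rule, sketched here in Python. This is an illustration under stated assumptions, not Harness's logic: the function name `guard`, the 0-to-1 anomaly score, the `threshold` and `consecutive` parameters, and the idea that the previous deployment is the "last working state" are all hypothetical.

```python
def guard(score_stream, deployments, threshold=0.8, consecutive=3):
    """Return the version to roll back to, or None if the service stays healthy.

    score_stream: iterable of anomaly scores in [0, 1], one per check interval.
    deployments: list of versions, oldest first; the last entry is live.
    Assumes (for illustration) that the previous deployment was healthy.
    """
    breaches = 0
    for score in score_stream:
        # Count back-to-back breaches; any healthy reading resets the count.
        breaches = breaches + 1 if score >= threshold else 0
        if breaches >= consecutive and len(deployments) > 1:
            return deployments[-2]  # version that ran before the live one
    return None


# Healthy during verification, then the app grinds to a halt hours later.
scores = [0.1, 0.2, 0.1, 0.9, 0.95, 0.97]
print(guard(scores, ["v41", "v42"]))  # v41
```

Requiring several consecutive breaches before acting is one way to keep a single noisy reading from triggering a rollback nobody wanted.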

Supported Applications and Tools

Harness supports both non-container and containerized applications across all cloud providers and bare-metal data center infrastructure.

We currently support AppDynamics, New Relic, Dynatrace, Datadog and Prometheus for APM and time-series metrics. We also have an API to support custom time-series data.

We also support Splunk, Elastic/ELK, Sumo Logic, Bugsnag and Logz.io for Log and event data.

Sign up for your free trial of Harness today and give 24/7 Service Guard a shot.
