Build.com Automated CI/CD Rollbacks to 30 Seconds

Build.com removed most of their verification effort. Find out how you can too.

By Dan Lamm
Last updated February 1, 2021

Prospects and analysts will sometimes ask, “Yeah, I get Harness, but do you REALLY have customers that automatically deploy, verify, and roll back in production?” Yes, actually, we do. Automation sounds scary, but then again so does picking up your mission-critical apps and moving them to a large bookshop in the cloud.

Before I go on, I just want to give a shout-out to Ed, Tim, and the team at Build.com, who were early believers in Harness and, more importantly, big believers in CI/CD and the use of machine learning to automate the verification and rollback of production deployments.

Here is the story of how they’ve managed to automate their deployment pipeline end to end in just a few weeks.

The Deployment Pipeline

The first screenshot below shows Build.com’s actual deployment pipeline, composed of 7 different stages and workflows:

  • Stages 1 and 2 – Dev and Test workflows (execute in parallel)
  • Stages 3 through 5 – Edu/Sandbox/Staging workflows (execute in parallel)
  • Stage 6 – Manual Approval
  • Stage 7 – Production workflow
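To make the shape of that pipeline concrete, here is a minimal Python sketch of staged execution with parallel groups. It is an illustration only, not Harness code or configuration: `PIPELINE`, `run_pipeline`, and `example_workflow` are hypothetical names, and the manual approval is modeled as just another step that returns success.

```python
from concurrent.futures import ThreadPoolExecutor

# Groups mirror the stage list above; workflows in the same group run in parallel.
PIPELINE = [
    ["Dev", "Test"],                 # Stages 1 and 2
    ["Edu", "Sandbox", "Staging"],   # Stages 3 through 5
    ["Manual Approval"],             # Stage 6
    ["Production"],                  # Stage 7
]


def run_pipeline(pipeline, run_workflow):
    """Run each group in order and stop after the first group with a failure.

    Returns a list of (workflow_name, succeeded) tuples for everything that ran.
    """
    results = []
    for group in pipeline:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            outcomes = list(pool.map(run_workflow, group))  # order is preserved
        results.extend(zip(group, outcomes))
        if not all(outcomes):
            break
    return results


def example_workflow(name):
    # Hypothetical stand-in: everything succeeds except the production workflow.
    return name != "Production"


results = run_pipeline(PIPELINE, example_workflow)
early_stop = run_pipeline(PIPELINE, lambda name: name != "Test")
```

With the stand-in workflow, all seven steps run and only the last one reports a failure, which is the situation the screenshot describes. If a workflow in an earlier group fails, the later groups never start.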

You can see that the deployment pipeline looked solid right up until Stage 7, where the production workflow failed.

The Failed Production Workflow

Below we can see the failed production workflow in more detail.

The workflow starts off with a canary deployment where phase 1 upgrades 20% of the production environment. Next, Harness marks this new deployment in New Relic and then instantly connects to both New Relic and Sumo Logic to verify the performance and quality of the service/application.

During this verification process, the Harness unsupervised machine learning algorithms start to analyze, compare and flag anomalies/regressions from the thousands of log entries and time-series metrics that both tools capture from the application.

You can see from the screenshot above that application performance was verified successfully but the application quality verification failed. This is normally a sign that Harness observed something unique and unexpected – typically a new event, error, or exception that has been introduced to the service/app.
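The canary phase and its two verification gates can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the host names, `canary_hosts`, and `verify_phase` are hypothetical, and the two checks are hard-coded stand-ins for the New Relic and Sumo Logic analysis rather than real API calls.

```python
def canary_hosts(hosts, percent=20):
    """Pick the first `percent` of hosts for phase 1 (always at least one host)."""
    count = max(1, len(hosts) * percent // 100)
    return hosts[:count]


def verify_phase(checks):
    """Run every named check; the phase passes only if all of them pass."""
    results = {name: check() for name, check in checks.items()}
    return all(results.values()), results


# Hypothetical production fleet of ten hosts.
hosts = [f"web-{i}" for i in range(1, 11)]
phase1 = canary_hosts(hosts)

# Stand-ins for the two verification steps in the failed workflow.
checks = {
    "performance": lambda: True,   # metrics looked fine
    "quality": lambda: False,      # new exceptions showed up in the logs
}
passed, results = verify_phase(checks)
```

Here `phase1` holds two of the ten hosts, and `passed` is false because one failed check is enough to fail the phase, even though performance verified cleanly.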

Continuous Verification with Machine Learning

Prior to Harness, 6 to 7 team leads would spend 60 minutes verifying every production deployment. Now, one engineer can do this job in a matter of minutes.

Below is a screenshot that shows why the above application quality verification step failed. By analyzing the application log data in Sumo Logic, Harness’s machine learning algorithms detected four new quality regressions: four exceptions that had never been observed in any previous deployment.

The grey dots you see in the chart below represent “baseline events” or “clusters” – these are events that Harness has learned over time and classified as “normal” because they are observed frequently during deployments. The red dots represent unknown events or events that have an unexpected frequency. These are typically the things that bite you in the ass during a production deployment.
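The grey-versus-red distinction can be illustrated with a deliberately simple frequency check in Python. To be clear, this is not Harness's algorithm, which is described above as unsupervised machine learning over log clusters. The sketch swaps in a plain lookup against known event counts, and the event names, the `tolerance` multiplier, and `flag_events` are all assumptions made for illustration.

```python
from collections import Counter


def flag_events(baseline, observed, tolerance=3.0):
    """Flag events that are new or far more frequent than the baseline.

    baseline: dict mapping event signature -> typical count per deployment
    observed: iterable of event signatures seen during this deployment
    """
    flagged = {}
    for event, count in Counter(observed).items():
        if event not in baseline:
            flagged[event] = "unknown"                 # never seen before
        elif count > baseline[event] * tolerance:
            flagged[event] = "unexpected frequency"    # known, but far too frequent
    return flagged


# Hypothetical baseline learned from earlier deployments.
baseline = {"ConnectionTimeout": 5, "CacheMiss": 40}

# Hypothetical log events from the canary hosts.
observed = (
    ["CacheMiss"] * 38
    + ["ConnectionTimeout"] * 20
    + ["NullPointerException"] * 2
)

flagged = flag_events(baseline, observed)
```

In this example `CacheMiss` stays a grey dot, while the spike in `ConnectionTimeout` and the never-before-seen `NullPointerException` would both show up red.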

Within seconds of detecting these regressions, Harness performed a “Smart Rollback,” taking the service/application back to the last working version (artifact & run-time configuration). It’s worth mentioning that a Smart Rollback can be either fully automated as part of the workflow or controlled via manual intervention (a human).
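The rollback decision itself is easy to picture in code. The Python sketch below assumes a release is just an artifact plus its run-time configuration, and that failed verification either rolls back automatically or waits for a human. `Release`, `deploy_with_rollback`, and the version strings are hypothetical and do not reflect how Harness implements Smart Rollback.

```python
from dataclasses import dataclass


@dataclass
class Release:
    artifact: str   # e.g. a build or image tag
    config: dict    # run-time configuration deployed with the artifact


def deploy_with_rollback(last_good, candidate, verify, automatic=True,
                         approve=lambda: True):
    """Return (release left running, action taken) after canary verification."""
    if verify(candidate):
        return candidate, "promoted"
    if automatic or approve():
        # Restore both the artifact and the configuration of the last good release.
        return last_good, "rolled back"
    return candidate, "awaiting manual decision"


# Hypothetical releases for illustration.
last_good = Release("storefront:1.41", {"feature_x": False})
candidate = Release("storefront:1.42", {"feature_x": True})

running, action = deploy_with_rollback(last_good, candidate, verify=lambda r: False)
manual = deploy_with_rollback(
    last_good, candidate, verify=lambda r: False,
    automatic=False, approve=lambda: False,
)
```

The key design point is that the last working release is kept as a single unit, so a rollback restores the configuration along with the artifact instead of leaving the two out of sync.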

The Smart Rollback (in 32 seconds)

If we zoom in at the top of the above workflow, you’ll see visual confirmation that a rollback actually occurred after Phase 1 of the canary deployment failed:

Perhaps the most pleasing aspect of this deployment failure was the time it took to automatically roll back to the previous working version: just 32 seconds. The last time I chatted with Ed, Tim, and the team back in October, they told me that on average it took them 32 minutes to manually roll back a production deployment due to the number of scripts, dependencies, and configuration involved. Build.com is second only to Home Depot, with well over $500M of eCommerce revenue, so every minute of downtime and rollback counts.

So there you go – proof that today it’s possible to automate your entire CI/CD process end to end. In Build.com’s case, they use Jenkins for Continuous Integration and Harness for Continuous Delivery and Continuous Verification of New Relic and Sumo Logic data.

Better still, no one was hurt or replaced with the use of machine learning in this movie.
