How Build.com Rolls Back Production in 32 Seconds

By Steve Burton
January 16, 2018


Even though Harness isn’t GA yet, we’re very fortunate to have several paying customers and partners. One of them is Build.com, which kindly shared a success story last week about how Harness is helping its team move fast (in production) without breaking things.

Prospects and analysts will sometimes ask, “Yeah, I get Harness – but do you REALLY have customers that automatically deploy, verify and roll back in production?” Yes, actually, we do. Automation sounds scary, but then again so does picking up your mission-critical apps and moving them to a large bookshop in the cloud.

Before I go on, I just want to give a shout-out to Ed, Tim and the team at Build.com, who were early believers in Harness – and, more importantly, big believers in CI/CD and the use of machine learning to automate the verification and rollback of production deployments.

Here is the story of how they’ve managed to automate their deployment pipeline end to end in just a few weeks:

The Deployment Pipeline

The first screenshot below shows Build.com’s actual deployment pipeline, composed of 7 different stages and workflows:

  • Stages 1 & 2 – Dev and Test workflows (execute in parallel)
  • Stages 3 through 5 – Edu/Sandbox/Staging workflows (execute in parallel)
  • Stage 6 – Manual Approval
  • Stage 7 – Production workflow

You can see that the deployment pipeline looked solid right up until Stage 7 where the production workflow failed.
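To make the stage layout concrete, here is a minimal sketch of how a pipeline like this could be modeled: groups run in order, and the stages inside a group run in parallel. This is not Harness code. The PIPELINE structure, the run_pipeline function and the stand-in stage runner are all hypothetical names made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model of the seven stages; grouping follows the list above.
PIPELINE = [
    ["Dev", "Test"],                 # Stages 1-2, run in parallel
    ["Edu", "Sandbox", "Staging"],   # Stages 3-5, run in parallel
    ["Manual Approval"],             # Stage 6
    ["Production"],                  # Stage 7
]


def run_pipeline(groups, run_stage):
    """Run each group in order; stages inside a group run concurrently.

    run_stage(name) -> bool is supplied by the caller (True means success).
    Returns (succeeded, results), where results maps stage name to outcome.
    """
    results = {}
    for group in groups:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            outcomes = list(pool.map(run_stage, group))
        results.update(zip(group, outcomes))
        if not all(outcomes):
            return False, results  # stop at the first failing group
    return True, results


# Stand-in for real workflows: every stage passes except Production.
ok, results = run_pipeline(PIPELINE, lambda name: name != "Production")
print(ok, results)
```

With the stand-in runner, the first six stages succeed and the run stops at Production, which mirrors the outcome described in this post.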

The Failed Production Workflow

Below we can see the failed production workflow in more detail.

The workflow starts off with a canary deployment where phase 1 upgrades 20% of the production environment. Next, Harness marks this new deployment in New Relic and then instantly connects to both New Relic and Sumo Logic to verify the performance and quality of the service/application.
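As a rough illustration of what "phase 1 upgrades 20%" means in practice, the sketch below splits a host list into rollout phases. The canary_phases function, the host names and the assumption that a second phase covers the remaining 80% are all hypothetical; the post only tells us that phase 1 covered 20% of production.

```python
import math


def canary_phases(hosts, fractions=(0.2, 1.0)):
    """Split hosts into rollout phases.

    Each fraction is the cumulative share of hosts that should be upgraded
    once that phase completes, so (0.2, 1.0) means 20% first, then the rest.
    """
    phases = []
    done = 0
    for fraction in fractions:
        target = math.ceil(len(hosts) * fraction)
        phases.append(hosts[done:target])
        done = target
    return phases


hosts = [f"web-{i:02d}" for i in range(1, 11)]  # ten hypothetical hosts
phase1, phase2 = canary_phases(hosts)
print(phase1)  # ['web-01', 'web-02']
```

Verification would run after phase 1, so a bad build only ever touches the first slice of hosts.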

During this verification process, the Harness unsupervised machine learning algorithms start to analyze, compare and flag anomalies/regressions from the thousands of log entries and time-series metrics that both tools capture from the application.

You can see from the screenshot above that application performance was verified successfully but the application quality verification failed. This is normally a sign that Harness observed something unique and unexpected – typically a new event, error, or exception that has been introduced to the service/app.

Continuous Verification with Machine Learning

Prior to Harness, 6 to 7 team leads would spend 60 minutes verifying every production deployment. Now, one engineer can do this job in a matter of minutes.

Below is a screenshot that confirms why the above application quality verification step failed. By analyzing the application log data in Sumo Logic, Harness’s machine learning algorithms were able to detect four new quality regressions. Specifically, Harness detected four new exceptions that had never been observed in any previous deployment.

The grey dots you see in the chart below represent “baseline events” or “clusters” – these are events that Harness has learned over time and classifies as “normal” because they are observed frequently during deployments. The red dots represent unknown events or events that have an unexpected frequency. These are typically the things that bite you in the ass during a production deployment.
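The grey-dot versus red-dot idea can be sketched with plain counting. To be clear, this is not Harness's algorithm: Harness uses unsupervised machine learning to cluster similar log events, while the toy below only does exact-match lookups against a baseline. The flag_anomalies function, the ratio threshold and the sample events are all assumptions made for illustration.

```python
from collections import Counter


def flag_anomalies(baseline_counts, baseline_deploys, current_events, ratio=3.0):
    """Return the events from the current deployment that look abnormal.

    baseline_counts: Counter of event text -> occurrences across past deployments
    baseline_deploys: how many past deployments the baseline covers
    ratio: assumed threshold; an event is flagged when it occurs more than
           `ratio` times its average per-deployment count
    """
    flagged = {}
    for event, count in Counter(current_events).items():
        if event not in baseline_counts:
            flagged[event] = "unknown"  # never seen before
        else:
            expected = baseline_counts[event] / baseline_deploys
            if count > ratio * expected:
                flagged[event] = "unexpected frequency"
    return flagged


# Hypothetical baseline learned from 10 earlier deployments.
baseline = Counter({"cache miss": 40, "retrying request": 10})
current = (
    ["cache miss"] * 5
    + ["retrying request"] * 6
    + ["NullPointerException in CartService"]
)
print(flag_anomalies(baseline, 10, current))
```

Here "cache miss" stays grey because five occurrences is close to its usual four per deployment, "retrying request" is flagged for frequency, and the exception is flagged as unknown.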

Within seconds of detecting these regressions, Harness performed a “Smart Rollback,” taking the service/application back to the last working version (artifact & run-time configuration). It’s worth mentioning that a Smart Rollback can be either fully automated as part of the workflow or controlled via manual intervention (a human).
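The automated-versus-manual choice can be expressed as a small decision function. This is only a sketch under stated assumptions: the Release class, resolve_release and the approve callback are hypothetical names, not part of any Harness API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Release:
    artifact: str  # hypothetical build identifier
    config: dict   # run-time configuration shipped with that build


def resolve_release(
    last_good: Release,
    candidate: Release,
    verification_passed: bool,
    approve: Optional[Callable[[], bool]] = None,
) -> Release:
    """Decide which release should be live after verification.

    approve=None models a fully automated rollback. Passing a callable
    models manual intervention: the rollback happens only if it returns True.
    """
    if verification_passed:
        return candidate
    if approve is None or approve():
        return last_good  # restore both the artifact and its configuration
    return candidate      # a human chose to keep the new version


last_good = Release("app-1.4.2", {"pool_size": 20})
candidate = Release("app-1.5.0", {"pool_size": 40})
active = resolve_release(last_good, candidate, verification_passed=False)
print(active.artifact)  # app-1.4.2
```

Treating the artifact and its configuration as one unit is the point of the sketch: rolling back only the binary while leaving new configuration in place would not restore the last working state.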

The Smart Rollback (in 32 seconds)

If we zoom in at the top of the above workflow, you’ll see visual confirmation that a rollback actually occurred after Phase 1 of the canary deployment failed:

Perhaps the most pleasing aspect of this deployment failure was the time it took to automatically roll back to the previous working version: just 32 seconds. The last time I chatted with Ed, Tim, and the team back in October, they told me that on average it took them 32 minutes to manually roll back a production deployment due to the number of scripts, dependencies, and configuration involved. That makes the automated rollback 60 times faster. Build.com is second only to Home Depot, with well over $500m of eCommerce revenue, so every minute of downtime and rollback counts.

So there you go – proof that today it’s possible to automate your entire CI/CD process end to end. In Build.com’s case, they use Jenkins for Continuous Integration and Harness for Continuous Delivery and Continuous Verification of New Relic and Sumo Logic data.

Better still, no one was hurt or replaced by machine learning in the making of this movie.

Cheers,
Steve.

@BurtonSays

 

 
