The main challenge in deployments is validating the health of newly deployed service instances. Before Harness, you had to hook your data up to multiple systems and manually monitor each one for unusual post-deployment activity.
Enter Harness, which introduced the DevOps community to Continuous Verification (CV): machine learning models learn normal application behavior so that anomalies can be flagged in future deployments. I caught up with one of our early customers who adopted our vision for Continuous Delivery along with Continuous Verification. They have successfully used Harness to flag service issues during deployments dozens of times, and have thereby quickly responded to what could have been major outages.
This blog details our recommended strategy for achieving Continuous Delivery with high confidence, ensuring that deployed services work as expected.
The best way to validate your deployments is during Canary analysis. With Canary, you deploy your containers or services in phases, slowly building up to 100% of your cluster. The benefit of Canary is that, early in the deployment, you have multiple versions of the containers or services running simultaneously.
Harness achieves the best results by comparing data from newly deployed service instances to data from service instances already running. This method cannot be applied when you deploy your containers or services to 100% of your cluster at once. Below is an example of a two-phase Canary deployment in Harness, with 50% of the cluster targeted in each phase:
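The phase-by-phase flow above can be sketched as a simple control loop. Everything here (`deploy_to_percent`, `verify_canary`, the phase percentages) is a hypothetical stand-in for real deployment and verification steps, not a Harness API:

```python
# Sketch of a two-phase Canary rollout loop. The deploy/verify helpers
# below are illustrative placeholders, not real Harness functions.

def deploy_to_percent(percent: int) -> None:
    """Placeholder: roll the new version out to `percent` of the cluster."""
    print(f"deploying new version to {percent}% of the cluster")

def verify_canary() -> bool:
    """Placeholder: compare new-instance metrics against old instances."""
    return True  # assume the canary looks healthy in this sketch

def canary_rollout(phases: tuple[int, ...] = (50, 100)) -> str:
    for percent in phases:
        deploy_to_percent(percent)
        # Only the early phases still have old instances to compare against,
        # so verification runs before the rollout reaches 100%.
        if percent < 100 and not verify_canary():
            return "rolled back"
    return "deployed"

print(canary_rollout())  # "deployed", since the placeholder check passes
```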
You need data for Data Science :). Harness can analyze your transactional data, infrastructure metrics, and logs to validate your deployments. We integrate with a number of APM providers, such as AppDynamics, Prometheus, Datadog, and NewRelic, and a number of log aggregators, such as Elasticsearch, Splunk, and Sumo Logic. The customer I mentioned earlier uses NewRelic and Prometheus to instrument their services and collect data.
You set up verification against your APM or log providers on the early Canary phases, when multiple versions are running simultaneously. Here is an example using NewRelic and CloudWatch in phase 1 to decide whether the new containers are working well.
In the image above, our machine learning (ML) models are able to detect a high-risk transaction based on the NewRelic error metric displayed below:
Here is an example of the drill-down view that appears when you click the red circle. You can see which containers are impacted, at a container level, and the metric data that was analyzed.
Note that this is a multiphase deployment, so Harness can learn what is normal using data from the containers that have not yet been updated. This normalizes the result against changes in traffic or your environment, because any such change applies equally to the old and new containers. Any deviations are therefore most likely the result of your code commits or config changes within the container.
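The baseline idea can be illustrated with a deliberately simplified statistical check: treat the still-running old containers as the baseline distribution and flag new-container samples that deviate from it. A z-score test is only a stand-in for Harness's actual ML models, and the sample numbers are invented:

```python
from statistics import mean, stdev

def flag_anomalies(baseline: list[float],
                   canary: list[float],
                   z_threshold: float = 3.0) -> list[float]:
    """Return canary samples that deviate from the baseline distribution.

    baseline: error-rate samples from containers still on the old version
    canary:   error-rate samples from the newly deployed containers
    A z-score is a toy stand-in for Harness's learned models, but it shows
    why comparing against old containers cancels out traffic-wide changes.
    """
    mu, sigma = mean(baseline), stdev(baseline)
    return [x for x in canary if sigma and abs(x - mu) / sigma > z_threshold]

# Old containers see ~1% errors; one new container spikes to 9%.
old_containers = [0.010, 0.012, 0.011, 0.009]
new_containers = [0.011, 0.090]
print(flag_anomalies(old_containers, new_containers))  # [0.09]
```

If traffic doubles cluster-wide, both the baseline and canary samples shift together, so only container-local regressions stand out.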
This particular customer aborts the deployment as part of their Rollback/Intervention strategy. They like to keep the small set of newly deployed containers or services running, direct load toward them, and troubleshoot further. You can also choose to continue with your deployment if the anomaly is something that can be quickly fixed, or to always initiate an immediate rollback.
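The three responses described here can be thought of as a failure-strategy setting attached to the verification step. The enum names below are hypothetical labels for illustration, not Harness configuration values:

```python
from enum import Enum

class FailureStrategy(Enum):
    """Hypothetical labels for the intervention options described above."""
    MANUAL_INTERVENTION = "pause; keep canary taking load and troubleshoot"
    IGNORE = "continue the deployment despite the anomaly"
    ROLLBACK = "initiate an immediate rollback"

def on_verification_failure(strategy: FailureStrategy) -> str:
    """Return the action a pipeline would take when verification flags risk."""
    return strategy.value

# The customer in this post pauses with the canary still serving traffic:
print(on_verification_failure(FailureStrategy.MANUAL_INTERVENTION))
# pause; keep canary taking load and troubleshoot
```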
Canary Verification Challenges
The following verification challenges can occur with Canary deployments:
- Can’t run multiple versions at the same time. This usually happens due to backward compatibility issues, upstream/downstream dependencies, or stateful services
- Logs/metrics not collected for the service
- Collected data not stamped with the container-level metadata. Instead, there is only service or application-level metadata
- Don’t want the extra 15 minutes of verification added to each of the initial phases; verification might take more time than you’d like for a phase of your deployment
If you have to use a different deployment strategy like Blue/Green, don’t despair: Harness has a Blue/Green validation strategy second only to Canary analysis. I will cover it in a future blog post.