This is a story about how one little company saved a shipload of cash on their cloud costs.
At Harness, we know all about cloud costs – and we know all about the potential savings a shop our size could muster if we took a good look at our spend. It would be enough for Smaug to feel a pang of jealousy – Smaug being, in this case, Google Cloud.
There was no “Easy Button” (hint: Continuous Efficiency) for us when we went through this. It was a team effort that involved everyone, from the CFO to the Engineers. We’re happy to walk you through what we did to save 40% on our cloud costs. Yup, you read that right: 40%.
How Did We Stick It to The Man?
Our CFO used to shoulder the responsibility of managing cloud costs, but this was too slow and reactive. That’s not to place blame on the CFO; it’s just a normal side effect of managing cloud costs from the top down.
After thinking about this issue, we decided, hey, wouldn’t it make more sense to empower our own engineers with cost visibility? Yes, yes it would. If our engineers immediately understood the impact of every deployment on our cloud bill, they could control costs proactively. So we embodied the concept of being proactive instead of reactive. We started managing cloud costs on an hourly basis instead of waiting for the cloud bill every 30 days and adjusting/lowkey panicking from there.
It was worth it. That 40% number I quoted above? It resulted in lowering our costs from roughly $100,000 per month to $60,000. Annualized, that’s a savings of $480,000, or three FTEs. We should have done this a long time ago.
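If you want to check the math, here is a minimal sketch using only the figures quoted above. The function name is made up for this example and is not part of any Harness tooling.

```python
def savings_summary(before_monthly: float, after_monthly: float) -> dict:
    """Return monthly, percentage, and annualized savings for two monthly bills."""
    monthly_saving = before_monthly - after_monthly
    return {
        "monthly_saving": monthly_saving,
        "percent_saving": 100 * monthly_saving / before_monthly,
        "annual_saving": 12 * monthly_saving,
    }


# Figures from the text: roughly $100,000 per month before, $60,000 after.
summary = savings_summary(100_000, 60_000)
print(summary)
```

Plug in your own before and after bills to see what a similar reduction would mean over a year.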
Empowering Our Engineers
In order to empower our engineers to control cloud spend from the bottom up, we had to implement a few steps.
- We gave all of our engineers access to cloud costs for their individual apps, microservices, and clusters. This was a level of granularity they’d never seen before. Prior to this, engineers only had access to the pure ‘cloud infrastructure’ view.
- All engineers had to correlate their deployments, autoscaling, and Kubernetes config changes with cloud cost so they could see the financial impact of their actions. This proved to be key in effecting real change.
- Each and every engineering team – and application – was assigned an appropriate cloud budget that alerted them based on actual and forecasted usage for the month. These proactive alerts were received in Slack so they were easy to see.
- Teams met every two weeks to review cloud consumption and spend.
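To make the budget alerts above more concrete, here is a minimal sketch of an alert that looks at both actual and forecasted spend. It assumes a simple linear run-rate forecast, which is our illustration and not necessarily how any particular product forecasts. The team name, the numbers, and both function names are hypothetical.

```python
from __future__ import annotations

import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, today: date) -> float:
    """Linear run-rate forecast: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alerts(team: str, budget: float, spend_to_date: float, today: date) -> list[str]:
    """Return alert messages when actual or forecasted spend exceeds the budget."""
    alerts = []
    forecast = forecast_month_end(spend_to_date, today)
    if spend_to_date > budget:
        alerts.append(f"{team}: actual spend ${spend_to_date:,.0f} is over the ${budget:,.0f} budget")
    elif forecast > budget:
        alerts.append(f"{team}: forecast ${forecast:,.0f} will exceed the ${budget:,.0f} budget")
    return alerts


# Hypothetical team and numbers, for illustration only.
for message in budget_alerts("search-api", budget=5_000, spend_to_date=3_000, today=date(2020, 6, 15)):
    print(message)  # in practice this would be posted to a chat channel
```

The point of the forecast branch is that the team hears about a problem mid-month, while there is still time to act, instead of after the bill arrives.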
Optimizing Our Infrastructure
In addition to the human side, we made some improvements on the infrastructure side. We focused on two categories where we knew we’d see the most impact.
R&D Infrastructure (builds and QA):
- We deleted unused node pools in our Jenkins cluster (used for application builds).
- We deleted unused namespaces in our QA environment.
Harness SaaS Infrastructure (the fun stuff):
- We optimized logging of our microservices so they consumed less storage from Stackdriver logging.
- We downsized the spec of the production cluster (used for our machine learning microservice) that was showing high idle cost.
- We switched non-prod Kubernetes clusters to preemptible instances (in layman’s terms, cheaper VMs that Google Cloud can reclaim at any time and always stops within 24 hours).
- We fixed an auto-scaling configuration issue that was detected in QA for one of our microservices.
- We implemented Kubernetes pod disruption budgets so that a minimum number of pods stayed available while underlying compute nodes were drained and decommissioned.
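For readers who haven’t used pod disruption budgets, here is a rough sketch of what one looks like, built as a plain Python dictionary and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The service name, label, and replica count are hypothetical and not our actual configuration. The sketch uses the `policy/v1` API, which requires Kubernetes 1.21 or later; older clusters used `policy/v1beta1`.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest as a plain dictionary."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            # Keep at least this many matching pods running during voluntary disruptions.
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


# Hypothetical service name and label, for illustration only.
manifest = pod_disruption_budget("search-api-pdb", "search-api", min_available=2)
print(json.dumps(manifest, indent=2))
```

A budget like this limits voluntary disruptions such as node drains. It does not protect against a node disappearing without warning, so it works best alongside enough replicas spread across nodes.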
With these easy-to-implement changes, we saw our cloud costs dramatically decrease – and you could too. As a result of our own cloud cost optimization efforts, we created a new solution that we call Continuous Efficiency. This software enables other organizations to empower their developers and DevOps to control cloud costs from the bottom up, just like we did at Harness.
Where Our Customers Typically Overspend on Cloud Costs
Over time, we’ve been able to get a good grasp on where our customers overspend in the cloud. Here’s a freebie: there’s a trend of engineers deploying microservices using Kubernetes/ECS and overspeccing their cluster infrastructure by 30-40%.
Most of our customers are onboarding 10-20 new microservices a year, so why is this happening? Well, guesswork/safety is a core reason: engineers don’t know what resources they will need in production, so they overestimate. If your cloud costs are skyrocketing, it’s always a good idea to run an audit and figure out the five Ws: what, why, who, where, and when?
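A quick way to start that audit is to compare what each service requests with what it actually uses. The sketch below is a simplified illustration that looks at CPU only and treats cost as proportional to requested cores. The service names and numbers are hypothetical, and a real audit would also consider memory, storage, and traffic peaks.

```python
def idle_fraction(requested_cpu: float, used_cpu: float) -> float:
    """Share of requested CPU that sits unused (0.0 means fully used)."""
    if requested_cpu <= 0:
        raise ValueError("requested_cpu must be positive")
    return max(0.0, 1 - used_cpu / requested_cpu)


def idle_cost(monthly_cost: float, requested_cpu: float, used_cpu: float) -> float:
    """Approximate monthly spend attributable to unused CPU requests."""
    return monthly_cost * idle_fraction(requested_cpu, used_cpu)


# Hypothetical services: (name, monthly cost in dollars, requested cores, peak cores used)
services = [
    ("checkout", 4_000, 16, 10),
    ("search", 2_500, 8, 7.5),
]
for name, cost, requested, used in services:
    share = idle_fraction(requested, used)
    dollars = idle_cost(cost, requested, used)
    print(f"{name}: {share:.0%} idle, about ${dollars:,.0f}/month")
```

Sorting services by idle dollars rather than idle percentage usually points at the biggest wins first.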
Some Real Life Cloud Cost Savings Our Customers Experienced
Relativity saw six-figure annual savings in their first 30 days with Continuous Efficiency by detecting an inefficiency in their Azure Kubernetes cluster that was tied to a new microservice.
GoSpotCheck reduced cloud spend for their search API microservice by 80% after spotting the high idle cost of its provisioned resources.
A retail customer had a $1,600,000 Kubernetes cluster that ran at 85% idle. Within 3 days of fixing the cluster, they saw six-figure savings. This was just one cluster out of hundreds.
Tyler Technologies, Inc. – “At Tyler, we expect our engineering teams to have complete control of their applications and services, and that includes managing their costs. Since our clusters host applications from many different Tyler business units, it was previously very difficult for our finance team to accurately allocate cloud computing costs to the right P&L,” said Jeff Green, Chief Technology Officer at Tyler Technologies, Inc.
“With Harness, that task becomes trivial and we can shift down this responsibility. Continuous Efficiency takes the burden off our finance team and gives our engineers visibility into their costs as well as the tools and insight they need to reduce costs and make their services utilize resources as efficiently as possible.”
We Like Saving on Cloud Costs. Don’t You?
Isn’t it great when companies practice what they preach? We saved a bunch of money and we’d love to help you save bunches of money so that you, too, can be like Smaug sitting on his mountain of treasure and gold. Try Continuous Efficiency today.
Here’s a handy dandy table if you want to see, in more detail, our process change and its impact.
| | Before | After |
|---|---|---|
| Access | CFO, CTO, & DevOps | CFO, CTO, DevOps, Managers, & Engineers |
| Cost Management | Reactive | Proactive & continuous |
| Visibility | Infrastructure only (cloud services). | Infrastructure, applications, microservices, & production/R&D (QA). |
| Budgets | Based on total spend. | Based on granular spend (production/R&D, microservices, & infrastructure). |
| R&D | Reasonably optimized. | Spend reduced significantly without impact on new initiatives. Moved non-critical workloads to preemptible instances, deleted unused nodes & namespaces. |
| Production | Reasonably optimized. | Reduced significantly without impact on performance. Reduced logging, idle spend on ML/AI clusters and right-sized clusters. |
| Committed Usage | 4x. | 3x. CPUs dropped from 1600 to 950 (commitment was 700). |
| Time Spent to Reduce Costs | 1-2 hours per month | 4-5 hours per month |
| Cost Triggers | Based on spikes in usage | Increase in # of developers, investment in new products, & control costs due to COVID. |