September 19, 2023

How Quizlet Automates Pull Requests for Cloud Cost Recommendations


This is a guest post by Harness customer Brian Walsh, Sr. Platform Engineer, Quizlet

Managing cloud costs is always a challenge, especially when we need engineers to take time away from other work to look at and take action on cost savings recommendations. Historically, this was also true for my company, Quizlet, which is a global learning platform that provides engaging AI-powered study tools to help people practice and master whatever they are learning. 

Quizlet implemented Harness Cloud Cost Management some time ago for its cost savings automation features. That went well, but we still had a number of recommendations for additional cost savings that weren’t being acted upon. As a Senior Platform Engineer on the Platform team, I was tasked with finding a solution to the problem.

Knowing our organization, we understood that to empower our engineers to take action on cloud costs, we needed to make things as easy as possible for the owners of the cloud services, with as few clicks as possible. This led us to build an automated workflow that creates pull requests for the Harness recommendations, focusing on our microservices as a starting point.

Going Where the Engineers Are

Before I go into detail about the process and how it works, I think it’s important to point out that just creating the automated pull requests in our GitHub repository wasn’t enough to empower the engineers to take action. They don’t live in the repository on a daily basis, so they weren’t seeing the PRs or reviewing them. At the start, we had zero engagement from any of the 10+ engineering teams that used cloud infrastructure for their applications.

The first step I took to increase engagement was to integrate with Slack so that all new PRs were sent to a dedicated Slack channel that the microservices owners were part of. This immediately raised the visibility of the new PRs, and it gave the service owners and the Platform team a place to discuss the impact and implementation of the PRs as needed.

The second step was to present this new workflow to the engineering management team to get their support for the engineers to take action on these PRs as they came in. Between these two actions, we went from zero engagement to 75-80% engagement on the automated PRs! On average, these PRs saved us 40% of our previous spend on the Kubernetes workloads they were applied to.

How the Automation Works

The automation runs once a week, for the development environment services on the first day, and for the production environment services the day after. It’s a Google Cloud Function launched from the Google Cloud Scheduler, architected to integrate with the Harness API (to pull the recommendations), our microservices infrastructure GitHub repository to retrieve and edit the configuration files, and Slack for posting the recommendation PRs to a shared microservices Slack channel.
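The flow described above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `run_weekly_job` and its three callables are hypothetical names standing in for the Harness API client, the GitHub PR logic, and the Slack notifier, none of which are shown in this post.

```python
from typing import Callable, Iterable, List


def run_weekly_job(
    environment: str,
    fetch_recommendations: Callable[[str], Iterable[dict]],
    open_pull_request: Callable[[dict], str],
    notify_slack: Callable[[List[str]], None],
) -> List[str]:
    """Run one scheduled pass for a single environment (e.g. "dev" or "prod").

    All three callables are hypothetical stand-ins injected by the caller:
    - fetch_recommendations: returns recommendations for the environment
    - open_pull_request: edits the config and returns the new PR's URL
    - notify_slack: posts the list of PR URLs to the shared channel
    """
    pr_urls: List[str] = []
    for recommendation in fetch_recommendations(environment):
        pr_urls.append(open_pull_request(recommendation))

    # Send one notification covering every PR created in this run,
    # and stay quiet when nothing was created.
    if pr_urls:
        notify_slack(pr_urls)
    return pr_urls
```

Passing the integrations in as callables keeps the scheduling logic easy to test without touching any external service; the scheduler would invoke this once for the development environment and once, a day later, for production.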

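[Figure: architecture diagram of the automation, with Google Cloud Scheduler launching the Cloud Function that integrates with the Harness API, GitHub, and Slack]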

We’ve configured the automation to pull only those recommendations from the Harness API whose savings exceed a dollar threshold set by the admin. This keeps the engineers focused on recommendations that have a more material impact on costs, so that we strike a balance between cost savings and engineering productivity.
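A minimal sketch of that threshold filter follows. The `Recommendation` class and its `monthly_savings` field are assumptions made for illustration; they are not the shape of the Harness API response.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Recommendation:
    workload: str           # name of the Kubernetes workload
    monthly_savings: float  # estimated savings in dollars (hypothetical field)


def filter_by_savings(
    recommendations: Iterable[Recommendation], threshold: float
) -> List[Recommendation]:
    """Keep recommendations whose savings exceed the threshold, largest first."""
    kept = [r for r in recommendations if r.monthly_savings > threshold]
    return sorted(kept, key=lambda r: r.monthly_savings, reverse=True)
```

Sorting by savings is an extra design choice in this sketch, so that the most valuable PRs are created and announced first.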

Because we have a well-defined repository structure for the microservices, it is easier to automate searching the repo for the configuration files we need to retrieve and edit. The Harness recommendation includes the name of the Kubernetes workload, allowing us to navigate directly to the correct directory for the service. Over time, we’ve come to use both Helm and Kustomize for configuring our microservices, so the automation has to differentiate between the two in order to find the correct config file path and the config values to modify. This added a small bit of complexity that wouldn’t be necessary if we were using a single Kubernetes configuration tool.
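One way to tell the two tools apart is by their marker files: a Helm chart directory contains `Chart.yaml`, while a Kustomize directory contains a kustomization file. The sketch below assumes a hypothetical `services/<workload>/` layout and a hypothetical Kustomize patch file name, `resources-patch.yaml`; the real repository structure is not described in this post.

```python
from pathlib import Path

# File names that Kustomize recognizes as a kustomization file.
KUSTOMIZATION_NAMES = ("kustomization.yaml", "kustomization.yml", "Kustomization")


def detect_config_tool(service_dir: Path) -> str:
    """Return "helm" or "kustomize" based on marker files in the directory."""
    if (service_dir / "Chart.yaml").is_file():
        return "helm"
    if any((service_dir / name).is_file() for name in KUSTOMIZATION_NAMES):
        return "kustomize"
    raise ValueError(f"no Helm or Kustomize config found in {service_dir}")


def config_file_for(repo_root: Path, workload: str) -> Path:
    """Map a workload name to the file holding its resource settings."""
    # Hypothetical layout: one directory per workload under services/.
    service_dir = repo_root / "services" / workload
    if detect_config_tool(service_dir) == "helm":
        return service_dir / "values.yaml"
    # Hypothetical name for the Kustomize patch that sets resources.
    return service_dir / "resources-patch.yaml"
```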

Once the file is retrieved, the recommended changes are written into the YAML configuration file, the file is pushed to GitHub, and a PR is created. The weekly PRs are automatically assigned to the service owners via the infrastructure repository’s CODEOWNERS file, so it’s important to keep this file up to date as ownership changes over time. Once that is done, notifications are sent to Slack for all PRs created during the scheduled job that week.
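The step of writing the recommended values can be sketched as a pure function over the parsed configuration. This assumes the YAML has already been loaded into a dictionary by a YAML library, and that resource settings live under a top-level `resources` key with `requests` and `limits`, which is a common Helm values convention but not necessarily the layout used here.

```python
import copy
from typing import Dict, Optional


def apply_resources(
    values: dict,
    requests: Dict[str, str],
    limits: Optional[Dict[str, str]] = None,
) -> dict:
    """Return a copy of `values` with recommended requests/limits applied.

    The input is left untouched so the caller can diff old against new
    when building the pull request.
    """
    updated = copy.deepcopy(values)
    resources = updated.setdefault("resources", {})
    resources.setdefault("requests", {}).update(requests)
    if limits:
        resources.setdefault("limits", {}).update(limits)
    return updated
```

Only the keys named in the recommendation are changed; every other setting in the file is carried over as-is.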

Reviewing Pull Requests

With the Slack integration, it is very easy to engage with the service owners and answer any questions they may have about the PRs or the potential impacts to prod if implemented. We’ve automated the creation of an easy-to-digest PR description for these recommendations, again to make things as easy as possible for the service owners. Once the PR is reviewed, it gets approved (or rejected) by the service owner, and merged when ready.
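As an illustration of what an easy-to-digest description might contain, here is a hedged sketch that renders a before-and-after table in Markdown, which GitHub displays in PR descriptions. The function name, parameters, and layout are assumptions for this example and not the actual template.

```python
from typing import Dict


def build_pr_description(
    workload: str,
    environment: str,
    current: Dict[str, str],
    recommended: Dict[str, str],
    monthly_savings: float,
    recommendation_url: str,
) -> str:
    """Render a short Markdown PR body comparing current and recommended values."""
    lines = [
        f"Automated resource recommendation for `{workload}` ({environment})",
        "",
        f"Estimated monthly savings: ${monthly_savings:,.2f}",
        "",
        "| Resource | Current | Recommended |",
        "| --- | --- | --- |",
    ]
    for key in sorted(recommended):
        # Show "unset" when the config had no value for this resource.
        lines.append(f"| {key} | {current.get(key, 'unset')} | {recommended[key]} |")
    # Link back to the recommendation so owners can inspect or tune it.
    lines += ["", f"Recommendation details: {recommendation_url}"]
    return "\n".join(lines)
```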

There are a lot of PR and recommendation questions that the engineers ask frequently, so we created an FAQ to answer some of the most common questions, such as:

  • How do I know the recommendation won’t break my application in prod? Nothing is ever guaranteed, but the Quizlet Platform Team has configured the Harness recommendations to, on average, best match the application needs for our organization (i.e. performance-optimized vs savings-optimized). We’ve verified those settings against a variety of applications, though we do recommend using your service owner knowledge to tweak the recommendations and combine them with appropriate horizontal pod autoscaling (HPA) settings to yield optimal results.
  • What if I want the recommendation to be slightly less aggressive? You can always modify the PR with less aggressive CPU and memory settings, and apply those. The PR includes a link to the recommendation in Harness, where our engineers can tune the buffer meter in the right column of the recommendation page to generate proportional CPU and memory recommendation numbers going forward.
  • How do I know the PR will be assigned to the right person? It’s the service owners’ job to make sure that the CODEOWNERS file properly points to the correct group. If that is done, then the PR will be automatically assigned to the correct group. 
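To check who a PR would be assigned to, a service owner could run a lookup like the sketch below against the CODEOWNERS file. It is deliberately simplified: it handles only the `*` catch-all and root-anchored directory patterns such as `/services/api/`, whereas GitHub supports a fuller gitignore-style pattern syntax. The paths and team names in the usage example are hypothetical.

```python
from typing import List


def owners_for_path(codeowners_text: str, path: str) -> List[str]:
    """Return the owners for `path`; as on GitHub, the last matching line wins.

    Simplified matcher: supports only "*" and root-anchored directory
    patterns like "/services/api/".
    """
    normalized = "/" + path.lstrip("/")
    owners: List[str] = []
    for raw_line in codeowners_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        pattern, *line_owners = line.split()
        is_dir_match = (
            pattern.startswith("/")
            and pattern.endswith("/")
            and normalized.startswith(pattern)
        )
        if pattern == "*" or is_dir_match:
            owners = line_owners  # later matches override earlier ones
    return owners
```

For example, with a file containing `* @org/platform` followed by `/services/api/ @org/api-team`, a change to `services/api/values.yaml` resolves to `@org/api-team`, while a change under any other directory falls back to `@org/platform`.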

If You’re Curious to Learn More

We couldn’t make this automation and these new reporting processes work without having the right tools in place, and Harness Cloud Cost Management has really enabled us to do more with our cloud. You can read more about Harness’ Automated Cost Savings features, or talk to Harness sales to get more information on how it’s working for us.
