Imagine waking up to a shocking spike in your cloud bill, with a huge cost spike in your Google BigQuery costs, and scrambling around to find the root cause of the problem. It is not desirable, but not very uncommon either. We need to first react to the incident and also proactively implement ways to avoid such spikes in the future. In this blog, we look at ways to identify the cost spikes, some guardrails to put in place, and 3 best practices to keep your BigQuery costs in check.
Digging Into BigQuery
Google BigQuery has a pricing model where the user is charged for the amount of data scanned. It is easy to overlook these costs since, traditionally, we pay only for the compute capacity of databases and we can, in theory, run a huge volume of queries. Just to put things in perspective: $15k worth of spend translates to scanning 3000 TB of data. This equates to 10 developers running 10 queries that scan about 30 TB of data. It doesn’t look very unrealistic when we put it that way, does it?
Identifying a cost spike can be challenging, but it is critical because if we don’t, it can snowball into a huge liability at the end of the month when we look at our cloud bill. One approach is to set up budgets and alerts on GCP for each unique combination of project and service, but this becomes painful to manage once we have a large number of projects. We internally don’t need to manage GCP budgets for this use case, since Cloud Cost Management’s Anomaly Detection feature detects any unexpected cost spikes using Machine Learning with notifications streaming into slack.
What can we do to set up guardrails to avoid such cost spikes in future? We can set up custom cost controls for BigQuery. Setting up a project-level and user-level quota can help set a cap on the usage, which resets every day. This can limit a particular user from running too many expensive queries and also overall limit a particular project from exceeding the budget. To avoid huge queries being fired, we can also limit the maximum bytes billed by a single query.
But sometimes, this might not be the optimal solution. When we know our Analysis on BigQuery is going to be extensive, guardrails might occasionally hamper productivity. In such scenarios, we can choose to commit to BigQuery Slots to enjoy a flat-rate pricing. Also, if we do know that the usage behaviour is spiky in nature (eg: we only run our analysis jobs for 1 hour in a day), we can consider buying BigQuery Flex Slots instead of committing for the entire month, where GCP offers the flexibility to reserve compute capacity for as low as 60 seconds. Reserving flex slots for just the duration of the spiky usage can avoid long-term commitments and vendor lock-in while optimising for cost.
3 Cloud Cost Best Practices for BigQuery
Reduce the Amount of Data Scanned
- Use clustering and partitioning to reduce the amount of data scanned.
- Clustering and partitioning can help to reduce the amount of data processed by queries. To limit the number of partitions scanned when querying clustered or partitioned tables, use a predicate filter.
- When running a query against a clustered table, adding filters on the clustered columns can reduce the data scanned.
- When querying partitioned tables, filters on the partitioning column are used to prune the partitions, and therefore can reduce the query cost.
- Don’t run queries to explore or preview table data.
- If experimenting with or exploring data, users can use table preview options to view data for free and without affecting quotas.
- Query only the columns that you need.
- Include only the columns needed for analysis. SELECT * is the most expensive way to Query in BigQuery.
Storage Costs in BigQuery
Data in tables that have not been edited for 90 days (Long-Term Storage) or above are 50% cheaper than Active Storage. In an effort to leverage this, make sure not to edit tables. The best practice hence is to move new data into a new partition, making partitioning of tables important.
There are various expiry options at dataset, table, and partition levels. We typically create tables and forget to clean them up, hence it is recommended to set an expiration for tables and partitions at the time of creation to avoid costs piling up.
If we have Serverless ETL pipeline setup using cloud function or GCP Dataflow, we might have intermediate staging tables, which might not always add much value to the end users. Thus, we should clean up such intermediate tables to avoid storage overhead.
Caching Intelligently and Streaming Costs
We should use streaming inserts only if the data must be immediately available. The reasoning behind this is very simple: Streaming insertion into BigQuery is chargeable and loading data via Batch insertion is free.
BigQuery caches results in a temporary cached results table. The cache is maintained per user and per project. So, running duplicate queries will fetch results from the cache and they are not billed. Also, if multiple microservices are same/similar queries, using the same service account across the microservices can help leverage the cache, since the cache is per user.
At Harness, we use our Cloud Cost Management Business Intelligence dashboards to get deeper visibility into BigQuery costs at a per query, per user level – and surfacing out the top queries and further optimizing them to reduce costs. To find out more about Cloud Cost management, request a demo today.