For Harness’ Custom Dashboards, we use Looker. Looker gives us the flexibility we need to build dashboards at scale. One key area of flexibility is data engineering. With a wide distribution of customers in Harness, we need the flexibility to develop customer-centric solutions to not only expose data for exploration, but also optimize the data experience so metrics are served quickly and accurately. One particular feature of Looker we have exploited for some of our larger customers is aggregate tables. Aggregate tables allow us to create smaller rollup tables materialized in our database in a quick, abstract format with LookML.
In this blog post, we will describe what aggregate tables are, their fundamental utility and Harness’ use case with efficiency gain statistics.
Aggregate tables are a type of Persistent Derived Table (PDT) in Looker. Aggregate tables are Looker’s solution to persisting rollup tables for a large base explore. Just like PDTs, one can engineer a persistence strategy for each aggregate table. All aggregate tables are built upon a base explore, and Looker’s aggregate awareness intelligently selects the smallest aggregate table to satisfy an ad hoc query constructed within this base explore. See below for an example diagram:
Perhaps the utility of aggregate tables will crystallize more with an example: Suppose you have a large base explore giving second-level granularity into sales transactions for a large enterprise. This explore houses sales by customer, product and region. However, you are wanting to create a tile that displays last week’s sales across customers, and this requires parsing through millions and millions of rows just to generate a single value visualization. Not only does this require large table scans, it also takes an exorbitant amount of time to process. And your senior executives need this information. Fast.
One solution would be to “warm” the results by pre-running the query ahead of time for your stakeholders, taking advantage of Looker’s cache. However, this does not scale and is a largely manual process. It certainly isn’t best practice.
Another solution would be to employ a derived table for this use case, rolling up the sales metrics by week. This is better, but it still does not scale well for these use cases. Creating a siloed view would require at minimum a new explore. Thus, any visualizations currently created within your Looker instance would need to be migrated to this new explore. Furthermore, what if your senior executives come back and ask for the same data, but on a month-over-month level, year-over-year level? That would require two more views, two more explores, etc.
Aggregate tables allow for a more scalable solution for these quandaries. In short, aggregate tables allow the data practitioner to derive rollup tables for subsets of the parent explore. See diagram and example pseudocode below:
explore: sales { aggregate_table: _daily_sales { query: { dimensions: [ sale_date, customer, region, product, cost, sale_price, profit ] measures: [total_sale, total_profit, total_cost] filters: [ sale_date: "12 months" ] } materialization: { datagroup_trigger: daily_refresh increment_key: "sale_month" increment_offset: 2 partition_keys: [sale_date, customer] cluster_keys: [region, product] } } aggregate_table: _monthly_sales { query: { dimensions: [ sale_month, customer, region, product, cost, sale_price, profit ] measures: [total_sale, total_profit, total_cost] filters: [ sale_month: "12 months" ] } materialization: { datagroup_trigger: daily_refresh increment_key: "sale_month" increment_offset: 2 partition_keys: [sale_year, customer] cluster_keys: [region, product] } } }
Above is an example of spinning up a daily and monthly aggregate table for the sales explore. In this example, we have isolated the date of sale, customer, region, product, cost, sale price and profit. This is already a much more manageable subset of dimensions. We have also declared a few measures we would like to persist: total sale, total profit and total cost. The aggregate tables are also filtered for the past 12 months.
From a persistence perspective, this aggregate table is utilizing a few different strategies: incremental load, partitioning and clustering. This example is assuming we are persisting the table in BigQuery, but Looker provides other dialect-specific and generalized optimization strategies as well (such as adding indexes). Looker will use this as an optimization guide when persisting the tables in the database. Exploiting these materialization options will add an extra facet to your efficiency gains.
Focusing specifically on the increment strategy, we have set an increment_key on the sale_month and sale_year for the daily and monthly aggregate tables, respectively. (Note: this is assuming the sale dimension is within a date dimension group, which allows for reporting on the different date parts, such as month or year of the respective date.) This is the field by which increments are determined to update the PDT, and it partners with increment_offset. Thus, in this example, the past two months will be rebuilt within the PDT; all preceding dates on increment load will remain static. This allows us to limit the amount of data being queried and updated.
Why is this better than creating a separate derived table and persisting it? Aggregate awareness. Looker’s aggregate awareness finds the most efficient query for each explore. If aggregate tables are declared, aggregate awareness routes the query to that aggregate table within the parent explore. Thus, from a UX perspective, the end user does not notice a difference; they still employ the same explore. From a backend perspective, Looker routes to a PDT connected to the parent explore.
Returning back to the original example, what if we wanted to support year-over-year reporting now? No problem. Just add another aggregate table to the explore. Note how this addition is integrated within the parent explore instead of spinning up a separate, siloed view:
explore: sales { aggregate_table: _daily_sales { query: { dimensions: [ sale_date, customer, region, product, cost, sale_price, profit ] measures: [total_sale, total_profit, total_cost] filters: [ sale_date: "12 months" ] } materialization: { datagroup_trigger: daily_refresh increment_key: "sale_month" increment_offset: 2 partition_keys: [sale_date, customer] cluster_keys: [region, product] } } aggregate_table: _monthly_sales { query: { dimensions: [ sale_month, customer, region, product, cost, sale_price, profit ] measures: [total_sale, total_profit, total_cost] filters: [ sale_month: "12 months" ] } materialization: { datagroup_trigger: daily_refresh increment_key: "sale_month" increment_offset: 2 partition_keys: [sale_year, customer] cluster_keys: [region, product] } } aggregate_table: _annual_sales { query: { dimensions: [ sale_year, customer, region, product, cost, sale_price, profit ] measures: [total_sale, total_profit, total_cost] filters: [ sale_year: "2 years" ] } materialization: { datagroup_trigger: daily_refresh increment_key: "sale_year" increment_offset: 1 cluster_keys: [region, product] } } }
All that was needed was to add another aggregate table block. Note how similar this annual aggregate table is to the monthly or daily aggregate table from a code perspective. The only change we made was pointing to the year rendering of the sale date within the assumed dimension group.
As one can see by this simple example, aggregate tables are powerful. When engineered with forethought of scale, they can support a variety of different use cases and accelerate query speeds dramatically. And that is why we have begun to employ them for some of our larger customers. Let’s dive into how Harness is using aggregate tables internally and explore some efficiency savings we have realized as a result.
Our Cloud Cost Management (CCM) module not only allows one to monitor their cloud costs across multiple clouds, but also provides suggestions for savings and flags anomalies. This product allows one to save up to 70% on your cloud bill with intelligent cloud cost automation. Being such an invaluable tool in this current economic climate, many customers are turning to CCM to streamline their cloud spend and optimize their FinOps.
Especially for our larger customers, this demands large volumes of data to be ingested, analyzed and processed. For some of our larger customers, this was causing noticeable dashboarding friction for both Harness and the customer. Load times were exorbitant, inhibiting access to answers quickly; costs were mutating, as the queries for these large tables were processing TBs of data. We needed a solution that would accelerate dashboard speeds–fast. Enter aggregate tables.
Within a month, we were able to spin up a number of curated aggregate tables for one of our largest customers that satisfied a majority of their use cases, and the results were dramatic.
We saw a radical decrease in dashboard load times, from 10s of minutes in load to less than 10 seconds of load. Not only that, we saw a dramatic decrease in bytes scanned and processed as a result of using aggregate tables. See below for an example of dramatic byte savings:
In this example, we are processing the same query at a fraction of a single percentage–0.096% of the size, or 1000x more efficiently!
The end result was realized cost savings for Harness and much faster dashboard load times for the customer, a win-win.
Aggregate tables are a powerful mechanism to deliver subset PDTs of a parent explore. Engineered correctly, one can craft these to position it to scale to one’s data needs, positioning the practitioner to ensure efficiency and cost-optimization in the data platform, while also enhancing the customer experience in delivering timely, fast visualizations. Harness has and will continue to utilize aggregate tables to eliminate choke points in query processing to continue to provide a best-in-class customer experience in Custom Dashboards.