In data-intensive applications, query timeouts and high storage costs can pose significant challenges, particularly when dealing with billions of rows of data. For instance, Harness dashboards (currently powered by Looker) such as Unit Test Metrics were backed by a table containing over 4 billion rows, which often led to sluggish queries and elevated operational costs. In this blog post, we share our journey from slow query performance to rapid dashboard loading times.
The Problem: Slow Queries on a Massive Dataset
Our dashboards serve multiple customers by presenting test results filtered by failure rates and additional parameters such as pipeline_identifier and project_identifier. The database that supports these dashboards is a PostgreSQL-based TimescaleDB instance. Over the course of six months, the Test_Results table expanded to over 4 billion rows, consuming 3TB of storage. Because of the high volume of tests executed, our users experienced query delays of 10-15 minutes and frequent timeouts. An internal analysis pointed to the sheer number of raw rows scanned and aggregated on the fly for every dashboard query as the root cause.
The Solution: Aggregated Tables for Performance Optimization
To improve query performance, we introduced aggregated tables that hold pre-computed summaries. The data is rolled up at two granularities:
- Weekly aggregates, stored in the weekly_test_summary table.
- Daily aggregates, stored in the daily_test_summary table.
These tables are keyed by parameters such as suite_name, test_name, and class_name, with counts broken down by result. This aggregation dramatically reduces the number of rows each query has to scan, thereby accelerating response times.
Note: The result of a test can be passed, failed, or skipped.
Example Schema
Test_Results Table
- created_at (timestamp)
- account_id (text)
- build_id (text)
- name (text)
- suite_name (text)
- class_name (text)
- result (result_t)
daily_test_summary Table
- uuid (uuid)
- aggregation_date (date)
- test_name (text)
- suite_name (text)
- class_name (text)
- total_count (integer)
- failed_count (integer)
- id (uuid)
weekly_test_summary Table
- uuid (uuid)
- aggregation_week (date)
- test_name (text)
- suite_name (text)
- class_name (text)
- total_count (integer)
- failed_count (integer)
- id (uuid)
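For concreteness, the daily rollup table corresponds roughly to DDL like the following (a minimal sketch; the primary key, default, and NOT NULL constraints are assumptions, not the exact production schema):

-- Sketch of the daily rollup table; weekly_test_summary is identical except that
-- aggregation_date is replaced by aggregation_week
CREATE TABLE daily_test_summary (
    id               uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- assumed default
    uuid             uuid NOT NULL,   -- join key to pipeline_details
    aggregation_date date NOT NULL,
    test_name        text NOT NULL,
    suite_name       text NOT NULL,
    class_name       text NOT NULL,
    total_count      integer NOT NULL,
    failed_count     integer NOT NULL
);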
Optimized Query Architecture: Parallel CTEs and Indexing
To query data that is now spread across three tables (pipeline_details, daily_test_summary, and weekly_test_summary), we developed an optimized query architecture:
- Parallel CTEs: the query is split into independent CTEs for weekly data, daily data, and today's data (the last of which is read directly from the Test_Results table).
- Indexing: by adding indexes on uuid, account_id, and aggregation_date, we achieved a remarkable 95% improvement in query performance, albeit with some additional storage overhead for the aggregated tables.
The final dashboard query integrates data from all three tables, producing a cohesive result set. By utilizing parallel execution and effective indexing, we reduced query times from 15 minutes to just 15 seconds, even when processing several months' worth of data.
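As a rough sketch, the supporting indexes look something like the following (index names and exact column placement are illustrative assumptions rather than the production definitions):

-- Speed up the join from the summary tables to pipeline_details and the date-range filters
CREATE INDEX idx_daily_summary_uuid ON daily_test_summary (uuid);
CREATE INDEX idx_daily_summary_date ON daily_test_summary (aggregation_date);
CREATE INDEX idx_weekly_summary_uuid ON weekly_test_summary (uuid);
CREATE INDEX idx_weekly_summary_week ON weekly_test_summary (aggregation_week);
-- Speed up per-customer filtering
CREATE INDEX idx_pipeline_details_account ON pipeline_details (account_id);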
Data Migration and Daily Aggregation Jobs
To facilitate a seamless transition, we ran a one-time backfill job that consolidated historical data from the Test_Results table into the newly created aggregated tables. To keep those tables up to date, we implemented a daily aggregation job in a separate application (our test management service) that automatically processes the latest data every 24 hours and refreshes the aggregated tables.
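A simplified sketch of that daily rollup is shown below; it assumes each Test_Results row can be mapped to a pipeline_details row (the real join key between the two tables is not shown in the schemas above, so build_id is used here purely for illustration):

-- Roll up yesterday's raw test results into the daily summary table
-- (the id column is assumed to be filled in by a column default)
INSERT INTO daily_test_summary
    (uuid, aggregation_date, test_name, suite_name, class_name, total_count, failed_count)
SELECT
    pd.uuid,
    tr.created_at::date AS aggregation_date,
    tr.name AS test_name,
    tr.suite_name,
    tr.class_name,
    COUNT(*) AS total_count,
    -- same failed-test condition as the dashboard query below
    COUNT(*) FILTER (WHERE tr.result IN ('failed', 'error')) AS failed_count
FROM Test_Results tr
JOIN pipeline_details pd ON pd.build_id = tr.build_id  -- hypothetical join key
WHERE tr.created_at >= CURRENT_DATE - INTERVAL '1 day'
  AND tr.created_at < CURRENT_DATE
GROUP BY pd.uuid, tr.created_at::date, tr.name, tr.suite_name, tr.class_name;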
Final Optimized Query Logic in Looker
Data is retrieved from each CTE for its specific time range. The range for weekly_data spans from the first Monday to the last Sunday of the selected period and is sourced from weekly_test_summary. The remaining full days are drawn from daily_data, backed by daily_test_summary, while today_data is obtained from the Test_Results table. To ensure optimal performance for our Looker dashboards, we implemented a derived table built on these optimized CTEs, used Looker liquid logic for conditional querying, and aggregated data from the pre-aggregated tables, merging everything into a single result.
derived_table: {
sql:
WITH date_range AS (
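-- start_date / end_date: the dashboard's date filter, capped at today
-- week_start_date / week_end_date: the first Monday and last Sunday that fall entirely
--   inside the range; these full weeks are served by weekly_test_summary, and the
--   leftover days fall back to daily_test_summary
-- include_today: 1 when the range extends to today, so today's not-yet-aggregated
--   rows are pulled from Test_Results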
SELECT
{% date_start date_range %}::date AS start_date,
LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE) AS end_date,
DATE_TRUNC('week', {% date_start date_range %}::date + INTERVAL '7 days') AS week_start_date,
DATE_TRUNC('week', LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE)) - INTERVAL '1 day' AS week_end_date,
CASE WHEN LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE) >= CURRENT_DATE THEN 1 ELSE 0 END AS include_today
),
-- Getting data from weekly aggregate table
weekly_data AS (
SELECT
wts.suite_name,
wts.class_name,
wts.test_name,
SUM(wts.total_count) AS total_count,
SUM(wts.failed_count) AS failed_count
FROM
weekly_test_summary wts
JOIN
pipeline_details pd ON wts.uuid = pd.uuid
CROSS JOIN
date_range  -- needed so date_range columns can be referenced in the WHERE clause
WHERE
wts.aggregation_week BETWEEN date_range.week_start_date AND date_range.week_end_date
AND pd.account_id = {{ _user_attributes['account_id'] }}
GROUP BY
wts.suite_name, wts.class_name, wts.test_name
),
-- Getting data from daily aggregated table
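-- (days already covered by a full Monday-Sunday week in weekly_data are excluded below to avoid double counting)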
daily_data AS (
SELECT
dtd.suite_name,
dtd.class_name,
dtd.test_name,
SUM(dtd.total_count) AS total_count,
SUM(dtd.failed_count) AS failed_count
FROM
daily_test_summary dtd
JOIN
pipeline_details pd ON dtd.uuid = pd.uuid
CROSS JOIN
date_range  -- needed so date_range columns can be referenced in the WHERE clause
WHERE
dtd.aggregation_date BETWEEN date_range.start_date AND date_range.end_date
AND dtd.aggregation_date NOT BETWEEN date_range.week_start_date AND date_range.week_end_date
AND pd.account_id = {{ _user_attributes['account_id'] }}
GROUP BY
dtd.suite_name, dtd.class_name, dtd.test_name
),
-- Getting data from Test_Results for today's data
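-- (today's rows are not yet in the aggregate tables, so they are read from the raw table)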
today_data AS (
SELECT
e.suite_name, e.class_name, e.name AS test_name,
COUNT(*) AS total_count,
COUNT(CASE WHEN e.result IN ('failed', 'error') THEN 1 ELSE NULL END) AS failed_count
FROM
Test_Results e
WHERE
e.created_at >= CURRENT_DATE
AND e.account_id = {{ _user_attributes['account_id'] }}
GROUP BY
e.suite_name, e.class_name, e.name
)
SELECT
suite_name, class_name, test_name,
SUM(total_count) AS total_count, SUM(failed_count) AS failed_count
FROM (
-- for aggregating data from all 3 CTE's
SELECT * FROM weekly_data
UNION ALL
SELECT * FROM daily_data
UNION ALL
SELECT * FROM today_data WHERE (SELECT include_today FROM date_range) = 1
) combined_data
GROUP BY suite_name, class_name, test_name
;;
}
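To give a sense of how this derived table is consumed, a Looker view built on top of it might look roughly like the following (the view, dimension, and measure names here are illustrative, not the actual Harness LookML):

view: unit_test_metrics {
  # The derived_table block containing the SQL shown above goes here.
  dimension: suite_name { type: string sql: ${TABLE}.suite_name ;; }
  dimension: class_name { type: string sql: ${TABLE}.class_name ;; }
  dimension: test_name { type: string sql: ${TABLE}.test_name ;; }
  measure: total_count { type: sum sql: ${TABLE}.total_count ;; }
  measure: failed_count { type: sum sql: ${TABLE}.failed_count ;; }
  # Failure rate used for dashboard filtering and sorting
  measure: failure_rate {
    type: number
    sql: 1.0 * ${failed_count} / NULLIF(${total_count}, 0) ;;
    value_format_name: percent_2
  }
}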
Conclusion
Through the migration to pre-aggregated tables, the compression of older data, and optimized query strategies, we achieved a 95% reduction in dashboard query time. This journey underscores the critical role of data aggregation, compression, and performance tuning in managing large-scale analytics systems efficiently, and ultimately in improving the experience for dashboard users.