November 7, 2024

Optimizing Query Performance for Large Datasets Powering Dashboards

Table of Contents

In data-intensive applications, managing query timeouts and high storage expenses can pose significant challenges, particularly when dealing with billions of rows of data. For instance, Harness dashboards (which are currently powered by Looker) such as Unit Test Metrics were fueled by a tables containing over 4 billion rows. This often led to sluggish queries and elevated operational costs. In this blog post, we share our journey from slow query performance to rapid dashboard loading times.

The Problem: Slow Queries on a Massive Dataset

Our dashboards serve multiple customers by presenting test results filtered by failure rates and additional parameters such as pipeline_identifier and project_identifier. The database that supports this dashboard is built on a PostgreSQL-backed TimescaleDB instance. Over the course of six months, the Test_Results table has expanded to an impressive 4 billion rows, consuming 3TB of storage. As a result of the high volume of tests executed, our users have experienced query delays of 10-15 minutes and frequent timeouts. An internal analysis identified the following root causes:

  • Large Data Volume: The Test_Results table contained an enormous quantity of raw test results.
  • Inefficient Querying: We were scanning billions of rows for each dashboard query, resulting in prolonged load times despite the presence of indexes.
__wf_reserved_inherit
User Experience before and after optimization

The Solution: Aggregated Tables for Performance Optimization

To enhance query performance, we initiated the implementation of aggregated tables designed to hold pre-computed summaries. The data was categorized into two main types:

  1. Weekly Aggregated Data: This information is stored in the weekly_test_summary table.
  2. Daily Aggregated Data: This data is maintained in the daily_test_summary table.

These tables are organized based on key parameters, including suite_name, test_name, class_name, and result. This strategic aggregation significantly minimizes the number of rows queried, thereby accelerating response times.  
Note: The result of a test can be passed, failed, or skipped.

Example Schema

Test_Results

  • created_at (timestamp)
  • account_id (text)
  • build_id (text)
  • name (text)
  • suite_name (text)
  • class_name (text)
  • result (result_t)  

daily_test_summary Table

  • uuid (uuid)
  • aggregation_date (date)
  • test_name (text)
  • suite_name (text)
  • class_name (text)
  • total_count (integer)
  • failed_count (integer)
  • id (uuid)

weekly_test_summary Table

  • uuid (uuid)
  • aggregation_week (date)
  • test_name (text)
  • suite_name (text)
  • class_name (text)
  • total_count (integer)
  • failed_count (integer)
  • id (uuid)

Optimized Query Architecture: Parallel CTEs and Indexing

In our approach to managing data distributed across three tables—pipeline_details, daily_test_summary, and weekly_test_summary—we have developed an optimized query architecture:

  • Parallel Common Table Expressions (CTEs): We efficiently retrieved data in parallel from weekly, daily, and today's datasets.
  • Time-based Partitioning: We divided the date range into segments: weekly data (from Monday to Sunday), daily data (for non-weekly dates), and today's data (sourced from the Test_Results table).
  • Proper Indexing: By implementing indexes on aggregation columns (e.g., uuid, account_id, and aggregation_date), we achieved a remarkable 95% improvement in query performance, albeit with some additional storage overhead for the aggregated tables.
  • Why TimescaleDB: This database supports standard SQL queries and is specifically designed for time series data, offering robust storage, performance, and analytical capabilities for large-scale data.

The final dashboard query integrates data from all three tables, producing a cohesive result set. By utilizing parallel execution and effective indexing, we reduced query times from 15 minutes to just 15 seconds, even when processing several months' worth of data.

Data Migration and Daily Aggregation Jobs

To facilitate a seamless transition, we executed a one-time job that consolidated historical data from the Test_Results table into the newly established aggregated tables. To maintain up-to-date information, we implemented a daily aggregation job within a separate application, specifically a test management service, which automatically processes data every 24 hours and refreshes the aggregated tables.

Final Optimized Query Logic in Looker-

Data is retrieved from each CTE for specific time ranges. The time range for weekly_data spans from the first Monday to the last Sunday, sourced from weekly_test_summary. The remaining data is drawn from daily_data, specifically daily_test_summary, while today_data is obtained from the Test_Results table. To ensure optimal performance for our Looker dashboards, we implemented a derived table utilizing optimized CTEs, integrated Looker liquid logic for conditional querying, and aggregated data from pre-aggregated tables, ultimately merging it into a single result.

derived_table: {
 sql:
   WITH date_range AS (
     SELECT
       {% date_start date_range %}::date AS start_date,
       LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE) AS end_date,
       DATE_TRUNC('week', {% date_start date_range %}::date + INTERVAL '7 days') AS week_start_date,
       DATE_TRUNC('week', LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE)) - INTERVAL '1 day' AS week_end_date,
       CASE WHEN LEAST(COALESCE({% date_end date_range %}, CURRENT_DATE), CURRENT_DATE) >= CURRENT_DATE THEN 1 ELSE 0 END AS include_today
   ),
   -- Getting data from weekly aggregate table
   weekly_data AS (
     SELECT
       wts.suite_name,
       wts.class_name,
       wts.test_name,
       SUM(wts.total_count) AS total_count,
       SUM(wts.failed_count) AS failed_count
     FROM
       weekly_test_summary wts
     JOIN
       pipeline_details pd ON wts.uuid = pd.uuid
     WHERE
       wts.aggregation_week BETWEEN date_range.week_start_date AND date_range.week_end_date
       AND pd.account_id = {{ _user_attributes['account_id'] }}
     GROUP BY
       wts.suite_name, wts.class_name, wts.test_name
   ),
   -- Getting data from  daily aggregated table
   daily_data AS (
     SELECT
       dtd.suite_name,
       dtd.class_name,
       dtd.name,
       SUM(dtd.total_count) AS total_count,
       SUM(dtd.failed_count) AS failed_count
     FROM
       daily_test_summary dtd
     JOIN
       pipeline_details pd ON dtd.uuid = pd.uuid
     WHERE
       dtd.aggregation_date BETWEEN date_range.start_date AND date_range.end_date
       AND dtd.aggregation_date NOT BETWEEN date_range.week_start_date AND date_range.week_end_date
       AND pd.account_id = {{ _user_attributes['account_id'] }}
     GROUP BY
       dtd.suite_name, dtd.class_name, dtd.test_name
   ),
   -- Getting data from Test_Results for today's data
   today_data AS (
     SELECT
       e.suite_name, e.class_name, e.test_name
       COUNT(*) AS total_count,
       COUNT(CASE WHEN e.result IN ('failed', 'error') THEN 1 ELSE NULL END) AS failed_count
     FROM
       Test_Results e
     WHERE
       e.created_at >= CURRENT_DATE
       AND e.account_id = {{ _user_attributes['account_id'] }}
     GROUP BY
       e.suite_name, e.class_name, e.test_name
   )
   SELECT
     suite_name, class_name, name,
     SUM(total_count) AS total_count, SUM(failed_count) AS failed_count
   FROM (
   -- for aggregating data from all 3 CTE's
     SELECT * FROM weekly_data
     UNION ALL
     SELECT * FROM daily_data
     UNION ALL
     SELECT * FROM today_data WHERE (SELECT include_today FROM date_range) = 1
   ) combined_data
   GROUP BY suite_name, class_name, test_name
 ;;
}
__wf_reserved_inherit
Dashboard overview

Conclusion

Through the migration to pre-aggregated tables, the compression of older data, and the implementation of optimized query strategies, we achieved a remarkable 95% reduction in dashboard query time. This journey underscores the critical role of data aggregation, compression, and performance tuning in the efficient management of large-scale analytics systems, ultimately enhancing the user experience for dashboard users.

Database DevOps