Beware of SQL averages over dates with empty days

It’s pretty common to use SQL to extract sales summaries from tables that represent sales orders as rows. These kinds of queries often use the AVG() aggregate function to summarise the sales volume. There is a major pitfall when doing this, though!

We might run a sub-query like this to get daily revenue totals from the orders table:

SELECT order_date,
       SUM(order_total) AS total_revenue
FROM orders
GROUP BY order_date

Those aggregated revenue query results might look something like this:

+------------+---------------+
| date       | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00        |
| 2021-08-02 | 795.00        |
| 2021-08-05 | 795.00        |
| 2021-08-06 | 1860.00       |
| 2021-08-08 | 495.00        |
+----------+-----------------+

If we want to calculate a daily average over a period like this, we might use AVG(total_revenue). In this case with these five rows that would be 977.00. This looks like it might be right, but it’s not.

The issue is that there are days with no orders in that range. 2021-08-03, 2021-08-04 and 2021-08-07 sadly had no sales. They should be in the aggregate query results with a total_revenue of 0.00, like this:

+------------+---------------+
| date       | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00        |
| 2021-08-02 | 795.00        |
| 2021-08-03 | 0.00          |
| 2021-08-04 | 0.00          |
| 2021-08-05 | 795.00        |
| 2021-08-06 | 1860.00       |
| 2021-08-07 | 0.00          |
| 2021-08-08 | 495.00        |
+----------+-----------------+

With those empty days included, the result of AVG(total_revenue) is now 610.625, which is the real average daily revenue in that range of dates.

This is a common pitfall when grouping by date in SQL queries, as it’s easy to forget that if there are no representative rows in the source table, then those rows will be absent in the aggregated results, instead of appearing as a zero value like they should.

So how can we ensure that we get rows for all the days, even if some of them had no sales?

We can re-arrange the query to be based on a date range, and then join the orders table on that, so that no absent days are skipped.

In Postgres, you can achieve this by selecting from the generate_series function which makes this relatively easy.

What about MySQL? It’s a bit harder to generate a date range in MySQL, but doable.

We end up with a query like this:

SELECT
    date_series.dates AS date,
    IF(SUM(orders.order_total) IS NULL, 0.00, SUM(orders.order_total)) AS total_revenue
FROM (
     SELECT series.dates
    FROM (
        SELECT DATE('2021-08-08') - INTERVAL units.pos DAY AS dates
        FROM (SELECT 0 AS pos UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS units
    ) series
    WHERE series.dates >= DATE('2021-08-08') - INTERVAL 7 DAY
         ) date_series
LEFT JOIN orders
    ON DATE(orders.created_at) = date_series.dates
GROUP BY date

Which gives results like this:

+------------+---------------+
| date       | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00        |
| 2021-08-02 | 795.00        |
| 2021-08-03 | 0.00          |
| 2021-08-04 | 0.00          |
| 2021-08-05 | 795.00        |
| 2021-08-06 | 1860.00       |
| 2021-08-07 | 0.00          |
| 2021-08-08 | 495.00        |
+----------+-----------------+

Notice that the days without any orders still get represented with a row with total_revenue as 0.00. It’s then safe to run an AVG() across that:

SELECT AVG(total_revenue)
FROM (
     SELECT
    date_series.dates AS date,
    IF(SUM(orders.order_total) IS NULL, 0.00, SUM(orders.order_total)) AS total_revenue
FROM (
     SELECT series.dates
    FROM (
        SELECT DATE('2021-08-26') - INTERVAL units.pos DAY AS dates
        FROM (SELECT 0 AS pos UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS units
    ) series
    WHERE series.dates >= DATE('2021-08-26') - INTERVAL 7 DAY
         ) date_series
LEFT JOIN orders
    ON DATE(orders.created_at) = date_series.dates
GROUP BY date
         ) AS daily_totals

We then get the correct average daily revenue in the period of 610.625.

By the way, if you need help with SQL databases in your business, you can hire me as a freelance database developer.


Tech mentioned