Beware of SQL averages over dates with empty days
It’s pretty common to use SQL to extract sales summaries from tables that represent sales orders as rows. These kinds of queries often use the AVG() aggregate function to summarise the sales volume. There is a major pitfall when doing this, though!
We might run a sub-query like this to get daily revenue totals from the orders table:
SELECT order_date,
SUM(order_total) AS total_revenue
FROM orders
GROUP BY order_date
Those aggregated revenue query results might look something like this:
+------------+---------------+
| date | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00 |
| 2021-08-02 | 795.00 |
| 2021-08-05 | 795.00 |
| 2021-08-06 | 1860.00 |
| 2021-08-08 | 495.00 |
+----------+-----------------+
If we want to calculate a daily average over a period like this, we might use
AVG(total_revenue)
. In this case with these five rows that would be 977.00
.
This looks like it might be right, but it’s not.
The issue is that there are days with no orders in that range. 2021-08-03
,
2021-08-04
and 2021-08-07
sadly had no sales. They should be in the
aggregate query results with a total_revenue
of 0.00
, like this:
+------------+---------------+
| date | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00 |
| 2021-08-02 | 795.00 |
| 2021-08-03 | 0.00 |
| 2021-08-04 | 0.00 |
| 2021-08-05 | 795.00 |
| 2021-08-06 | 1860.00 |
| 2021-08-07 | 0.00 |
| 2021-08-08 | 495.00 |
+----------+-----------------+
With those empty days included, the result of AVG(total_revenue)
is now
610.625
, which is the real average daily revenue in that range of dates.
This is a common pitfall when grouping by date in SQL queries, as it’s easy to forget that if there are no representative rows in the source table, then those rows will be absent in the aggregated results, instead of appearing as a zero value like they should.
So how can we ensure that we get rows for all the days, even if some of them had no sales?
We can re-arrange the query to be based on a date range, and then join the orders table on that, so that no absent days are skipped.
In Postgres, you can achieve this by selecting from the generate_series function which makes this relatively easy.
What about MySQL? It’s a bit harder to generate a date range in MySQL, but doable.
We end up with a query like this:
SELECT
date_series.dates AS date,
IF(SUM(orders.order_total) IS NULL, 0.00, SUM(orders.order_total)) AS total_revenue
FROM (
SELECT series.dates
FROM (
SELECT DATE('2021-08-08') - INTERVAL units.pos DAY AS dates
FROM (SELECT 0 AS pos UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS units
) series
WHERE series.dates >= DATE('2021-08-08') - INTERVAL 7 DAY
) date_series
LEFT JOIN orders
ON DATE(orders.created_at) = date_series.dates
GROUP BY date
Which gives results like this:
+------------+---------------+
| date | total_revenue |
+------------+---------------+
| 2021-08-01 | 940.00 |
| 2021-08-02 | 795.00 |
| 2021-08-03 | 0.00 |
| 2021-08-04 | 0.00 |
| 2021-08-05 | 795.00 |
| 2021-08-06 | 1860.00 |
| 2021-08-07 | 0.00 |
| 2021-08-08 | 495.00 |
+----------+-----------------+
Notice that the days without any orders still get represented with a row with
total_revenue
as 0.00
. It’s then safe to run an AVG()
across that:
SELECT AVG(total_revenue)
FROM (
SELECT
date_series.dates AS date,
IF(SUM(orders.order_total) IS NULL, 0.00, SUM(orders.order_total)) AS total_revenue
FROM (
SELECT series.dates
FROM (
SELECT DATE('2021-08-26') - INTERVAL units.pos DAY AS dates
FROM (SELECT 0 AS pos UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) AS units
) series
WHERE series.dates >= DATE('2021-08-26') - INTERVAL 7 DAY
) date_series
LEFT JOIN orders
ON DATE(orders.created_at) = date_series.dates
GROUP BY date
) AS daily_totals
We then get the correct average daily revenue in the period of 610.625
.
By the way, if you need help with SQL databases in your business, you can hire me as a freelance database developer.