SQL Puzzle 2: Absentees

2020-03-02 (Monday) | 1200 words (~6 minutes reading)

Notes to myself.

The second puzzle in Joe Celko’s SQL Puzzles is about tracking employee absenteeism.

The initial table structure is like this:

CREATE TABLE personnel (
	emp_id integer PRIMARY KEY
);

CREATE TABLE excuse_list (
	reason_code varchar(40) UNIQUE
);

CREATE TABLE absenteeism (
	emp_id integer NOT NULL REFERENCES personnel (emp_id),
	absent_date date NOT NULL,
	reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code),
	severity_points integer NOT NULL
		CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4),
	PRIMARY KEY (emp_id, absent_date)
);

There are already some nice properties to notice in this schema:

All columns are NOT NULL, which is ideal.
The list of valid absence reasons is extracted into an explicit table that will be easier to work with than an enum column within the absenteeism table.
There’s a simple but useful check constraint to ensure severity_points is valid (this could also have been a foreign key to a separate table).
There’s a natural primary key of (emp_id, absent_date) on the absenteeism table. This will improve consistency and let the database optimise queries more effectively.

The goal in this SQL puzzle is to make (kind of nasty!) decisions about the consequences of employee absenteeism using this schema, with fiddly rules around how to count and rate absence:

If an employee gets >40 severity points with a one-year period, they get fired.
Absences of more than one day in a row are counted as illness, not as absence, i.e. we only count single isolated days.
We do want to be able to count illness days separately, they just aren’t considered for the absence-firing rule above.

We can change the schema if it makes this easier to achieve.

First solution: allow severity points to be zero

This approach solves the issue of not counting subsequent contiguous days as absence, but as illness instead, by giving them a severity score of zero.

To do that we need to allow zero as a valid value for the severity_points column:

ALTER TABLE absenteeism
	DROP CONSTRAINT severity_points_range
	, ADD  CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 0 AND 4);

Note that this is a good reason to define check constraints with a name, i.e. CONSTRAINT foobar CHECK (...) and not just CHECK (...).

Celko makes a good point that it is a mistake to treat zero as a “waste of space” or not a real value. Zero is a number that is just as meaningful as any other. See also: zero, one, infinity rule.

We can then use a single update query to zero-out the days of absence that we want to treat as illness, i.e with zero severity:

INSERT INTO excuse_list (reason_code) VALUES ('long term illness');

UPDATE absenteeism
	SET severity_points = 0,
	reason_code = 'long term illness'
WHERE EXISTS (
	SELECT *
	FROM absenteeism AS A2
	WHERE absenteeism.emp_id = A2.emp_id
		AND absenteeism.absent_date = (A2.absent_date - INTERVAL '1 DAY')
);

Here we’re using a subquery to identify a day of absence for the same employee that is one day before the current one, i.e. all but the first in a contiguous series of days. If there is such a preceding day, we set severity points to zero and the reason code to “long term illness”.

It’s then straightforward to sum up the total absence severity points for all employees to see if anyone has accumulated 40 or more points:

SELECT emp_id, SUM(severity_points)
FROM absenteeism
GROUP BY emp_id;

This doesn’t seem to include the “within a one-year period” requirement described in the original problem statement, but maybe that is assumed.

With that you can then do a delete query like this to “fire” those employees (which seems a bit brutal but this is just an example after all):

DELETE FROM personnel
	WHERE emp_id = (
		SELECT a.emp_id
		FROM absenteeism a
		WHERE a.emp_id = personnel.emp_id
		GROUP BY a.emp_id
		HAVING SUM(severity_points) >= 40
	);

Apparently the GROUP BY is not needed in SQL-92+, only in older SQL implementations, which is surprising to me. I would have thought it was essential here.

It’s quite nice that the business rule is described pretty naturally in the SQL query.

This solution seems fine, although it would be nice if the database actually enforced the “severity score is zero if there is a preceding day of absence” constraint, rather than requiring an update query to keep that correct.

Second solution: include the “one-year period” requirement

Now it’s acknowledged that the original firing victim selection query did not consider the one-year period requirement, which can be added like this:

SELECT emp_id
FROM absenteeism
WHERE absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS')
  AND absent_date < CURRENT_TIMESTAMP
GROUP BY emp_id
HAVING SUM(severity_points) >= 40;

That query uses a suggestion from SQL Performance Explained to use an explicit range condition for queries over continuous date periods.

The other issue raised in this second solution is that the previous solution left absences in place, or at least would try to – the first delete query would actually be rejected for breaking a foreign key constraint on the absenteeism.emp_id column.

As you might expect, the best way to deal with this is to let the database handle it, in this case with an ON DELETE clause when creating the table:

CREATE TABLE absenteeism (
	emp_id integer NOT NULL REFERENCES personnel (emp_id) ON DELETE CASCADE,
	absent_date date NOT NULL,
	reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code),
	severity_points integer NOT NULL
		CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4),
	PRIMARY KEY (emp_id, absent_date)
);

Another issue is also raised: the zero-points-for-illness update will only work if weekend days are also marked as absences, which is likely not to happen. The suggested solution is to also allow routinely marking weekends as absences with a severity score of zero and a dedicated reason_code.

I’m not a fan of this idea, as I think it conceptually breaks what the absenteeism table means in this domain. Anyone querying the absenteeism table always has to know and remember to allow for this special weekends workaround in their query. It’s a bit like tomb-stoning or soft deletes, with similar problems.

Third solution: explicit calendar table

The above objection is now covered by the suggestion to use an explicit calendar table that describes working days (similar to the problem that php-business-time and ts-business-time address, but solved with an SQL table).

CREATE TABLE calendar (
	cal_date date NOT NULL UNIQUE,
	day_type varchar(40)
);

With a calendar table that describes working days, you can then cleanly select the relevant days of absence:

SELECT a.emp_id
FROM absenteeism AS a, calendar as c
WHERE c.cal_date = a.absent_date
  AND c.day_type = 'working'
  AND a.absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS')
  AND a.absent_date < CURRENT_TIMESTAMP
GROUP BY a.emp_id
HAVING SUM(severity_points) >= 40;

Joe Celko’s SQL Puzzles & Answers

NotesToSelf.Dev

SQL Puzzle 2: Absentees

First solution: allow severity points to be zero

Second solution: include the “one-year period” requirement

Third solution: explicit calendar table

Tech mentioned