SQL Puzzle 2: Absentees
The second puzzle in Joe Celko’s SQL Puzzles is about tracking employee absenteeism.
The initial table structure is like this:
CREATE TABLE personnel (
emp_id integer PRIMARY KEY
);
CREATE TABLE excuse_list (
reason_code varchar(40) UNIQUE
);
CREATE TABLE absenteeism (
emp_id integer NOT NULL REFERENCES personnel (emp_id),
absent_date date NOT NULL,
reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code),
severity_points integer NOT NULL
CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4),
PRIMARY KEY (emp_id, absent_date)
);
There are already some nice properties to notice in this schema:
- All columns are
NOT NULL
, which is ideal. - The list of valid absence reasons is extracted into an explicit table that will be easier to work with than an enum column within the absenteeism table.
- There’s a simple but useful check constraint to ensure
severity_points
is valid (this could also have been a foreign key to a separate table). - There’s a natural primary key of
(emp_id, absent_date)
on theabsenteeism
table. This will improve consistency and let the database optimise queries more effectively.
The goal in this SQL puzzle is to make (kind of nasty!) decisions about the consequences of employee absenteeism using this schema, with fiddly rules around how to count and rate absence:
- If an employee gets >40 severity points with a one-year period, they get fired.
- Absences of more than one day in a row are counted as illness, not as absence, i.e. we only count single isolated days.
- We do want to be able to count illness days separately, they just aren’t considered for the absence-firing rule above.
We can change the schema if it makes this easier to achieve.
First solution: allow severity points to be zero
This approach solves the issue of not counting subsequent contiguous days as absence, but as illness instead, by giving them a severity score of zero.
To do that we need to allow zero as a valid value for the severity_points
column:
ALTER TABLE absenteeism
DROP CONSTRAINT severity_points_range
, ADD CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 0 AND 4);
Note that this is a good reason to define check constraints with a name, i.e.
CONSTRAINT foobar CHECK (...)
and not just CHECK (...)
.
Celko makes a good point that it is a mistake to treat zero as a “waste of space” or not a real value. Zero is a number that is just as meaningful as any other. See also: zero, one, infinity rule.
We can then use a single update query to zero-out the days of absence that we want to treat as illness, i.e with zero severity:
INSERT INTO excuse_list (reason_code) VALUES ('long term illness');
UPDATE absenteeism
SET severity_points = 0,
reason_code = 'long term illness'
WHERE EXISTS (
SELECT *
FROM absenteeism AS A2
WHERE absenteeism.emp_id = A2.emp_id
AND absenteeism.absent_date = (A2.absent_date - INTERVAL '1 DAY')
);
Here we’re using a subquery to identify a day of absence for the same employee that is one day before the current one, i.e. all but the first in a contiguous series of days. If there is such a preceding day, we set severity points to zero and the reason code to “long term illness”.
It’s then straightforward to sum up the total absence severity points for all employees to see if anyone has accumulated 40 or more points:
SELECT emp_id, SUM(severity_points)
FROM absenteeism
GROUP BY emp_id;
This doesn’t seem to include the “within a one-year period” requirement described in the original problem statement, but maybe that is assumed.
With that you can then do a delete query like this to “fire” those employees (which seems a bit brutal but this is just an example after all):
DELETE FROM personnel
WHERE emp_id = (
SELECT a.emp_id
FROM absenteeism a
WHERE a.emp_id = personnel.emp_id
GROUP BY a.emp_id
HAVING SUM(severity_points) >= 40
);
Apparently the GROUP BY
is not needed in SQL-92+, only in older SQL
implementations, which is surprising to me. I would have thought it was
essential here.
It’s quite nice that the business rule is described pretty naturally in the SQL query.
This solution seems fine, although it would be nice if the database actually enforced the “severity score is zero if there is a preceding day of absence” constraint, rather than requiring an update query to keep that correct.
Second solution: include the “one-year period” requirement
Now it’s acknowledged that the original firing victim selection query did not consider the one-year period requirement, which can be added like this:
SELECT emp_id
FROM absenteeism
WHERE absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS')
AND absent_date < CURRENT_TIMESTAMP
GROUP BY emp_id
HAVING SUM(severity_points) >= 40;
That query uses a suggestion from SQL Performance Explained to use an explicit range condition for queries over continuous date periods.
The other issue raised in this second solution is that the previous solution
left absences in place, or at least would try to – the first delete query would
actually be rejected for breaking a foreign key constraint on the
absenteeism.emp_id
column.
As you might expect, the best way to deal with this is to let the database
handle it, in this case with an ON DELETE
clause when creating the table:
CREATE TABLE absenteeism (
emp_id integer NOT NULL REFERENCES personnel (emp_id) ON DELETE CASCADE,
absent_date date NOT NULL,
reason_code varchar(40) NOT NULL REFERENCES excuse_list (reason_code),
severity_points integer NOT NULL
CONSTRAINT severity_points_range CHECK (severity_points BETWEEN 1 AND 4),
PRIMARY KEY (emp_id, absent_date)
);
Another issue is also raised: the zero-points-for-illness update will only work if weekend days are also marked as absences, which is likely not to happen. The suggested solution is to also allow routinely marking weekends as absences with a severity score of zero and a dedicated reason_code.
I’m not a fan of this idea, as I think it conceptually breaks what the
absenteeism
table means in this domain. Anyone querying the absenteeism
table always has to know and remember to allow for this special weekends
workaround in their query. It’s a bit like tomb-stoning or soft deletes, with
similar problems.
Third solution: explicit calendar table
The above objection is now covered by the suggestion to use an explicit
calendar
table that describes working days (similar to the problem that
php-business-time and
ts-business-time address,
but solved with an SQL table).
CREATE TABLE calendar (
cal_date date NOT NULL UNIQUE,
day_type varchar(40)
);
With a calendar table that describes working days, you can then cleanly select the relevant days of absence:
SELECT a.emp_id
FROM absenteeism AS a, calendar as c
WHERE c.cal_date = a.absent_date
AND c.day_type = 'working'
AND a.absent_date >= (CURRENT_TIMESTAMP - INTERVAL '365 DAYS')
AND a.absent_date < CURRENT_TIMESTAMP
GROUP BY a.emp_id
HAVING SUM(severity_points) >= 40;
Joe Celko’s SQL Puzzles & Answers