You are viewing kristiannielsen

Kristian Nielsen - Selecting rows holding group-wise maximum of a field
November 21st, 2008
11:36 pm

[Link]

Previous Entry Share Next Entry
Selecting rows holding group-wise maximum of a field

Today there was a question on the Freenode MySQL channel about a classical problem: Rows holding group-wise maximum of a column. This is a problem that I keep encountering every so often, so I thought I would write up something about it.

A good example of the problem is a table like the following holding versioned objects:

CREATE TABLE object_versions (
  object_id INT NOT NULL,
  version INT NOT NULL,
  data VARCHAR(1000),
  PRIMARY KEY(object_id, version)
) ENGINE=InnoDB
Now it is easy to get the latest version for an object:
SELECT data FROM object_versions WHERE object_id = ? ORDER BY version DESC LIMIT 1
The query will even be very fast as it can use the index to directly fetch the right row:
mysql> EXPLAIN SELECT data FROM object_versions
WHERE object_id = 42 ORDER BY version DESC LIMIT 1;
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
| id | select_type | table           | type | possible_keys | key     | key_len | ref   | rows | Extra       |
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
|  1 | SIMPLE      | object_versions | ref  | PRIMARY       | PRIMARY | 4       | const |    3 | Using where | 
+----+-------------+-----------------+------+---------------+---------+---------+-------+------+-------------+
1 row in set (0.00 sec)

But what if we want to select the latest version of all (or some range of) objects? This is a problem that I think standard SQL (or any SQL dialect that I know of, including MySQL) has no satisfactory answer to.

Intuitively, the problem should be simple for the database engine to solve. Just traverse the BTree structure of the primary key (assume InnoDB clustered index storage here), and for each value of the first part of the primary key (object_id), pick the highest value of the second part (version) and return the corresponding row (this is similar to what I believe is sometimes called index skip scan). However, this idea is surprisingly difficult to express in SQL.

The first method suggested in the above link to the MySQL manual works in this case, but it is not all that great in my opinion. For example, it does not work well if the column that MAX is computed over is not unique per group (as it is in this example with versions); it will return all of the maximal rows which may or may not be what you wanted. And the query plan is not all that great either:

mysql> EXPLAIN SELECT data FROM object_versions o1 WHERE version =
(SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id);
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
| id | select_type        | table | type | possible_keys | key     | key_len | ref                   | rows | Extra       |
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
|  1 | PRIMARY            | o1    | ALL  | NULL          | NULL    | NULL    | NULL                  |    6 | Using where | 
|  2 | DEPENDENT SUBQUERY | o2    | ref  | PRIMARY       | PRIMARY | 4       | einstein.o1.object_id |    1 | Using index | 
+----+--------------------+-------+------+---------------+---------+---------+-----------------------+------+-------------+
2 rows in set (0.00 sec)
It is apparently doing a full table scan with an index lookup for every row in the table, which is not that bad, but certainly more expensive than necessary, especially if there are many versions per object. Still, it is probably the best method in most cases (or so I thought first, but see benchmarks below!).

The two other suggestions from the MySQL manual are not perfect either (though the first one is blazingly fast, see benchmarks below). One is to use an uncorrelated subquery with a join:

mysql> EXPLAIN SELECT o1.data FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2
ON (o1.object_id = o2.object_id AND o1.version = o2.version);
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
| id | select_type | table           | type   | possible_keys | key     | key_len | ref                     | rows | Extra       |
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
|  1 | PRIMARY     | <derived2>      | ALL    | NULL          | NULL    | NULL    | NULL                    |    3 |             | 
|  1 | PRIMARY     | o1              | eq_ref | PRIMARY       | PRIMARY | 8       | o2.object_id,o2.version |    1 |             | 
|  2 | DERIVED     | object_versions | index  | NULL          | PRIMARY | 8       | NULL                    |    6 | Using index | 
+----+-------------+-----------------+--------+---------------+---------+---------+-------------------------+------+-------------+
At first, I actually did not know exactly how to interpret this plan output. After the benchmarks given below, I now think this plan is actually very good, apparently it is first using something like an index skip scan to compute the MAX() in the uncorrelated subquery, and then looking up each row using the primary key. It still has the issue with multiple rows if version was not unique per object.

The other suggestion uses an outer self-join:

mysql> EXPLAIN SELECT o1.data FROM object_versions o1
LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version
WHERE o2.object_id IS NULL;
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref                   | rows | Extra                                |
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
|  1 | SIMPLE      | o1    | ALL  | NULL          | NULL    | NULL    | NULL                  |    6 |                                      | 
|  1 | SIMPLE      | o2    | ref  | PRIMARY       | PRIMARY | 4       | einstein.o1.object_id |    1 | Using where; Using index; Not exists | 
+----+-------------+-------+------+---------------+---------+---------+-----------------------+------+--------------------------------------+
2 rows in set (0.00 sec)
The plan again looks reasonable, but not optimal. And somehow, all three methods feel unnatural for something that ought to be simple to express.

And in fact, there is a nice way to express this in SQL, except that it does not work (at least not in MySQL):

SELECT MAX(version, data) FROM object_versions GROUP BY object_id;
If there was just support for computing MAX() over multiple columns like this, this query would be a nice, natural, and simple way to express our problem. And it would be relatively easy for database engines to create the optimal plan, I think index skip scan is fairly standard already for single-column MAX() with GROUP BY. And the syntax feels very natural, even though it does bend the rules somehow by a single expression (MAX(version, data)) returning multiple columns. I have half a mind to try to implement it myself in MySQL or Drizzle one day ...

In fact, one can almost use this technique by an old trick-of-the-trade:

mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
+---------------------------------+
| max(concat(version, ":", data)) |
+---------------------------------+
| 2:foo2                          | 
| 1:bar                           | 
| 3:baz2                          | 
+---------------------------------+
3 rows in set (0.00 sec)

mysql> EXPLAIN SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table           | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
|  1 | SIMPLE      | object_versions | index | NULL          | PRIMARY | 8       | NULL |    6 |       | 
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+-------+
1 row in set (0.00 sec)
Though I consider this (and variations thereof) a hack with limited practical usage.

And speaking of hacks, there is actually another way to solve the problem, one which I learned about recently at a customer:

mysql> EXPLAIN SELECT data FROM
(SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id;
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
| id | select_type | table           | type  | possible_keys | key     | key_len | ref  | rows | Extra                           |
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived2>      | ALL   | NULL          | NULL    | NULL    | NULL |    6 | Using temporary; Using filesort | 
|  2 | DERIVED     | object_versions | index | NULL          | PRIMARY | 8       | NULL |    6 |                                 | 
+----+-------------+-----------------+-------+---------------+---------+---------+------+------+---------------------------------+
Get it? This cleverly/evilly (ab)uses the MySQL non-standard extension which allows SELECT of columns not in the GROUP BY clause even without using aggregate functions. MySQL will return a value for the column from an "arbitrary" row in the group. In practise, it chooses it deterministically from the first row in the group, which is why this trick seems to work well in practise. But it is clearly documented to not be supported, so not really something to recommend, though interesting to see.

Bonus benchmark

As a free bonus, I decided to run some quick benchmarks. As it turns out, the results are quite surprising!

So I filled the table above with 1,000,000 rows, 1000 objects each with 1000 versions. Total table size is about 50Mb or so. I then ran each of the five above queries:

mysql> SELECT data FROM object_versions o1
WHERE version = (SELECT MAX(version) FROM object_versions o2 WHERE o1.object_id = o2.object_id);
1000 rows in set (4 min 22.86 sec)

mysql> SELECT o1.data FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2
ON (o1.object_id = o2.object_id AND o1.version = o2.version);
1000 rows in set (0.01 sec)

mysql> SELECT o1.data FROM object_versions o1
LEFT JOIN object_versions o2 ON o1.object_id = o2.object_id AND o1.version < o2.version
WHERE o2.object_id IS NULL;
1000 rows in set (2 min 42.72 sec)

mysql> SELECT MAX(CONCAT(version, ":", data)) FROM object_versions GROUP BY object_id;
1000 rows in set (0.63 sec)

mysql> SELECT data FROM
(SELECT * FROM object_versions ORDER BY object_id DESC, version DESC) t GROUP BY object_id;
1000 rows in set (15.61 sec)

The differences are huge!

The clear winner is query 2, the uncorrelated subquery. Apparently it can do an index skip scan for the inner MAX/GROUP BY query, followed with a primary key join, so it only ever has to touch 1000 rows. An almost optimal plan.

Query 1 and 3 (correlated subquery and outer join) are spectacularly bad. It looks as if they are doing something like for each 1,000,000 rows in the full table scan, do a 1000 row index range scan in the subquery/join, for a total of 1 billion rows examined. Or something. Not good.

Query 4 and 5, the trick queries, are doing so-so, probably they get away with a sort of a full table scan of 1,000,000 rows.

Conclusions: Uncorrelated subquery is the undisputed winner!

Tags: ,

(15 comments | Leave a comment)

Comments
 
[User Picture]
From:arjen_lentz
Date:November 22nd, 2008 03:21 am (UTC)

Jan Kneschke collected this a few years back...

(Link)
From:kristiannielsen
Date:November 22nd, 2008 05:24 am (UTC)

Re: Jan Kneschke collected this a few years back...

(Link)
Cool, thanks for the link!
From:snehapadma.blogspot.com
Date:November 22nd, 2008 05:31 am (UTC)

Thanks for the Bonus

(Link)
Hey!

Thanks for the Benchmark. I used query type 2 for a project, and now I'm sure that it gives fastest result.
From:(Anonymous)
Date:November 22nd, 2008 09:22 am (UTC)

SQL Standard and GROUP BY

(Link)
Hi!

nice post. This:

"This cleverly/evilly (ab)uses the MySQL non-standard extension which allows SELECT of columns not in the GROUP BY clause even without using aggregate functions."

is not true though - not completely. You are right that your sample query is not valid according to the standard, but for a different reason (depending on which version of the standard you're looking at)

The 1992 version of the standard required SELECT-ed columns to be either part of the GROUP BY list or wrapped in an aggregate function, the 1999 and 2003 versions of the standard allow non-GROUP BY columns to be SELECT-ed even when they are not wrapped in an aggregate function as long as such a column is functionally dependent upon the GROUP BY list.

(http://rpbouman.blogspot.com/2007/05/debunking-group-by-myths.html)

There is BTW another MySQL specific hack to get the groupwise maximum:

SELECT SUBSTRING_INDEX(GROUP_CONCAT(data ORDER BY version), ',', 1)
FROM object_versions
GROUP BY object_id
[User Picture]
From:awfief
Date:November 23rd, 2008 04:22 pm (UTC)
(Link)
BTW, one way to augment this is to have the uncorrelated subquery as a view definition -- that way your developers can do a simple select from the VIEW as if it were a regular table. (of course you want to EXPLAIN queries because views can easily get turned into derived tables with a seemingly innocent query)
From:(Anonymous)
Date:November 25th, 2008 02:41 pm (UTC)

A different way...

(Link)
I use an "expired"(expired DATETIME DEFAULT NULL,) column instead of a "version" column. If you use UNIQUE(object_id,expired) you can find all the newest versions simply by:

SELECT object_id, data FROM object_versions WHERE expired IS NULL

The disadvantage is that you have to expire the former version before you insert the new one. But it adds information to your system, as you can easily see when your versions were valid.

Best regards,
te2720(hosted at)yahoo.dk
From:kristiannielsen
Date:November 27th, 2008 09:38 pm (UTC)
(Link)
In another discussion on #mysql, it was pointed out that the multiple row issue can be solved with an extra GROUP BY on the outer query. Something like this, for example:

SELECT o1.object_id, o1.version, o1.data
FROM object_versions o1
INNER JOIN (SELECT object_id, MAX(version) AS version FROM object_versions GROUP BY object_id) o2 ON (o1.object_id = o2.object_id AND o1.version = o2.version)
GROUP BY object_id;

(in this example (object_id, version) is unique, so there is no possibility of multiple rows, but this will not generally be true).
From:xaprb
Date:November 28th, 2008 04:34 am (UTC)

MAX(version, data)

(Link)
This is actually really easy. No need for hacking server code.

MAX(GREATEST(version, data))
From:kristiannielsen
Date:November 28th, 2008 06:34 am (UTC)

Re: MAX(version, data)

(Link)
No, what I meant is something different, sorry for not explaining properly. The idea is that MAX(a,b) returns the maximum _tuple_ (a,b), ie. it returns _two_ columns.

So the query would be this:

SELECT MAX(version, data) FROM object_versions GROUP BY object_id;

And it would return something like this:

version data
1 mydata
10 version 10 of something
147 string with lots of versions
...

So MAX with multiple coloumns would work similar to GROUP BY with multiple columns.

Unfortunately it is not all that trivial to implement.
From:(Anonymous)
Date:October 30th, 2009 09:50 am (UTC)

how to get the second highest salary department wise

(Link)
i have a table
emp(emp_name,department,salary).
i want to find out the second highest salary department wise.

would you please help to solve this problem ?
From:(Anonymous)
Date:October 30th, 2009 11:23 am (UTC)

Re: how to get the second highest salary department wise

(Link)
SELECT a.department, MAX(a.salary)
FROM emp a
INNER JOIN (SELECT department, MAX(salary) maxsal FROM emp GROUP BY department) b
ON (a.department = b.department)
WHERE a.salary < b.maxsal
GROUP BY b.department;

But depending on your needs, it might be more flexible to just SELECT * FROM emp ORDER BY department, salary, and do the processing in your application language.
From:(Anonymous)
Date:July 16th, 2010 09:20 pm (UTC)

<3

(Link)
Your benchmarks here changed my life. Ive been solving this problem for a long, long time now with the pure join-based solution. For YEARS, really. and yes, other people have collected this list before, but no one ever went all out and proved that one of them shines above the rest.

Ive long had issues with the pure-join solution not being "fast enough", but Ive just lived with it assuming it was just "one of those MySQL things". In moving to the Query 2 recommended here, Ive taken queries that were taking 15 minutes and cut them down to .41 seconds!!
From:(Anonymous)
Date:April 11th, 2011 02:11 am (UTC)

Just saying "hi" from the Gold Coast, Australia - Looking Forward To Getting Involved.

(Link)
Thanks...this looks really interesting. I am looking forward to having my say!
From:(Anonymous)
Date:June 9th, 2012 08:36 am (UTC)
(Link)
Thank you very much! Just what I looking for.
with a few index tweak, I can optimize my "expensive" query.
From:(Anonymous)
Date:December 9th, 2012 07:34 am (UTC)
(Link)
Huge help! Much appreciated.

The query I do is based on multiple column criteria and grouping. So a bit more involved, but the same basic idea.

SELECT t1.* FROM t as t1 INNER JOIN
(SELECT c1,c2,c3,max(c4) as c4,min(c5) as c5 GROUP BY c1,c2,c3) t2
ON (t1.c1 = t2.c1 AND t1.c2=t2.c2 AND t1.c3=t2.c3)
ORDER BY t1.c1 ASC

Your post took my query on a 50k+ records set from minutes to <1 second.

Then further creating a non-unique index on c1,c2,c3 (the group terms), took that result down by 99%. Huge savings!

Powered by LiveJournal.com