Returning the entire row of maximum value for each group (Group-wise Maximum in SQL / MySQL )
NOTE: The content of this article is inaccurate and needs to be amended. If you wish to be notified when the update is made, you may subscribe via email.
Last week I was joyfully hacking away at a project, progressing from from one task to another. Then it came time to do a non-trivial SQL query, though at the time I still thought the query would be nothing more than a select with a GROUP BY. I was wrong.
The Problem
What I wanted was to return the most recent log for each task for a particular user.
Here’s my table:
+--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | log_id | user_id | task_id | when | duration | description | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | 1 | 1 | 1 | 0000-00-00 00:00:00 | 120 | getting logging working at a very basic level | | 2 | 1 | 1 | 2010-06-26 13:52:49 | 20 | polishing off log insertion | | 3 | 1 | 2 | 2010-06-26 15:00:44 | 120 | | | 4 | 1 | 3 | 2010-06-27 14:00:40 | 300 | over 30 serves for school lunches | | 6 | 1 | 4 | 2010-06-28 15:00:51 | 150 | YesMen - WordPress theme | | 7 | 1 | 4 | 2010-06-29 16:30:03 | 240 | on the YesMen WordPress theme | | 8 | 1 | 5 | 2010-06-30 17:32:13 | 240 | Checking out the market, looking at competition and discussions involving successful developers 12pm | | 9 | 1 | 1 | 2010-06-30 07:00:05 | 40 | | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+
Originally, this was the query that I thought would do it:
SELECT * FROM `logs` WHERE `user_id` = '1' GROUP BY `task_id` ORDER BY `when` DESC;
Which gave me:
+--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | log_id | user_id | task_id | when | duration | description | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | 8 | 1 | 5 | 2010-06-30 17:32:13 | 240 | Checking out the market, looking at competition and discussions involving successful developers 12pm | | 6 | 1 | 4 | 2010-06-28 15:00:51 | 150 | YesMen - WordPress theme | | 4 | 1 | 3 | 2010-06-27 14:00:40 | 300 | over 30 serves for school lunches | | 3 | 1 | 2 | 2010-06-26 15:00:44 | 120 | | | 1 | 1 | 1 | 0000-00-00 00:00:00 | 120 | getting logging working at a very basic level | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+
It looks fine, until you notice task_id 1.
“What?” I thought, “That’s the least recent log for task_id 1!”. Then it struck me, “Oh, ORDER BY is performed after GROUP BY.” (Edit)John Haugeland for bringing up a problem in this post. My expectation of valid results was unfounded because no explicit execution order is defined – it may not always (or ever) produce the correct results. In other words, I shouldn’t have expected it to work. Thanks to commenter
So I thought a bit and said “I know!” and this was my next attempt:
SELECT * FROM `logs` GROUP BY `task_id` HAVING `when` = MAX(`when`);
Which yielded:
+--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | log_id | user_id | task_id | when | duration | description | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | 3 | 1 | 2 | 2010-06-26 15:00:44 | 120 | | | 4 | 1 | 3 | 2010-06-27 14:00:40 | 300 | over 30 serves for school lunches | | 8 | 1 | 5 | 2010-06-30 17:32:13 | 240 | Checking out the market, looking at competition and discussions involving successful developers 12pm | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+
Though I was hoping that query would do what I wanted, I had a feeling it would do exactly this. It is similar to the last query except it completely removes the logs that aren’t the most recent. This is just as bad. By this point, I had done some searching and even went through my old RMIT Database Concepts Assignments looking for answers. “What now? Will I have to give in and use a subquery? Oh, please no…” As a final resort, I hopped over to #mysql on freenode (I’m SickAnimations).
Somebody linked me to a post about Groupwise Maximum queries on David North’s blog, which was about exactly the same problem as mine. However, the solution there didn’t to work for me – it gave the same results as my second attempt. Knowing the new term “Groupwise Maximum” and how to use The Google, I came across this little nugget on the official MySQL website: 3.6.4. The Rows Holding the Group-wise Maximum of a Certain Column. Jackpot.
The Solution
As I feared, the solution did indeed require a subquery. Here’s my application of MySQL’s suggestion solution:
SELECT * FROM `logs` outer_logs WHERE `outer_logs`.`user_id` = 1 AND `outer_logs`.`when` = ( SELECT MAX( `inner_logs`.`when` ) FROM LOGS inner_logs WHERE `outer_logs`.`task_id` = `inner_logs`.`task_id` )
Providing the desired output:
+--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | log_id | user_id | task_id | when | duration | description | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+ | 3 | 1 | 2 | 2010-06-26 15:00:44 | 120 | | | 4 | 1 | 3 | 2010-06-27 14:00:40 | 300 | over 30 serves for school lunches | | 7 | 1 | 4 | 2010-06-29 16:30:03 | 240 | on the YesMen WordPress theme | | 8 | 1 | 5 | 2010-06-30 17:32:13 | 240 | Checking out the market, looking at competition and discussions involving successful developers 12pm | | 17 | 1 | 1 | 2010-07-08 06:50:39 | 45 | Finalising transition to Linux development environment | +--------+---------+---------+---------------------+----------+--------------------------------------------------------------------------------------------------------+
Feels mostly good, except the subquery part. Oh well.
Image from mwin.




John Haugeland
That is an *awful* approach. Please don’t blog things you figured out in ignorance; people will attempt to follow this advice. There is no guarantee that your results will be correct, as the order and group by are not defined in execution order with respect to one another.
The appropriate approach is an atomic comparison, which will always give the correct answer regardless of implementation. One such approach – the default in this situation – is a column-wise self join comparing left to right such that right’s value is null (that is, “give me all the rows such that for the comparison there is no valid right side,”) which will always give the appropriate highest value (and in far better time besides.)
Hilariously, the *correct* answer is also provided on the MySQL page you linked. It’s just that when you get bad advice from the MySQL manual, it’s usually MySQL specific.
Please leave the SQL advice to people who understand SQL in the future.
Thanks.
(Groupwise maximums are non-trivial SQL now? Good lord, what’s trivial to you?)
July 9, 2010 @ 10:50 am
Tim
John, I appreciate your feedback and thanks for reading. I don’t claim to be an SQL expert and I thought the tone of this post was informal enough to make that clear. I’m not sure you read the post correctly because the solution (which is very clearly marked) uses neither GROUP BY nor ORDER BY.
I’ll have a look at self-joins on your advice.
In my opinion, this is non-trivial because it requires a subquery or, as you suggested, a column-wise self join. Such queries don’t fall under the ‘trivial’ category, at least not in my mind.
July 9, 2010 @ 2:10 pm
John Haugeland
“I don’t claim to be an SQL expert and I thought the tone of this post was informal enough to make that clear”
Yeah, you’re giving advice in public, then posting it to link sharing sites. If you want to mince words and suggest that that’s something other than you pretending to be an expert, fine; the point remains, you really should not be giving advice.
“I’m not sure you read the post correctly because the solution (which is very clearly marked) uses neither GROUP BY nor ORDER BY.”
I read the post correctly, sir. Perhaps you should re-read your own writing. That criticism was aimed at this verbatim quote:
““What?” I thought, “That’s the least recent log for task_id 1!”. Then it struck me, “Oh, ORDER BY is performed after GROUP BY. Now what do?” “
What I was trying to point out is that you had reacted to undefined behavior not going the way you expected by guessing, incorrectly, that the undefined behavior was just defined some other way. (Note that that’s where you leave the topic, too.)
The answer, of course, is something other than to accuse people who provide legitimate technical solutions to your problem of not having read correctly.
“In my opinion, this is non-trivial because it requires a subquery”
That’s … kind of ridiculous, you know. That’s like saying “this C++ program is non-trivial because it requires a function.”
I mean come on, that’s about as basic as SQL gets.
What I actually asked you, of course, was “what is trivial to you?”
The phrase “non-trivial” is how a trivial programmer makes themselves feel like any limitation they find in their own skillset reflects something deep and difficult about a language.
Let me remind you: you’re looking for a set of rows with high numbers. That’s _as_ _simple_ _as_ _production_ _SQL_ _ever_ _gets_.
That’s the *definition* of trivial SQL.
I’ll ask you again. Please this time focus on what I’m actually asking you.
What, in your mind, constitutes something significantly more trivial than
select * from foo as A left join foo as B on A.id = B.id where A.col < B.col and B.id is null;
I mean, Christ, really? That's non-trivial to you?
Please stop attempting to give advice on ridiculously simple topics in SQL in public, regardless of whether you feel your tone means that people shouldn't be taking you seriously, then explaining why one of the very simplest relational queries possible in the entire language is somehow non-trivial because it's not just a raw select.
Seriously, the solution you gave is dangerously incorrect; it can give extremely wrong results.
You *need* to not try to explain how to do SQL in the future. You're going to screw someone badly.
July 9, 2010 @ 3:12 pm
John Haugeland
Incidentally, your personal blog has been hacked, and currently redirects to a pharmacy website.
July 9, 2010 @ 3:14 pm
Tim
John, the linked and repeated solution in this post doesn’t give extremely incorrect results.
You provided an example query (select * from foo as A left join foo as B on A.id = B.id where A.col < B.col and B.id is null; ) and I am yet to get it to work. To be honest when I examine it I don't see how it would solve this problem, but that could just be my inexperience.
July 9, 2010 @ 5:41 pm
John Haugeland
> John, the linked and repeated solution in this post doesn’t give extremely incorrect results.
Yes, Tim, it does. That you haven’t tried it in any other SQL engine, and that you don’t appear to understand what undefined behavior is, is pretty not-compelling.
Try it in Postgres. Wrong rows. Try it in Oracle. Right rows. Try it in MSSQL. Wrong rows.
Why? Because the query is wrong, and the problem just happens not to fire under MySQL or Oracle because of the order their optimizers give.
It’s wrong, Tim, whether you understand it or not.
> You provided an example query (select * from foo as A left join foo as B on A.id = B.id where A.col To be honest when I examine it I don’t see how it would solve this problem
G’lord.
July 9, 2010 @ 11:52 pm
John Haugeland
> I know my personal WordPress has been hacked but fixing it is not high on my list of priorities. It doesn’t affect my clients.
Good *lord* . You’re distributing malware and feeding a spammer, knowingly, and you don’t intend to clean it up?
Ugh.
I’m done helping you, sir. Enjoy being unable to handle extremely simple topics in SQL while giving advice to the public.
July 9, 2010 @ 11:54 pm
John Haugeland
Also, I just noticed that there’s an injection in your blog theme. If you look at the source for comment six, notice that a less-than which was part of my code quote is making it through unquoted. (Helpfully, a side effect of this is that one sentence got bolded, so it’s easy to find.)
July 12, 2010 @ 3:01 am
Thymn
It seems that admins can insert arbitrary code into their posts but not anonymous users. See, you can write <b> … </b> to make things bold but not write <script> … </script> to write scripts. Or so I think…
July 13, 2010 @ 11:00 pm