Tuesday, March 23, 2010

The Dangers Of GroupBy

I had a complex join (three tables) with a group by and a where cause with three conditions. It seemed to be working, until I added more data; I eventually narrowed problem down to just a single table. Consider this query:

SELECT w.user_id, w.some_date, w.id
FROM mytable w
GROUP BY w.user_id

id is the primary key. That query returns all the data:

user_id some_date id
1 2010-02-01 1
5 2010-02-02 2
1 NULL 3
1 2010-03-22 4
1 2010-03-24 5

I'm only interested in the record with the latest date for each user, so I add a MAX on some_date and GROUP BY user_id:

SELECT w.user_id, MAX(w.some_date), w.id
FROM mytable w
GROUP BY w.user_id

That gives:

user_id MAX( w.some_date ) id
1 2010-03-24 1
5 2010-02-02 2

Hang on! The dates are right, but 2010-03-24 is from id=5, not id=1. The some_date column has been max-ed in isolation.

This issue is explained in an excellent two-part article: part1 part2

In fact it is so excellent I didn't fully understand it, and am going to re-read it three times a day until I do. I really need to understand it because, as the author shows, it is very easy to get correct answers from a query on your test data, pass all your unit tests, and deliver something that goes wrong when one more record is added.

My query is simpler than the one in his example, but after many tries I cannot find a solution. The "join(...)as d" syntax does not seem to work in MySQL or I am not using it correctly. It seems what I'm trying to do is very basic, so I do not understand why I cannot find more advice on the subject. I'm open to suggestions!

(For the moment I'm going to give up, do a simpler query and have PHP process the results to get the data I actually want.)


Keith said...

Not sure what you're trying to do, but Ben Forta's little "MySQL Crash Course" book (2005) has been exceptionally popular -- apparently seven reprintings in two and a half years -- because it explains the gotchas of joins, and the like, so clearly and simply.

Anonymous said...

Your ID isn't in the GROUP_BY, so it will be taken randomly (from the first row it encounters). You will need to rewrite your query to actually find the latest date (in a sub-query) and then select data from that record.


darren said...

Thanks Anonymous. The comments on the page you give are also useful, as they appear to introduce a method that is both clearer and more efficient.

(For the moment I'm happy with my PHP solution.)

Thanks for the suggestion Keith. I've just been reading the Amazon reviews and it is described as an excellent beginner book, but doesn't cover anything advanced. (I cannot work out a way to find out if will tell me anything I don't know without actually buying and reading it :-)