Preamble
Let’s imagine the following situation: you have several users registered in your system and you want to gather information about them from some API (i.e. Facebook)
You store information in such simple structure:
CREATE TABLE public.users ( id bigserial NOT NULL, registered_at NOT NULL DEFAULT current_timestamp, PRIMARY KEY (id) ) CREATE TABLE public.user_requests ( id bigserial NOT NULL, user_id bigint, result boolean NOT NULL DEFAULT FALSE, requested_at timestamp with time zone NOT NULL DEFAULT current_timestamp, PRIMARY KEY (id), CONSTRAINT request_user_fk FOREIGN KEY (user_id) REFERENCES public.users (id) )
Prerequisites:
- If the last request was successful — then you should stop sending requests
- If user is registered more than 7 days ago — you should stop sending requests too
- If request failed — you want to store reason in «message column»
Now we need to gather list of users on which we should send the information request
The first try
The first thing I’ve tried was grouping and query like this:
SELECT latest.user_id FROM users LEFT JOIN ( SELECT user_id, result, max(requested_at) FROM user_requests GROUP BY user_id ) latest ON latest.user_id=users.id WHERE users.registered_at+'7 days'::interval >= current_timestamp;
Of course, this just won’t work, because the result isn’t in the group by and isn’t aggregated.
So we need to go a bit deeper and join a bit more things.
The second try
Well, let’s continue with joins:
SELECT users.id FROM users LEFT JOIN ( SELECT user_requests.user_id, user_requests.result FROM user_requests JOIN (SELECT user_id, max(requested_at) latest_request_time FROM user_requests GROUP BY user_id) latest ON latest.user_id = user_requests.user_id AND user_requests.requested_at = latest.latest_request_time ) results ON results.user_id = users.id WHERE users.registered_at + '7 days' :: INTERVAL >= CURRENT_TIMESTAMP AND (results.result IS NULL OR results.result = FALSE)
The Nice thing here: it works and works correctly. Bad thing: it works awfully.
Of course, we should create several indexes to improve performance:
CREATE INDEX user_registered_idx ON public.users USING btree (registered_at); CREATE INDEX user_id_idx ON public.user_requests USING btree (user_id); CREATE INDEX latest_request_idx ON public.user_requests USING btree (user_id, requested_at DESC NULLS LAST);
To be honest, I thought that it will be enough to create these indices. But… I was wrong.
Anyway, there will be full table scan on the user_requests table, because database needs to find all user_ids and only then find max latest request dates for them. It will need to list all the index from beginning and almost to the end. I was ready to surrender, but then I remembered about LATERAL
The third try: LATERAL
One can start reading about LATERAL subqueries here
Telling long things short, LATERAL allows us to use tables from FROM section of our query in the subquery.
SELECT users.id FROM users LEFT JOIN LATERAL ( SELECT user_id, result FROM user_requests WHERE user_id = users.id ORDER BY requested_at DESC NULLS LAST LIMIT 1 ) latest ON TRUE WHERE (latest.result IS NULL OR latest.result = FALSE) AND users.registered_at >= current_timestamp - '7 days' :: INTERVAL
Our query became muuuuch shorter and more readable. Also, it doesn’t perform full table scans, it only uses indices.
Here is a couple of queues on how it works:
- It finds users registered last 7 days (by index)
- For each user found in step 1 it tries to find him in user_requests
- If he isn’t found — then we get him (didn’t make any requests on him yet)
- If he’s found (in index!) — then we only need to take the first item from index because our index is sorted desc
On different datasets you will get different results, but you shouldn’t get seq scans in any sensible cases.
Lessons learned
Next time when I’ll think that I need complex aggregates — I will remember about LATERAL subqueries, which can be much faster and readable