Night studies: LATERAL

Preamble

Let’s imagine the following situation: you have several users registered in your system and you want to gather information about them from some API (i.e. Facebook)

You store information in such simple structure:

CREATE TABLE public.users ( 
  id bigserial NOT NULL, 
  registered_at NOT NULL DEFAULT current_timestamp, 
  PRIMARY KEY (id) ) 
CREATE TABLE public.user_requests ( 
  id bigserial NOT NULL, 
  user_id bigint, 
  result boolean NOT NULL DEFAULT FALSE, 
  requested_at timestamp with time zone NOT NULL DEFAULT current_timestamp, 
  PRIMARY KEY (id), 
  CONSTRAINT request_user_fk FOREIGN KEY (user_id) REFERENCES public.users (id) 
)

Prerequisites:

If the last request was successful — then you should stop sending requests
If user is registered more than 7 days ago — you should stop sending requests too
If request failed — you want to store reason in «message column»

Now we need to gather list of users on which we should send the information request

The first try

The first thing I’ve tried was grouping and query like this:

SELECT latest.user_id
FROM users
LEFT JOIN (
  SELECT user_id, result, max(requested_at)
    FROM user_requests
    GROUP BY user_id
) latest ON latest.user_id=users.id
WHERE users.registered_at+'7 days'::interval >= current_timestamp;

Of course, this just won’t work, because the result isn’t in the group by and isn’t aggregated.

So we need to go a bit deeper and join a bit more things.

The second try

Well, let’s continue with joins:

SELECT users.id
FROM users
  LEFT JOIN (
              SELECT
                user_requests.user_id,
                user_requests.result
              FROM user_requests
                JOIN
                (SELECT
                   user_id,
                   max(requested_at) latest_request_time
                 FROM user_requests
                 GROUP BY user_id) latest
                  ON latest.user_id = user_requests.user_id AND user_requests.requested_at = latest.latest_request_time
            ) results ON results.user_id = users.id
WHERE users.registered_at + '7 days' :: INTERVAL >= CURRENT_TIMESTAMP
      AND (results.result IS NULL OR results.result = FALSE)

The Nice thing here: it works and works correctly. Bad thing: it works awfully.

Of course, we should create several indexes to improve performance:

CREATE INDEX user_registered_idx ON public.users USING btree (registered_at);
CREATE INDEX user_id_idx ON public.user_requests USING btree (user_id);
CREATE INDEX latest_request_idx ON public.user_requests USING btree (user_id, requested_at DESC NULLS LAST);

To be honest, I thought that it will be enough to create these indices. But… I was wrong.

Anyway, there will be full table scan on the user_requests table, because database needs to find all user_ids and only then find max latest request dates for them. It will need to list all the index from beginning and almost to the end. I was ready to surrender, but then I remembered about LATERAL

The third try: LATERAL

One can start reading about LATERAL subqueries here

Telling long things short, LATERAL allows us to use tables from FROM section of our query in the subquery.

SELECT users.id
FROM users
  LEFT JOIN LATERAL (
            SELECT
              user_id,
              result
            FROM user_requests
            WHERE user_id = users.id
            ORDER BY requested_at DESC NULLS LAST
            LIMIT 1
            ) latest ON TRUE
WHERE
  (latest.result IS NULL OR latest.result = FALSE) AND
  users.registered_at >= current_timestamp - '7 days' :: INTERVAL

Our query became muuuuch shorter and more readable. Also, it doesn’t perform full table scans, it only uses indices.

Here is a couple of queues on how it works:

It finds users registered last 7 days (by index)
For each user found in step 1 it tries to find him in user_requests
1. If he isn’t found — then we get him (didn’t make any requests on him yet)
2. If he’s found (in index!) — then we only need to take the first item from index because our index is sorted desc

On different datasets you will get different results, but you shouldn’t get seq scans in any sensible cases.

Lessons learned

Next time when I’ll think that I need complex aggregates — I will remember about LATERAL subqueries, which can be much faster and readable