Understanding Tricky Joins and Multi-Table Updates in PostgreSQL (Using Boolean Algebra!)

Paul Martinez
Affinity
Published in
14 min readAug 29, 2017

--

Writing a proper SQL UPDATE query involving multiple tables in Postgres can be tricky and counterintuitive. This stems from the fact that when performing an UPDATE, other tables are made available using a FROM clause, instead of the JOIN clause that’s normally used when fetching data from multiple tables in a SELECT statement. These have slightly different semantics, which can lead to confusion when debugging an incorrect UPDATE query. (Note that MySQL does have support for proper joins in UPDATE queries, unlike Postgres.)

A couple of weeks ago we were writing a data migration and we ran into some difficulties that directly resulted from our lack of understanding of the differences between the two types of queries. Before diving into the specific issue we ran into, let’s review how joins work in SQL:

JOIN types

We’ll first create two tables with some sample data and use them to give a quick rundown of the different types of joins.

CREATE TABLE table_1 (id INTEGER, value1 TEXT);
CREATE TABLE table_2 (id INTEGER, value2 TEXT);
INSERT INTO table_1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO table_2 VALUES (1, 'W'), (1, 'X'), (3, 'Y'), (5, 'Z');
SELECT * FROM table_1;
-- id | value1
-- ----+--------
-- 1 | a
-- 2 | b
-- 3 | c
SELECT * FROM table_2;
-- id | value2
-- ----+--------
-- 1 | W
-- 1 | X
-- 3 | Y
-- 5 | Z

The most common syntax for performing a join is T1 <JOIN TYPE> T2 ON <expression>, where T1 and T2 are tables, and expression is the join condition which determines if a row in T1 and a row T2 “match.”

JOIN TYPE can be one of the following (words in square brackets are optional), each generating a different result set:

  • [INNER] JOIN: For each row R1 in T1, the joined table has a row for every row R2 in T2 that satisfies the join condition:
SELECT * FROM table_1 JOIN table_2 ON table_1.id = table_2.id;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 3 | c | 3 | Y
  • LEFT [OUTER] JOIN: First an inner join is performed, then a row is added with NULL values for the columns of T2 for each row R1 in T1 that was not matched with a row in T2. This ensures that every row of T1 will be present in the outputted table:
SELECT * FROM table_1 LEFT JOIN table_2 ON table_1.id = table_2.id;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 2 | b | |
-- 3 | c | 3 | Y
  • RIGHT [OUTER] JOIN: The same as a left outer join, but with the roles of T1 and T2 swapped, ensuring that each row of T2 will be present in the outputted table.
SELECT * FROM table_1 RIGHT JOIN table_2 ON table_1.id = table_2.id;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 3 | c | 3 | Y
-- | | 5 | Z
  • FULL [OUTER] JOIN: The combination of a left outer join and a right outer join. An inner join is performed, then a new row is added for each unmatched row from T1 and T2:
SELECT * FROM table_1 FULL JOIN table_2 ON table_1.id = table_2.id;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 2 | b | |
-- 3 | c | 3 | Y
-- | | 5 | Z
  • CROSS JOIN: Every row in T1 with every row in T2, forming the Cartesian product. When using this join type there is no ON <expression> clause specifying a join condition.
SELECT * FROM table_1 CROSS JOIN table_2;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 1 | a | 3 | Y
-- 1 | a | 5 | Z
-- 2 | b | 1 | W
-- 2 | b | 1 | X
-- 2 | b | 3 | Y
-- 2 | b | 5 | Z
-- 3 | c | 1 | W
-- 3 | c | 1 | X
-- 3 | c | 3 | Y
-- 3 | c | 5 | Z

There is another, less commonly used way to select data from multiple tables. Tables can simply be listed one after another in a comma separated list in the FROM clause, and this is identical to performing a cross join. (At least when only two tables are listed and no filtering happens in a WHERE clause.)

SELECT * FROM table_1, table_2;
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 1 | a | 3 | Y
-- 1 | a | 5 | Z
-- 2 | b | 1 | W
-- 2 | b | 1 | X
-- 2 | b | 3 | Y
-- 2 | b | 5 | Z
-- 3 | c | 1 | W
-- 3 | c | 1 | X
-- 3 | c | 3 | Y
-- 3 | c | 5 | Z

Notice that this is also the result one would get performing a regular join with TRUE as a condition. And going the other way, notice that one can perform an inner join on a condition C by performing a cross join and then using a WHERE clause with condition C. That is, the statements SELECT * FROM T1 INNER JOIN T2 ON C and SELECT * FROM T1 CROSS JOIN T2 WHERE C are equivalent.

Some Surprising Results of Join Logic

Because outer joins are defined in terms of inner joins, generating the result set for a query can involve alternating periods where rows are added and where rows are removed. This can lead to complex behavior when combining an outer join with a WHERE clause. Note that the following two queries are not equivalent:

SELECT * FROM T1 FULL JOIN T2 ON A WHERE B;
SELECT * FROM T1 FULL JOIN T2 ON A AND B;

See the results applied to our example tables:

SELECT * FROM table_1 FULL JOIN table_2 ON table_1.id = table_2.id WHERE value1 = 'a';
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
SELECT * FROM table_1 FULL JOIN table_2 ON table_1.id = table_2.id AND value1 = 'a';
-- id | value1 | id | value2
-- ----+--------+----+--------
-- 1 | a | 1 | W
-- 1 | a | 1 | X
-- 2 | b | |
-- 3 | c | |
-- | | 5 | Z
-- | | 3 | Y

The query of the form ON A WHERE B returns a subset of the results returned by the query of the form ON A AND B. (Is it always a subset? Yes! Proving so is left as an exercise for the reader.) If we walk through the steps to generate the result sets, we can see how the two queries differ.

SELECT * FROM T1 FULL OUTER JOIN T2 ON A WHERE B:"JOIN": Start with cartesian product, T1 X T2
"ON A": REMOVE rows (R1, R2) not matching A
"FULL OUTER": ADD rows (R1, NULL) and (NULL, R2) where R1 and R2 were missing.
"WHERE B": REMOVE rows not matching B
SELECT * FROM T1 FULL OUTER JOIN T2 ON A AND B:"JOIN": Start with cartesian product, T1 X T2
"ON A AND B": REMOVE rows (R1, R2) not matching (A AND B)
"FULL OUTER": ADD rows (R1, NULL) and (NULL, R2) where R1 and R2 were missing.

The first query does two separate filtering passes, separated by the addition of rows due to the outer join, while the second query just does a single filtering pass by both conditions. The filtering pass for condition B occurs at a different time in each query, and operates over different intermediate results, so it’s easy to see how the two queries are not equivalent.

In general, the vast majority of queries only use inner joins and left joins, but it’s crucial to understand their relationship with cross joins as well to understand how UPDATE queries work.

UPDATE queries

Let’s take a closer look at the syntax for UPDATE queries. Here is a stripped down grammar for the most common cases (starting from the Postgres grammar):

UPDATE table_name
SET { column_name = expression } [, ...]
[ FROM from_list ]
WHERE condition

Updating a row based on a row in another table (“performing an update with a join,” in other words), is trickier because you can only specify additional tables in the FROM clause. The Postgres documentation has this to say on the subject:

When a FROM clause is present, what essentially happens is that the target table is joined to the tables mentioned in the from_list, and each output row of the join represents an update operation for the target table. When using FROM you should ensure that the join produces at most one output row for each row to be modified. In other words, a target row shouldn’t join to more than one row from the other table(s). If it does, then only one of the join rows will be used to update the target row, but which one will be used is not readily predictable.

The following query shows how UPDATE queries can have unpredictable results. (The contents of the two tables is included for reference.)

SELECT * FROM table_1;
-- id | value1
-- ----+--------
-- 1 | a
-- 2 | b
-- 3 | c
SELECT * FROM table_2;
-- id | value2
-- ----+--------
-- 1 | W
-- 1 | X
-- 3 | Y
-- 5 | Z
UPDATE table_1 SET value1 = value2
FROM table_2 WHERE table_1.id = table_2.id;
SELECT * FROM table_1;
-- id | value1
-- ----+--------
-- 2 | b
-- 1 | ???
-- 3 | Y

What should the missing value of value1 be? Should it be W or X? It could be either depending on the order of the joined rows constructed by Postgres.

While this is certainly something to watch out for, more commonly we might just want to join in a table to decide which rows we want to update. Suppose we only want to update rows in table_2 that correspond to a table_1 row with value1 = c, or rows in table_2 that don’t correspond to any value in table_1. We can express which rows we want to update easily with a SELECT query:

SELECT table_2.* FROM table_2
LEFT JOIN table_1 ON table_1.id = table_2.id
WHERE value1 = 'c' OR table_1.id IS NULL;
-- id | value2
-- ----+--------
-- 3 | Y
-- 5 | Z

But if we naively try to convert this to an UPDATE query, we get:

UPDATE table_2 SET value2 = 'updated'
FROM table_1
WHERE
(table_1.id = table_2.id AND value1 = 'c') OR
table_1.id IS NULL;
-- UPDATE 1 (??? Shouldn't this be 2?)

Because the FROM clause performs a cross join, there won’t be any rows in the intermediate table with a null table_1.id. This makes this query impossible to represent without resorting to nested queries! More evidence that UPDATE queries involving multiple tables can be quite difficult to write correctly.

Tricky Updates in Practice: Updating Sheet Columns

A recent product update required a data migration that left us scratching our heads trying to figure out why our queries weren’t working. We’ll provide some context about the schema of the table we were modifying, explain the changes we needed to make, then explain how we used boolean algebra to help debug the issue.

In the Affinity web app, one of our main interfaces is a structured spreadsheet. You create new lists of persons or companies through this interface, but you can also use it to view all the people or organizations your company has ever communicated with. In this spreadsheet rows are people or organizations and columns are either custom columns that the the user manually edits, such the Status of a Deal, or data that Affinity automatically calculates and displays, such as the date of the last email correspondence with an entity.

Users can rearrange the order of columns, and choose to show or hide certain columns. These actions are local to the user, so if they hide a column it isn’t hidden for their coworkers, and persistent, so if you log out and log back in you’ll see the columns in the same order as they were originally. Initially we stored this data in the user’s browser, using LocalStorage, but this turned into an unmaintainable nightmare, so we decided to store the column state in our database. We used the following schema:


CREATE TABLE user_sheet_columns (
id INTEGER PRIMARY KEY,
user_id INTEGER REFERENCES users (id) NOT NULL,
list_id INTEGER REFERENCES lists (id),
all_list_type INTEGER,

field_id INTEGER REFERENCES fields (id),
key TEXT,
index INTEGER NOT NULL,
hidden BOOLEAN NOT NULL
);

The user_id column indicates which user’s state we’re storing.

list_id and all_list_type together indicate which page’s appearance we’re saving. If this is data for a list view, then list_id is set and all_list_type is NULL. If this is state for viewing the page of all persons or organizations, then all_list_type is set to 0 or 1 respectively and list_id is NULL. (The lists table has a model_type column that’s also set to 0 or 1 to indicate if it’s a list of persons or companies.)

field_id and key together indicate the data shown in this column. For user populated columns, like Status, field_id points to the record that represents the Status column. For automatically populated data, key will be set to a string constant, like last_email or next_event.

And finally index and hidden are used to store the order of columns and which ones have been hidden by the user.

Therefore my saved view of all the organizations we’ve talked to at Affinity might be represented by the following records:

user_id | list_id | all_list_type | field_id |      key     | index 
--------|---------|---------------|----------|--------------|-------
38707 | NULL | 1 | NULL | 'name' | 0
38707 | NULL | 1 | NULL | 'last_email' | 1
38707 | NULL | 1 | 87 | NULL | 2
38707 | NULL | 1 | 98 | NULL | 3

and my view of our engineering recruiting list might look like this:

user_id | list_id | all_list_type | field_id |      key     | index
--------|---------|---------------|----------|--------------|------
38707 | 109 | NULL | NULL | 'name' | 0
38707 | 109 | NULL | 351 | NULL | 2
38707 | 109 | NULL | NULL | 'last_email' | 1
38707 | 109 | NULL | 318 | NULL | 3

Converting Source of Introduction to a Field

Previously we had displayed Source of Introduction as a value that we had computed ourselves, but we got feedback that this data wasn’t always correct and that users wanted to be able to change this value. This column was originally indicated by having key set to source and field_id set to NULL, but in order to enable manually editing, we had to create two new Fields, one for persons and one for organizations, and update all the appropriate rows in our user_sheet_columns table.

We could perform this migration with two UPDATE queries, one to update the columns on the All Persons page and lists of people, and one to update the columns on the All Organizations page and lists of organizations.

Now, selecting which rows we wanted to update in each case could easily be expressed using a left outer join:

SELECT * FROM user_sheet_columns
LEFT JOIN lists ON list_id = lists.id
WHERE
(all_list_type = 0 || lists.model_type = 0) AND
(key = 'source');

But with UPDATE queries, you can’t use a JOIN expression and instead have to bring in other tables in a FROM clause: UPDATE T1 SET ... FROM T2.... We tried to use the same approach here:

UPDATE user_sheet_columns SET
field_id = 2351, key = NULL
FROM lists
WHERE
-- 'join' condition here:
(list_id = lists.id OR list_id IS NULL) AND
-- only update columns for people views
(
(all_list_type = 0 AND list_id IS NULL) OR
(lists.model_type = 0)
) AND
-- only update source columns
(key = 'source');

But this query didn’t work. For some reason all the rows where list_id was NULL were getting updated, even those with all_list_type set to 1. But it was right there in our query! We joined in the list table appropriately (first condition), then if list_id and lists.model_type is NULL, we only update the row if all_list_type is 0. Clearly something was wrong…

Desperate and frustrated, we used the time honored tradition of throwing out our work and just tried rewriting the query a different way:

UPDATE user_sheet_columns SET
field_id = 2351, key = NULL
FROM lists
WHERE
(
(all_list_type = 0) OR
(list_id = lists.id AND lists.model_type = 0)
) AND
(key = 'source');

Sure enough, this version worked. Clearly this version looks a little simpler, but why did this one work while the other one didn’t?

Recall that the tables are cross joined together when using a FROM clause in an UPDATE query. In the context of our user_sheet_columns query, this means that every user_sheet_columns row was paired with every row in lists, and every combined row that matches the WHERE condition gets updated. But we wrote the condition for the first query as if we were doing a join, relying on certain rows to get filtered out, then others added before doing another filtering pass. If we don’t want to update a row, then we need to make sure it doesn’t end up in the result set.

Let’s use boolean algebra to deliberately break these conditions down and see if we can track down the issue. Both queries have the key = 'source' condition, which we can be pretty sure is correct, so let’s ignore that.

Other than that we have four other equality checks present between the two queries:

Incorrect Query Condition:
WHERE
(list_id = lists.id OR list_id IS NULL) AND
(
(all_list_type = 0 AND list_id IS NULL) OR
(lists.model_type = 0)
)
Correct Query Condition:
WHERE
(all_list_type = 0) OR
(list_id = lists.id AND lists.model_type = 0)

Let’s assign these the letters A, B, C, and D so they’re easier to manipulate.

A: list_id = lists.id
B: list_id IS NULL
C: all_list_type = 0
D: lists.model_type = 0

And substitute these in the actual conditions:

Incorrect Query Condition:
(A OR B) AND
(
(C AND B) OR
(D)
)
Correct Query Condition:
(C) OR
(A AND D)
Incorrect Query: (A ∨ B) ∧ ((C ∧ B) ∨ D)
Correct Query: C ∨ (A ∧ D)

We observed that the rows that were incorrectly getting update were rows with all_list_type = 1 (and so also list_id was set to NULL). This implied that for these rows A was false, B was true, and C was false. This meant the condition for these rows simplified to simply D.

all_list_type = 1 AND list_id IS NULL =>
A = False
B = True
C = False
=> (A ∨ B) ∧ ((C ∧ B) ∨ D)
=== (F ∨ T) ∧ ((F ∧ T) ∨ D)
=== T ∧ (F ∨ D)
=== T ∧ D
=== D

Since each row in user_sheet_columns was paired with every column in lists, then as long as there was a row in lists that satisfied D, then the rows with all_list_type = 1 would be included in the result set. And of course there were lists rows that satisfied D, because a list satisfied D if it was a list of persons, and we sure had plenty of those! So we were updating these rows for the all organizations pages because they had been joined with completely unrelated lists of people.

So how did the correct query avoid this issue? Well, the rows for the all organizations pages had all_list_type = 1, so C was false. So then to be included in the result set A ∧ D had to hold, translated back to the original as (list_id = lists.id) AND (lists.model_type = 0). But since all_list_type was set for these rows, that meant that list_id was NULL, so condition A was false, guaranteeing their exclusion from the result set. Thanks, George Boole.

https://upload.wikimedia.org/wikipedia/commons/c/ce/George_Boole_color.jpg

Fun fact: While looking up how perform updates in various RDBMS dialects, I tried to find the actual SQL standard. It can be found here, for a cool 178 Swiss Francs.

--

--