So far we've discussed query execution stages, statistics, and the two basic data access methods: Sequential scan and Index scan.
The next item on the list is join methods. This article will remind you what logical join types are out there, and then discuss one of three physical join methods, the Nested loop join. Additionally, we will check out the row memoization feature introduced in PostgreSQL 14.
Joins
Joins are the primary feature of SQL, the foundation that enables its power and agility. Sets of rows (whether pulled directly from a table or formed as a result of an operation) are always joined together in pairs. There are several types of joins.
Inner join. An
INNER JOIN
(or usually justJOIN
) is a subset of two sets that includes all the row pairs from the two original sets that match the join condition. A join condition is what ties together columns from the first row set with columns from the second one. All the columns involved comprise what is known as the join key.If the join condition is an equality operator, as is often the case, then the join is called an equijoin.
A Cartesian product (or a
CROSS JOIN
) of two sets includes all possible combinations of pairs of rows from both sets. This is a specific case of an inner join with a TRUE condition.Outer joins. The
LEFT OUTER JOIN
(or theLEFT JOIN
) supplements the result of an inner join with the rows from the left set which didn't have a corresponding pair in the right set (the contents of the missing right set columns are set to NULLs).The
RIGHT JOIN
works identically, except the joining order is reversed.The
FULL JOIN
is theLEFT JOIN
and theRIGHT JOIN
combined. It includes rows with missing pairs from both sets, as well as theINNER JOIN
result.Semijoins and antijoins. The semijoin is similar to a regular inner join, but with two key differences. Firstly, it includes only the rows from the first set that have a matching pair in the second set. Secondly, it includes the rows from the first set only once, even if a row happens to have multiple matches in the second set.
The antijoin includes only those rows from the first set that didn't have a matching pair in the second set.
SQL does not offer explicit semijoin or antijoin operations, but some expressions (
EXISTS
andNOT EXISTS
, for example) can be used to achieve equivalent results.
All of the above are logical operations. For example, you can represent an inner join as a Cartesian product that retains only the rows that satisfy the join condition. When it comes to hardware, however, there are ways to perform an inner join much more efficiently.
PostgreSQL offers several join methods:
Nested loop join
Hash join
Merge join
Join methods are algorithms that execute the join operations in SQL. They often come in various flavours tailored to specific join types. For example, an inner join that uses the nested loop mode will be represented in a plan with a Nested Loop node, but a left outer join using the same mode will look like a Nested Loop Left Join node in the plan.
Different methods shine in different conditions, and it's the planner's job to select the best one cost-wise.
Nested loop join
The nested loop algorithm is based on two loops: an inner loop within an outer loop. The outer loop searches through all the rows of the first (outer) set. For every such row, the inner loop searches for matching rows in the second (inner) set. Upon discovering a pair that satisfies the join condition, the node immediately returns it to the parent node, and then resumes scanning.
The inner loop cycles as many times as there are rows in the outer set. Therefore, the algorithm efficiency depends on several factors:
Outer set cardinality.
Existence of an efficient access method that fetches the rows from the inner set.
Number of repeat row fetches from the inner set.
Let's look at examples.
Cartesian product
A nested loop join is the most efficient way to calculate a Cartesian product, regardless of the number of rows in both sets.
EXPLAIN SELECT *
FROM aircrafts_data a1
CROSS JOIN aircrafts_data a2
WHERE a2.range > 5000;
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Nested Loop (cost=0.00..2.78 rows=45 width=144)
−> Seq Scan on aircrafts_data a1 outer set
(cost=0.00..1.09 rows=9 width=72)
−> Materialize (cost=0.00..1.14 rows=5 width=72) inner set
−> Seq Scan on aircrafts_data a2
(cost=0.00..1.11 rows=5 width=72)
Filter: (range > 5000)
(7 rows)
The Nested Loop node is where the algorithm is executed. It always has two children: the top one is the outer set, the bottom one is the inner set.
The inner set is represented with a Materialize node in this case. When called, the node stores the output of its child node in RAM (up to work_mem, then starts spilling to disk) and then returns it. Upon further calls the node returns the data from memory, avoiding additional table scans.
An inner join query might generate a similar plan:
EXPLAIN SELECT *
FROM tickets t
JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
WHERE t.ticket_no = '0005432000284';
QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
Nested Loop (cost=0.99..25.05 rows=3 width=136)
−> Index Scan using tickets_pkey on tickets t
(cost=0.43..8.45 rows=1 width=104)
Index Cond: (ticket_no = '0005432000284'::bpchar)
−> Index Scan using ticket_flights_pkey on ticket_flights tf
(cost=0.56..16.57 rows=3 width=32)
Index Cond: (ticket_no = '0005432000284'::bpchar)
(7 rows)
The planner realizes the equivalence and replaces the tf.ticket_no = t.ticket_no
condition with