Deleting index tuples during VACUUM
-----------------------------------
-Before deleting a leaf item, we get a super-exclusive lock on the target
+Before deleting a leaf item, we get a full cleanup lock on the target
page, so that no other backend has a pin on the page when the deletion
starts. This is not necessary for correctness in terms of the btree index
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
-provide an interlock between VACUUM and indexscans. Since VACUUM deletes
-index entries before reclaiming heap tuple line pointers, the
-super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
-line pointer that an indexscanning process might be about to visit. This
-guarantee works only for simple indexscans that visit the heap in sync
-with the index scan, not for bitmap scans. We only need the guarantee
-when using non-MVCC snapshot rules; when using an MVCC snapshot, it
-doesn't matter if the heap tuple is replaced with an unrelated tuple at
-the same TID, because the new tuple won't be visible to our scan anyway.
-Therefore, a scan using an MVCC snapshot which has no other confounding
-factors will not hold the pin after the page contents are read. The
-current reasons for exceptions, where a pin is still needed, are if the
-index is not WAL-logged or if the scan is an index-only scan. If later
-work allows the pin to be dropped for all cases we will be able to
-simplify the vacuum code, since the concept of a super-exclusive lock
-for btree indexes will no longer be needed.
+provide an interlock between VACUUM and index scans that are not prepared
+to deal with concurrent TID recycling when visiting the heap. Since only
+VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since
+this only ever happens _after_ btbulkdelete returns, having index scans
+hold on to the pin (acquired when reading the leaf page) until _after_
+they're done visiting the heap (for TIDs read from the pinned leaf page)
+prevents concurrent TID recycling. VACUUM cannot acquire a conflicting
+cleanup lock until the index scan is finished processing its leaf page.
+
+This approach is fairly coarse, so we avoid it whenever possible. In
+practice most index scans won't hold onto their pin, and so won't block
+VACUUM. These index scans must deal with TID recycling directly, which is
+more complicated and not always possible. See later section on making
+concurrent TID recycling safe.
+
+Opportunistic index tuple deletion performs almost the same page-level
+modifications while only holding an exclusive lock. This is safe because
+there is no question of TID recycling taking place later on -- only VACUUM
+can make TIDs recyclable. See also simple deletion and bottom-up
+deletion, below.
Because a pin is not always held, and a page can be split even while
someone does hold a pin on it, it is possible that an indexscan will
return items that are no longer stored on the page it has a pin on, but
rather somewhere to the right of that page. To ensure that VACUUM can't
-prematurely remove such heap tuples, we require btbulkdelete to obtain a
-super-exclusive lock on every leaf page in the index, even pages that
-don't contain any deletable tuples. Any scan which could yield incorrect
-results if the tuple at a TID matching the scan's range and filter
-conditions were replaced by a different tuple while the scan is in
-progress must hold the pin on each index page until all index entries read
-from the page have been processed. This guarantees that the btbulkdelete
-call cannot return while any indexscan is still holding a copy of a
-deleted index tuple if the scan could be confused by that. Note that this
-requirement does not say that btbulkdelete must visit the pages in any
-particular order. (See also simple deletion and bottom-up deletion,
-below.)
-
-There is no such interlocking for deletion of items in internal pages,
-since backends keep no lock nor pin on a page they have descended past.
-Hence, when a backend is ascending the tree using its stack, it must
-be prepared for the possibility that the item it wants is to the left of
-the recorded position (but it can't have moved left out of the recorded
-page). Since we hold a lock on the lower page (per L&Y) until we have
-re-found the parent item that links to it, we can be assured that the
-parent item does still exist and can't have been deleted.
+prematurely make TIDs recyclable in this scenario, we require btbulkdelete
+to obtain a cleanup lock on every leaf page in the index, even pages that
+don't contain any deletable tuples. Note that this requirement does not
+say that btbulkdelete must visit the pages in any particular order.
VACUUM's linear scan, concurrent page splits
--------------------------------------------
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+Making concurrent TID recycling safe
+------------------------------------
+
+As explained in the earlier section about deleting index tuples during
+VACUUM, we implement a locking protocol that allows individual index scans
+to avoid concurrent TID recycling. Index scans opt-out (and so drop their
+leaf page pin when visiting the heap) whenever it's safe to do so, though.
+Dropping the pin early is useful because it avoids blocking progress by
+VACUUM. This is particularly important with index scans used by cursors,
+since idle cursors sometimes stop for relatively long periods of time. In
+extreme cases, a client application may hold on to an idle cursor for
+hours or even days. Blocking VACUUM for that long could be disastrous.
+
+Index scans that don't hold on to a buffer pin are protected by holding an
+MVCC snapshot instead. This more limited interlock prevents wrong answers
+to queries, but it does not prevent concurrent TID recycling itself (only
+holding onto the leaf page pin while accessing the heap ensures that).
+
+Index-only scans can never drop their buffer pin, since they are unable
+to tolerate having a referenced TID become recyclable. Index-only scans
+typically consult only the visibility map (not the heap proper), and so
+cannot reliably notice that a stale TID reference (one that originally
+pointed to a dead-to-all heap item) was concurrently marked LP_UNUSED in
+the heap by VACUUM. VACUUM could easily set the whole heap page to
+all-visible in the visibility map immediately afterwards, at which point
+an index-only scan would wrongly treat the stale TID as pointing to a
+visible tuple. An MVCC snapshot is only sufficient to avoid problems
+during plain index scans because they must access granular visibility
+information from the heap proper. A plain index scan will even recognize
+LP_UNUSED items in the heap (items that could be recycled but haven't
+been just yet) as "not visible" -- even when the heap page is generally
+considered all-visible.
+
+LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+(described in full in simple deletion, below) is also more complicated for
+index scans that drop their leaf page pins. We must be careful to avoid
+LP_DEAD-marking any new index tuple that looks like a known-dead index
+tuple because it happens to share the same TID, following concurrent TID
+recycling. It's just about possible that some other session inserted a
+new, unrelated index tuple, on the same leaf page, which has the same
+original TID. It would be totally wrong to LP_DEAD-set this new,
+unrelated index tuple.
+
+We handle this kill_prior_tuple race condition by having affected index
+scans conservatively assume that any change to the leaf page at all
+implies that the page was reached by btbulkdelete in the interim period
+when no buffer pin was held. This is implemented by not setting any
+LP_DEAD bits on the leaf page at all when the page's LSN has changed.
+with an unlogged index, so for now we don't ever apply the "don't hold
+onto pin" optimization there.)
+
Fastpath For Index Insertion
----------------------------
groups of dead TIDs from posting list tuples (without the situation ever
being allowed to get out of hand).
-It's sufficient to have an exclusive lock on the index page, not a
-super-exclusive lock, to do deletion of LP_DEAD items. It might seem
-that this breaks the interlock between VACUUM and indexscans, but that is
-not so: as long as an indexscanning process has a pin on the page where
-the index item used to be, VACUUM cannot complete its btbulkdelete scan
-and so cannot remove the heap tuple. This is another reason why
-btbulkdelete has to get a super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
-
-LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
-it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
-drops its pin in the meantime. It must conservatively also remember the
-LSN of the page, and only act to set LP_DEAD bits when the LSN has not
-changed at all. (Avoiding dropping the pin entirely also makes it safe, of
-course.)
-
Bottom-Up deletion
------------------
changes state into a normally running server.
The interlocking required to avoid returning incorrect results from
-non-MVCC scans is not required on standby nodes. We still get a
-super-exclusive lock ("cleanup lock") when replaying VACUUM records
-during recovery, but recovery does not need to lock every leaf page
-(only those leaf pages that have items to delete). That is safe because
-HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
-HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
-used during write transactions, which cannot exist on the standby. MVCC
-scans are already protected by definition, so HeapTupleSatisfiesMVCC()
-is not a problem. The optimizer looks at the boundaries of value ranges
-using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
-is also safe. That leaves concern only for HeapTupleSatisfiesToast().
-
-HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
-because it doesn't need to - if the main heap row is visible then the
-toast rows will also be visible. So as long as we follow a toast
-pointer from a visible (live) tuple the corresponding toast rows
-will also be visible, so we do not need to recheck MVCC on them.
+non-MVCC scans is not required on standby nodes. We still get a full
+cleanup lock when replaying VACUUM records during recovery, but recovery
+does not need to lock every leaf page (only those leaf pages that have
+items to delete) -- that's sufficient to avoid breaking index-only scans
+during recovery (see section above about making TID recycling safe). That
+leaves concern only for plain index scans. (XXX: Not actually clear why
+this is totally unnecessary during recovery.)
+
+Plain index scans using an MVCC snapshot are always safe, for the same
+reasons that they're safe during original execution.
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's only
+because it doesn't need to: if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast pointer
+from a visible (live) tuple, the corresponding toast rows will also be
+visible, and we do not need to recheck MVCC on them.
Other Things That Are Handy to Know
-----------------------------------