Comparing changes

Currenly AIO workers process interrupts only via CHECK_FOR_INTERRUPTS, which does not include ConfigReloadPending. Thus we need to check for it explicitly.

Currently an assing hook can perform some preprocessing of a new value, but it cannot change the behavior, which dictates that the new value will be applied immediately after the hook. Certain GUC options (like shared_buffers, coming in subsequent patches) may need coordinating work between backends to change, meaning we cannot apply it right away. Add a new flag "pending" for an assign hook to allow the hook indicate exactly that. If the pending flag is set after the hook, the new value will not be applied and it's handling becomes the hook's implementation responsibility. Note, that this also requires changes in the way how GUCs are getting reported, but the patch does not cover that yet.

Currently WaitForProcSignalBarrier allows to make sure the message sent via EmitProcSignalBarrier was processed by all ProcSignal mechanism participants. Add pss_barrierReceivedGeneration alongside with pss_barrierGeneration, which will be updated when a process has received the message, but not processed it yet. This makes it possible to support a new mode of waiting, when ProcSignal participants want to synchronize message processing. To do that, a participant can wait via WaitForProcSignalBarrierReceived when processing a message, effectively making sure that all processes are going to start processing ProcSignalBarrier simultaneously.

Currently all the work with shared memory is done via a single anonymous memory mapping, which limits ways how the shared memory could be organized. Introduce possibility to allocate multiple shared memory mappings, where a single mapping is associated with a specified shared memory segment. There is only fixed amount of available segments, currently only one main shared memory segment is allocated. A new shared memory API is introduces, extended with a segment as a new parameter. As a path of least resistance, the original API is kept in place, utilizing the main shared memory segment.

Currently the shared memory layout is designed to pack everything tight together, leaving no space between mappings for resizing. Here is how it looks like for one mapping in /proc/$PID/maps, /dev/zero represents the anonymous shared memory we talk about: 00400000-00490000 /path/bin/postgres ... 012d9000-0133e000 [heap] 7f443a800000-7f470a800000 /dev/zero (deleted) 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive 7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34 ... Make the layout more dynamic via splitting every shared memory segment into two parts: * An anonymous file, which actually contains shared memory content. Such an anonymous file is created via memfd_create, it lives in memory, behaves like a regular file and semantically equivalent to an anonymous memory allocated via mmap with MAP_ANONYMOUS. * A reservation mapping, which size is much larger than required shared segment size. This mapping is created with flags PROT_NONE (which makes sure the reserved space is not used), and MAP_NORESERVE (to not count the reserved space against memory limits). The anonymous file is mapped into this reservation mapping. The resulting layout looks like this: 00400000-00490000 /path/bin/postgres ... 3f526000-3f590000 rw-p [heap] 7fbd827fe000-7fbd8bdde000 rw-s /memfd:main (deleted) -- anon file 7fbd8bdde000-7fbe82800000 ---s /memfd:main (deleted) -- reservation 7fbe82800000-7fbe90670000 r--p /usr/lib/locale/locale-archive 7fbe90800000-7fbe90941000 r-xp /usr/lib64/libstdc++.so.6.0.34 To resize a shared memory segment in this layout it's possible to use ftruncate on the anonymous file, adjusting access permissions on the reserved space as needed. This approach also do not impact the actual memory usage as reported by the kernel. Here is the output of /proc/$PID/status for the master version with shared_buffers = 128 MB: // Peak virtual memory size, which is described as total pages // mapped in mm_struct. It corresponds to the mapped reserved space // and is the only number that grows with it. VmPeak: 2043192 kB // Size of memory portions. It contains RssAnon + RssFile + RssShmem VmRSS: 22908 kB // Size of resident anonymous memory RssAnon: 768 kB // Size of resident file mappings RssFile: 10364 kB // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and // shared anonymous mappings) RssShmem: 11776 kB Here is the same for the patch when reserving 20GB of space: VmPeak: 21255824 kB VmRSS: 25020 kB RssAnon: 768 kB RssFile: 10812 kB RssShmem: 13440 kB Cgroup v2 doesn't have any problems with that as well. To verify a new cgroup was created with the memory limit 256 MB, then PostgreSQL was launched withing this cgroup with shared_buffers = 128 MB: $ cd /sys/fs/cgroup $ mkdir postgres $ cd postres $ echo 268435456 > memory.max $ echo $MASTER_PID_SHELL > cgroup.procs # postgres from the master branch has being successfully launched # from that shell $ cat memory.current 17465344 (~16.6 MB) # stop postgres $ echo $PATCH_PID_SHELL > cgroup.procs # postgres from the patch has being successfully launched from that shell $ cat memory.current 20770816 (~19.8 MB) To control the amount of space reserved a new GUC max_available_memory is introduced. Ideally it should be based on the maximum available memory, hense the name. There are also few unrelated advantages of using anon files: * We've got a file descriptor, which could be used for regular file operations (modification, truncation, you name it). * The file could be given a name, which improves readability when it comes to process maps. * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process. The downside is that memfd_create is Linux specific.

Add more shmem segments to split shared buffers into following chunks: * BUFFERS_SHMEM_SEGMENT: contains buffer blocks * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status Size of the corresponding shared data directly depends on NBuffers, meaning that if we would like to change NBuffers, they have to be resized correspondingly. Placing each of them in a separate shmem segment allows to achieve that. There are some asumptions made about each of shmem segments upper size limit. The buffer blocks have the largest, while the rest claim less extra room for resize. Ideally those limits have to be deduced from the maximum allowed shared memory.

Add assing hook for shared_buffers to resize shared memory using space, introduced in the previous commits without requiring PostgreSQL restart. Essentially the implementation is based on two mechanisms: a ProcSignalBarrier is used to make sure all processes are starting the resize procedure simultaneously, and a global Barrier is used to coordinate after that and make sure all finished processes are waiting for others that are in progress. The resize process looks like this: * The GUC assign hook sets a flag to let the Postmaster know that resize was requested. * Postmaster verifies the flag in the event loop, and starts the resize by emitting a ProcSignal barrier. * All processes, that participate in ProcSignal mechanism, begin to process ProcSignal barrier. First a process waits until all processes have confirmed they received the message and can start simultaneously. * Every process recalculates shared memory size based on the new NBuffers, adjusts its size using ftruncate and adjust reservation permissions with mprotect. One elected process signals the postmaster to do the same. * When finished, every process waits on a global ShmemControl barrier, untill all others are finished as well. This way we ensure three stages with clear boundaries: before the resize, when all processes use old NBuffers; during the resize, when processes have mix of old and new NBuffers, and wait until it's done; after the resize, when all processes use new NBuffers. * After all processes are using new value, one of them will initialize new shared structures (buffer blocks, descriptors, etc) as needed and broadcast new value of NBuffers via ShmemControl in shared memory. Other backends are waiting for this operation to finish as well. Then the barrier is lifted and everything goes as usual. Since resizing takes time, we need to take into account that during that time: - New backends can be spawned. They will check status of the barrier early during the bootstrap, and wait until everything is over to work with the new NBuffers value. - Old backends can exit before attempting to resize. Synchronization used between backends relies on ProcSignalBarrier and waits for all participants received the message at the beginning to gather all existing backends. - Some backends might be blocked and not responsing either before or after receiving the message. In the first case such backend still have ProcSignalSlot and should be waited for, in the second case shared barrier will make sure we still waiting for those backends. In any case there is an unbounded wait. - Backends might join barrier in disjoint groups with some time in between. That means that relying only on the shared dynamic barrier is not enough -- it will only synchronize resize procedure withing those groups. That's why we wait first for all participants of ProcSignal mechanism who received the message. Here is how it looks like after raising shared_buffers from 128 MB to 512 MB and calling pg_reload_conf(): -- 128 MB 7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted) 7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted) 7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted) 7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted) 7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted) 7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted) 7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted) 7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted) 7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted) ^ buffers content, ~247 MB 7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted) ^ reserved space, ~2210 MB 7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted) 7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted) -- 512 MB 7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted) 7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted) 7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted) 7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted) 7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted) 7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted) 7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted) 7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted) 7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted) ^ buffers content, ~632 MB 7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted) ^ reserved space, ~1824 MB 7f8877066000-7f887e950000 rw-s /memfd:main (deleted) 7f887e950000-7f8890a00000 ---s /memfd:main (deleted) The implementation supports only increasing of shared_buffers. For decreasing the value a similar procedure is needed. But the buffer blocks with data have to be drained first, so that the actual data set fits into the new smaller space. From experiment it turns out that shared mappings have to be extended separately for each process that uses them. Another rough edge is that a backend blocked on ReadCommand will not apply shared_buffers change until it receives something. Authors: Dmitrii Dolgov, Ashutosh Bapat

When shrinking the shared buffers pool, each buffer in the area being shrunk needs to be flushed if it's dirty so as not to loose the changes to that buffer after shrinking. Also, each such buffer needs to be removed from the buffer mapping table so that backends do not access it after shrinking. Buffer eviction requires a separate barrier phase for two reasons: 1. No other backend should map a new page to any of buffers being evicted when eviction is in progress. So they wait while eviction is in progress. 2. Since a pinned buffer has the pin recorded in the backend local memory as well as the buffer descriptor (which is in shared memory), eviction should not coincide with remapping the shared memory of a backend. Otherwise we might loose consistency of local and shared pinning records. Hence it needs to be carried out in ProcessBarrierShmemResize() and not in AnonymousShmemResize() as indicated by now removed comment. If a buffer being evicted is pinned, we raise a FATAL error but this should improve. There are multiple options 1. to wait for the pinned buffer to get unpinned, 2. the backend is killed or it itself cancels the query or 3. rollback the operation. Note that option 1 and 2 would require the pinning related local and shared records to be accessed. But we need infrastructure to do either of this right now. Ashutosh Bapat

... and BgBufferSync and ClockSweepTick adjustments The commit introduces a separate function StrategyReInitialize() instead of reusing StrategyInitialize() since some of the things that the second one does are not required in the first one. Here's list of what StrategyReInitialize() does and how does it differ from StrategyInitialize(). 1. When expanding the buffer pool add new buffers to the free list. 2. When shrinking buffers, we remove any buffers, in the area being shrunk, from the freelist. While doing so we adjust the first and last free buffer pointers in the StrategyControl area. Hence nothing more needed after resizing. 3. Check the sanity of the free buffer list is added after resizing. 4. StrategyControl pointer needn't be fetched again since it should not change. But added an Assert to make sure the pointer is valid. 5. &StrategyControl->buffer_strategy_lock need not be initialized again. 6. nextVictimBuffer, completePasses and numBufferAllocs are viewed in the context of NBuffers. Now that NBuffers itself has changed, those three do not make sense. Reset them as if the server has restarted again. This commit introduces a flag delay_shmem_resize, which postgresql backends and workers can use to signal the coordinator to delay resizing operation. Background writer sets this flag when its scanning buffers. Background writer is blocked when the actual resizing is in progress. But if resizing is about to begin, it does not scan the buffers by returning from BgBufferSync(). It stops a scan in progress when it sees that the resizing has begun. After the resizing is finished, it adjusts the collected statistics according to the new size of the buffer pool at the end of barrier processing. Once the buffer resizing is finished, before resuming the regular operation, bgwriter resets the information saved so far. This information is viewed in the context of NBuffers and hence does not make sense after NBuffer has changed. Ashutosh Bapat

If the buffer pool has been shrunk, the buffers in the buffer list may not be valid anymore. Modify GetBufferFromRing to check if the buffer is still valid before using it. This makes GetBufferFromRing() a bit more expensive because of additional boolean condition. That may not be expensive enough to affect query performance. The alternative to that is more complex as explained below. The strategy object is created in CurrentMemoryContext and is not available in any global structure thus accessible when processing buffer resizing barriers. We may modify GetAccessStrategy() to register strategy in a global linked list and then arrange to deregister it once it's no more in use. Looking at the places which use GetAccessStrategy(), fixing all those may be some work. Ashutosh Bapat

This branch was automatically generated by a robot using patches from an email thread registered at: https://commitfest.postgresql.org/patch/5319 The branch will be overwritten each time a new patch version is posted to the thread, and also periodically to check for bitrot caused by changes on the master branch. Patch(es): https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2 Author(s): Dmitry Dolgov

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Comparing changes

Open a pull request

Commits on Jun 22, 2025

This comparison is taking too long to generate.

Uh oh!