
Comparing changes

base repository: postgresql-cfbot/postgresql
base: cf/5319~1
...
head repository: postgresql-cfbot/postgresql
compare: cf/5319
  • 11 commits
  • 41 files changed
  • 3 contributors

Commits on Jun 22, 2025

  1. Process config reload in AIO workers

    Currently AIO workers process interrupts only via CHECK_FOR_INTERRUPTS,
    which does not include ConfigReloadPending. Thus we need to check for it
    explicitly.
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    d3c9e06
  2. Introduce pending flag for GUC assign hooks

    Currently an assign hook can perform some preprocessing of a new value,
    but it cannot change the behavior, which dictates that the new value
    will be applied immediately after the hook. Certain GUC options (like
    shared_buffers, coming in subsequent patches) may need coordinated work
    between backends to change, meaning we cannot apply the value right away.
    
    Add a new flag "pending" for an assign hook to allow the hook to indicate
    exactly that. If the pending flag is set after the hook, the new value
    will not be applied and its handling becomes the responsibility of the
    hook's implementation.
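
    A minimal, self-contained sketch of the idea follows. It is not the
    patch's code: the hook signature, struct int_option, defer_hook and
    set_int_option are hypothetical names made up for illustration, and
    the real GUC machinery is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook type: the hook may set *pending to defer the change. */
typedef void (*assign_hook_fn) (int newval, bool *pending);

struct int_option
{
    int            value;   /* currently applied value */
    assign_hook_fn assign;  /* optional assign hook, may be NULL */
};

/* Last value a hook chose to defer; -1 means nothing was deferred. */
static int deferred_value = -1;

/* Example hook: remember the request and ask the caller not to apply it. */
static void
defer_hook(int newval, bool *pending)
{
    deferred_value = newval;
    *pending = true;
}

/* Returns true if the value was applied right away, false if left pending. */
static bool
set_int_option(struct int_option *opt, int newval)
{
    bool pending = false;

    assert(opt != NULL);
    if (opt->assign != NULL)
        opt->assign(newval, &pending);
    if (pending)
        return false;       /* the hook now owns handling of the new value */
    opt->value = newval;
    return true;
}
```

    With defer_hook installed, set_int_option() leaves opt->value untouched
    and only records the request; without a hook the value is applied
    immediately, as before.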
    
    Note that this also requires changes in the way GUCs are reported, but
    the patch does not cover that yet.
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    1aa33c3
  3. Introduce pss_barrierReceivedGeneration

    Currently WaitForProcSignalBarrier makes it possible to ensure that the
    message sent via EmitProcSignalBarrier was processed by all ProcSignal
    mechanism participants.
    
    Add pss_barrierReceivedGeneration alongside pss_barrierGeneration. It
    is updated when a process has received the message but has not
    processed it yet. This makes it possible to support a new mode of
    waiting, in which ProcSignal participants synchronize message
    processing. To do that, a participant can wait via
    WaitForProcSignalBarrierReceived when processing a message, effectively
    making sure that all processes start processing the ProcSignalBarrier
    simultaneously.
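
    The difference between "received" and "processed" can be sketched with
    two per-slot generation counters. This is a simplified model under
    stated assumptions, not the ProcSignal code: struct slot, NUM_SLOTS and
    all function names below are hypothetical, and a real wait would sleep
    on a condition variable rather than poll a predicate.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_SLOTS 4     /* illustrative number of participants */

struct slot
{
    _Atomic uint64_t received_gen;   /* last generation seen */
    _Atomic uint64_t processed_gen;  /* last generation fully processed */
};

static struct slot slots[NUM_SLOTS];
static _Atomic uint64_t global_gen;

/* Emit a new barrier and return its generation. */
static uint64_t
emit_barrier(void)
{
    return atomic_fetch_add(&global_gen, 1) + 1;
}

/* Participant i has seen the barrier but has not processed it yet. */
static void
mark_received(int i)
{
    assert(i >= 0 && i < NUM_SLOTS);
    atomic_store(&slots[i].received_gen, atomic_load(&global_gen));
}

/* Participant i has finished processing the barrier. */
static void
mark_processed(int i)
{
    assert(i >= 0 && i < NUM_SLOTS);
    atomic_store(&slots[i].processed_gen, atomic_load(&global_gen));
}

/* True once every participant has at least seen generation gen. */
static bool
all_received(uint64_t gen)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        if (atomic_load(&slots[i].received_gen) < gen)
            return false;
    return true;
}

/* True once every participant has fully processed generation gen. */
static bool
all_processed(uint64_t gen)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        if (atomic_load(&slots[i].processed_gen) < gen)
            return false;
    return true;
}
```

    A participant that wants a synchronized start would wait until
    all_received() holds before doing its part of the work, while the
    emitter can still wait on all_processed() as today.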
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    0a23961
  4. Allow to use multiple shared memory mappings

    Currently all the work with shared memory is done via a single anonymous
    memory mapping, which limits how the shared memory can be organized.
    
    Introduce the possibility to allocate multiple shared memory mappings, where
    a single mapping is associated with a specified shared memory segment.
    There is only a fixed number of available segments, and currently only one
    main shared memory segment is allocated. A new shared memory API is
    introduced, extended with a segment as a new parameter. As a path of
    least resistance, the original API is kept in place, utilizing the main
    shared memory segment.
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    4903c01
  5. Address space reservation for shared memory

    Currently the shared memory layout is designed to pack everything tightly
    together, leaving no space between mappings for resizing. Here is how it
    looks for one mapping in /proc/$PID/maps, where /dev/zero represents the
    anonymous shared memory in question:
    
        00400000-00490000         /path/bin/postgres
        ...
        012d9000-0133e000         [heap]
        7f443a800000-7f470a800000 /dev/zero (deleted)
        7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
        7f4718400000-7f4718401000 /usr/lib64/libstdc++.so.6.0.34
        ...
    
    Make the layout more dynamic by splitting every shared memory segment
    into two parts:
    
    * An anonymous file, which actually contains the shared memory content.
      Such an anonymous file is created via memfd_create. It lives in memory,
      behaves like a regular file and is semantically equivalent to
      anonymous memory allocated via mmap with MAP_ANONYMOUS.
    
    * A reservation mapping, whose size is much larger than the required
      shared segment size. This mapping is created with the flags PROT_NONE
      (which makes sure the reserved space is not used) and MAP_NORESERVE (so
      that the reserved space is not counted against memory limits). The
      anonymous file is mapped into this reservation mapping.
    
    The resulting layout looks like this:
    
        00400000-00490000         /path/bin/postgres
        ...
        3f526000-3f590000         rw-p [heap]
        7fbd827fe000-7fbd8bdde000 rw-s /memfd:main (deleted) -- anon file
        7fbd8bdde000-7fbe82800000 ---s /memfd:main (deleted) -- reservation
        7fbe82800000-7fbe90670000 r--p /usr/lib/locale/locale-archive
        7fbe90800000-7fbe90941000 r-xp /usr/lib64/libstdc++.so.6.0.34
    
    To resize a shared memory segment in this layout it's possible to use ftruncate
    on the anonymous file, adjusting access permissions on the reserved space as
    needed.
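
    The following Linux-only sketch shows one possible arrangement of the
    two parts in a single process. It is an illustration, not the patch's
    implementation: struct segment, segment_create and segment_grow are
    hypothetical names, sizes are assumed to be multiples of the page size,
    and coordination between processes is ignored entirely.

```c
#define _GNU_SOURCE             /* for memfd_create */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct segment
{
    int    fd;        /* anonymous file created by memfd_create */
    char  *base;      /* start of the reserved address range */
    size_t reserved;  /* size of the reserved address range */
    size_t size;      /* currently usable size */
};

/* Create the file, reserve address space, and open up the first "size" bytes. */
static int
segment_create(struct segment *seg, const char *name,
               size_t size, size_t reserved)
{
    assert(seg != NULL && size <= reserved);
    seg->fd = memfd_create(name, 0);
    if (seg->fd < 0)
        return -1;
    if (ftruncate(seg->fd, (off_t) size) != 0)
        goto fail;
    /* Reserve the whole range: no access, not counted against limits. */
    seg->base = mmap(NULL, reserved, PROT_NONE,
                     MAP_SHARED | MAP_NORESERVE, seg->fd, 0);
    if (seg->base == MAP_FAILED)
        goto fail;
    if (mprotect(seg->base, size, PROT_READ | PROT_WRITE) != 0)
    {
        munmap(seg->base, reserved);
        goto fail;
    }
    seg->reserved = reserved;
    seg->size = size;
    return 0;
fail:
    close(seg->fd);
    return -1;
}

/* Grow the usable part: extend the file, then widen the accessible range. */
static int
segment_grow(struct segment *seg, size_t new_size)
{
    if (new_size < seg->size || new_size > seg->reserved)
        return -1;
    if (ftruncate(seg->fd, (off_t) new_size) != 0)
        return -1;
    if (mprotect(seg->base, new_size, PROT_READ | PROT_WRITE) != 0)
        return -1;
    seg->size = new_size;
    return 0;
}
```

    Because the address range is reserved up front, growing never moves
    seg->base, so pointers into the segment stay valid across a resize.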
    
    This approach also does not impact the actual memory usage as reported by
    the kernel. Here is the output of /proc/$PID/status for the master
    branch with shared_buffers = 128 MB:
    
        // Peak virtual memory size, which is described as total pages
        // mapped in mm_struct. It corresponds to the mapped reserved space
        // and is the only number that grows with it.
        VmPeak:          2043192 kB
        // Size of memory portions. It contains RssAnon + RssFile + RssShmem
        VmRSS:             22908 kB
        // Size of resident anonymous memory
        RssAnon:             768 kB
        // Size of resident file mappings
        RssFile:           10364 kB
        // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
        // shared anonymous mappings)
        RssShmem:          11776 kB
    
    Here is the same for the patch when reserving 20GB of space:
    
        VmPeak:         21255824 kB
        VmRSS:             25020 kB
        RssAnon:             768 kB
        RssFile:           10812 kB
        RssShmem:          13440 kB
    
    Cgroup v2 doesn't have any problems with that either. To verify this, a new
    cgroup was created with a memory limit of 256 MB, then PostgreSQL was
    launched within this cgroup with shared_buffers = 128 MB:
    
        $ cd /sys/fs/cgroup
        $ mkdir postgres
        $ cd postgres
        $ echo 268435456 > memory.max
    
        $ echo $MASTER_PID_SHELL > cgroup.procs
        # postgres from the master branch has been successfully launched
        #  from that shell
        $ cat memory.current
        17465344 (~16.6 MB)
        # stop postgres
    
        $ echo $PATCH_PID_SHELL > cgroup.procs
        # postgres from the patch has been successfully launched from that shell
        $ cat memory.current
        20770816 (~19.8 MB)
    
    To control the amount of space reserved, a new GUC max_available_memory
    is introduced. Ideally it should be based on the maximum available
    memory, hence the name.
    
    There are also a few unrelated advantages of using anon files:
    
    * We've got a file descriptor, which could be used for regular file
      operations (modification, truncation, you name it).
    
    * The file could be given a name, which improves readability when it
      comes to process maps.
    
    * By default, Linux will not add file-backed shared mappings into a core dump,
      making it more convenient to work with them in PostgreSQL: no more huge dumps
      to process.
    
    The downside is that memfd_create is Linux specific.
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    661c0bf
  6. Introduce multiple shmem segments for shared buffers

    Add more shmem segments to split shared buffers into the following chunks:
    * BUFFERS_SHMEM_SEGMENT: contains buffer blocks
    * BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
    * BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
    * CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
    * STRATEGY_SHMEM_SEGMENT: contains buffer strategy status
    
    The size of the corresponding shared data directly depends on NBuffers,
    meaning that if we would like to change NBuffers, these structures have
    to be resized correspondingly. Placing each of them in a separate shmem
    segment makes that possible.
    
    There are some assumptions made about the upper size limit of each shmem
    segment. The buffer blocks have the largest, while the rest claim less
    extra room for resizing. Ideally those limits have to be deduced from the
    maximum allowed shared memory.
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    726f92c
  7. Allow to resize shared memory without restart

    Add an assign hook for shared_buffers to resize shared memory, using the
    space introduced in the previous commits, without requiring a PostgreSQL
    restart. Essentially the implementation is based on two mechanisms: a
    ProcSignalBarrier is used to make sure all processes start the
    resize procedure simultaneously, and a global Barrier is used to
    coordinate after that and make sure all finished processes wait
    for others that are still in progress.
    
    The resize process looks like this:
    
    * The GUC assign hook sets a flag to let the Postmaster know that a resize
      was requested.
    
    * The Postmaster checks the flag in the event loop, and starts the resize
      by emitting a ProcSignal barrier.
    
    * All processes that participate in the ProcSignal mechanism begin to
      process the ProcSignal barrier. First a process waits until all processes
      have confirmed they received the message and can start simultaneously.
    
    * Every process recalculates the shared memory size based on the new
      NBuffers, adjusts its size using ftruncate and adjusts reservation
      permissions with mprotect. One elected process signals the postmaster
      to do the same.
    
    * When finished, every process waits on a global ShmemControl barrier
      until all others are finished as well. This way we ensure three
      stages with clear boundaries: before the resize, when all processes
      use the old NBuffers; during the resize, when processes have a mix of
      old and new NBuffers, and wait until it's done; after the resize, when
      all processes use the new NBuffers.
    
    * After all processes are using the new value, one of them will initialize
      new shared structures (buffer blocks, descriptors, etc.) as needed and
      broadcast the new value of NBuffers via ShmemControl in shared memory.
      Other backends wait for this operation to finish as well. Then
      the barrier is lifted and everything goes on as usual.
    
    Since resizing takes time, we need to take into account that during that time:
    
    - New backends can be spawned. They will check the status of the barrier
      early during the bootstrap, and wait until everything is over to work
      with the new NBuffers value.
    
    - Old backends can exit before attempting to resize. The synchronization
      used between backends relies on ProcSignalBarrier and waits at the
      beginning until all participants have received the message, in order
      to gather all existing backends.
    
    - Some backends might be blocked and not responding either before or
      after receiving the message. In the first case such a backend still
      has a ProcSignalSlot and should be waited for; in the second case the
      shared barrier will make sure we are still waiting for those backends.
      In any case there is an unbounded wait.
    
    - Backends might join the barrier in disjoint groups with some time in
      between. That means that relying only on the shared dynamic barrier is
      not enough -- it will only synchronize the resize procedure within those
      groups. That's why we wait first for all participants of the ProcSignal
      mechanism who received the message.
    
    Here is how it looks after raising shared_buffers from 128 MB to
    512 MB and calling pg_reload_conf():
    
        -- 128 MB
        7f87909fc000-7f8798248000 rw-s /memfd:strategy (deleted)
        7f8798248000-7f879d6ca000 ---s /memfd:strategy (deleted)
        7f879d6ca000-7f87a4e84000 rw-s /memfd:checkpoint (deleted)
        7f87a4e84000-7f87aa398000 ---s /memfd:checkpoint (deleted)
        7f87aa398000-7f87b1b42000 rw-s /memfd:iocv (deleted)
        7f87b1b42000-7f87c3d32000 ---s /memfd:iocv (deleted)
        7f87c3d32000-7f87cb59c000 rw-s /memfd:descriptors (deleted)
        7f87cb59c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
        7f87dd6cc000-7f87ece38000 rw-s /memfd:buffers (deleted)
        ^ buffers content, ~247 MB
        7f87ece38000-7f8877066000 ---s /memfd:buffers (deleted)
        ^ reserved space, ~2210 MB
        7f8877066000-7f887e7d0000 rw-s /memfd:main (deleted)
        7f887e7d0000-7f8890a00000 ---s /memfd:main (deleted)
    
        -- 512 MB
        7f87909fc000-7f879866a000 rw-s /memfd:strategy (deleted)
        7f879866a000-7f879d6ca000 ---s /memfd:strategy (deleted)
        7f879d6ca000-7f87a50f4000 rw-s /memfd:checkpoint (deleted)
        7f87a50f4000-7f87aa398000 ---s /memfd:checkpoint (deleted)
        7f87aa398000-7f87b1d82000 rw-s /memfd:iocv (deleted)
        7f87b1d82000-7f87c3d32000 ---s /memfd:iocv (deleted)
        7f87c3d32000-7f87cba1c000 rw-s /memfd:descriptors (deleted)
        7f87cba1c000-7f87dd6cc000 ---s /memfd:descriptors (deleted)
        7f87dd6cc000-7f8804fb8000 rw-s /memfd:buffers (deleted)
        ^ buffers content, ~632 MB
        7f8804fb8000-7f8877066000 ---s /memfd:buffers (deleted)
        ^ reserved space, ~1824 MB
        7f8877066000-7f887e950000 rw-s /memfd:main (deleted)
        7f887e950000-7f8890a00000 ---s /memfd:main (deleted)
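
    The sizes quoted in the annotations can be re-derived from the address
    ranges. The helper below is a hypothetical convenience, not part of the
    patch; it only parses the leading "start-end" pair of a maps line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Return the size in bytes of the mapping described by one line of
 * /proc/PID/maps, or 0 if the line does not start with "start-end".
 */
static unsigned long long
mapping_size(const char *maps_line)
{
    unsigned long long start;
    unsigned long long end;

    assert(maps_line != NULL);
    if (sscanf(maps_line, "%llx-%llx", &start, &end) != 2 || end < start)
        return 0;
    return end - start;
}
```

    For the buffers mapping this gives 259440640 bytes (about 247.4 MiB)
    before the change and 663666688 bytes (about 632.9 MiB) after it,
    matching the rounded-down figures in the annotations.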
    
    The implementation supports only increasing shared_buffers. Decreasing
    the value needs a similar procedure, but the buffer blocks with data
    have to be drained first, so that the actual data set fits into the new
    smaller space.
    
    Experiments show that shared mappings have to be extended
    separately for each process that uses them. Another rough edge is that a
    backend blocked on ReadCommand will not apply the shared_buffers change
    until it receives something.
    
    Authors: Dmitrii Dolgov, Ashutosh Bapat
    erthalion authored and Commitfest Bot committed Jun 22, 2025
    60045c1
  8. Support shrinking shared buffers

    When shrinking the shared buffers pool, each buffer in the area being
    shrunk needs to be flushed if it's dirty so as not to lose the changes
    to that buffer after shrinking. Also, each such buffer needs to be
    removed from the buffer mapping table so that backends do not access it
    after shrinking.
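
    A toy model of that eviction pass is sketched below. It makes several
    assumptions that are not taken from the patch: struct buf_desc,
    flush_buffer and evict_tail are hypothetical, the mapping table is
    reduced to a "valid" flag, and a pinned buffer is reported with a
    return code instead of a FATAL error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf_desc
{
    bool valid;      /* still present in the (simulated) mapping table */
    bool dirty;      /* has changes that were not written out */
    int  pin_count;  /* number of users currently holding the buffer */
};

/* Number of simulated writes, so that callers can observe flushes. */
static int flushed_count;

/* Stand-in for writing a dirty buffer out. */
static void
flush_buffer(struct buf_desc *buf)
{
    buf->dirty = false;
    flushed_count++;
}

/*
 * Evict buffers in [new_n, old_n). Returns 0 on success, or -1 without
 * touching anything if one of those buffers is pinned.
 */
static int
evict_tail(struct buf_desc *bufs, int old_n, int new_n)
{
    assert(bufs != NULL && new_n >= 0 && new_n <= old_n);

    /* Check pins first so that a failure leaves all buffers untouched. */
    for (int i = new_n; i < old_n; i++)
        if (bufs[i].pin_count > 0)
            return -1;

    for (int i = new_n; i < old_n; i++)
    {
        if (!bufs[i].valid)
            continue;
        if (bufs[i].dirty)
            flush_buffer(&bufs[i]);   /* do not lose pending changes */
        bufs[i].valid = false;        /* no longer reachable by lookups */
    }
    return 0;
}
```

    Buffers below new_n are never touched, which is why a dirty buffer in
    the surviving part of the pool stays dirty after the call.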
    
    Buffer eviction requires a separate barrier phase for two reasons:
    
    1. No other backend should map a new page to any of the buffers being
       evicted when eviction is in progress. So they wait while eviction is
       in progress.
    
    2. Since a pinned buffer has the pin recorded in the backend local
       memory as well as the buffer descriptor (which is in shared memory),
       eviction should not coincide with remapping the shared memory of a
       backend. Otherwise we might lose consistency of local and shared
       pinning records. Hence it needs to be carried out in
       ProcessBarrierShmemResize() and not in AnonymousShmemResize() as
       indicated by the now-removed comment.
    
    If a buffer being evicted is pinned, we raise a FATAL error, but this should
    be improved. There are multiple options: 1. wait for the pinned buffer to get
    unpinned, 2. have the backend killed or let it cancel the query itself, or
    3. roll back the operation. Note that options 1 and 2 would require the
    pinning-related local and shared records to be accessed, and we lack the
    infrastructure to do either of these right now.
    
    Ashutosh Bapat
    ashutosh-bapat authored and Commitfest Bot committed Jun 22, 2025
    1c52dfe
  9. Reinitialize StrategyControl after resizing buffers

    ... and BgBufferSync and ClockSweepTick adjustments
    
    The commit introduces a separate function StrategyReInitialize() instead
    of reusing StrategyInitialize(), since some of the things that the latter
    does are not required in the former. Here's a list of what
    StrategyReInitialize() does and how it differs from
    StrategyInitialize().
    
    1. When expanding the buffer pool, add new buffers to the free list.
    2. When shrinking buffers, remove any buffers in the area being
       shrunk from the freelist. While doing so we adjust the first and
       last free buffer pointers in the StrategyControl area. Hence nothing
       more is needed after resizing.
    3. A sanity check of the free buffer list is added after resizing.
    4. The StrategyControl pointer needn't be fetched again since it should
       not change, but an Assert is added to make sure the pointer is valid.
    5. &StrategyControl->buffer_strategy_lock need not be initialized again.
    6. nextVictimBuffer, completePasses and numBufferAllocs are viewed in
       the context of NBuffers. Now that NBuffers itself has changed, those
       three do not make sense. Reset them as if the server had been
       restarted.
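
    Items 1 and 2 amount to ordinary linked-list maintenance. The sketch
    below models the free list as an index-linked list with first and last
    pointers; struct freelist, FREENEXT_END and both functions are
    hypothetical names chosen for illustration and are not the patch's
    code.

```c
#include <assert.h>
#include <stddef.h>

#define FREENEXT_END (-1)   /* end-of-list marker (hypothetical name) */

struct freelist
{
    int  first;   /* first free buffer, or FREENEXT_END if empty */
    int  last;    /* last free buffer, or FREENEXT_END if empty */
    int *next;    /* next[i] follows buffer i; valid only for listed buffers */
};

/* Drop every buffer with index >= new_n, fixing up first and last. */
static void
freelist_shrink(struct freelist *fl, int new_n)
{
    int prev = FREENEXT_END;
    int cur;

    assert(fl != NULL && fl->next != NULL);
    cur = fl->first;
    fl->first = FREENEXT_END;
    fl->last = FREENEXT_END;
    while (cur != FREENEXT_END)
    {
        int following = fl->next[cur];

        if (cur < new_n)
        {
            /* Keep this buffer: relink it behind the last kept one. */
            if (prev == FREENEXT_END)
                fl->first = cur;
            else
                fl->next[prev] = cur;
            fl->next[cur] = FREENEXT_END;
            fl->last = cur;
            prev = cur;
        }
        cur = following;
    }
}

/* Append the new buffers [old_n, new_n) to the tail of the list. */
static void
freelist_expand(struct freelist *fl, int old_n, int new_n)
{
    assert(fl != NULL && fl->next != NULL);
    for (int i = old_n; i < new_n; i++)
    {
        fl->next[i] = FREENEXT_END;
        if (fl->last == FREENEXT_END)
            fl->first = i;
        else
            fl->next[fl->last] = i;
        fl->last = i;
    }
}
```

    After freelist_shrink() the list contains only surviving buffers and
    its first and last pointers are already consistent, which is the sense
    in which nothing more is needed after resizing.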
    
    This commit introduces a flag delay_shmem_resize, which PostgreSQL
    backends and workers can use to signal the coordinator to delay the
    resizing operation. The background writer sets this flag when it's
    scanning buffers. The background writer is blocked while the actual
    resizing is in progress. But if resizing is about to begin, it skips
    scanning the buffers by returning early from BgBufferSync(). It stops a
    scan in progress when it sees that the resizing has begun. After the
    resizing is finished, it adjusts the collected statistics according to
    the new size of the buffer pool at the end of barrier processing.
    
    Once the buffer resizing is finished, before resuming regular
    operation, bgwriter resets the information saved so far. This
    information is viewed in the context of NBuffers and hence does not make
    sense after NBuffers has changed.
    
    Ashutosh Bapat
    ashutosh-bapat authored and Commitfest Bot committed Jun 22, 2025
    3afe31b
  10. Additional validation for buffer in the ring

    If the buffer pool has been shrunk, the buffers in the buffer list may
    not be valid anymore. Modify GetBufferFromRing() to check if the buffer is
    still valid before using it. This makes GetBufferFromRing() a bit more
    expensive because of an additional boolean condition. That may not be
    expensive enough to affect query performance. The alternative to that is
    more complex, as explained below.
    
    The strategy object is created in CurrentMemoryContext and is not
    available in any global structure, and thus is not accessible when
    processing buffer resizing barriers. We may modify GetAccessStrategy() to
    register the strategy in a global linked list and then arrange to
    deregister it once it's no longer in use. Looking at the places which use
    GetAccessStrategy(), fixing all of them may be some work.
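
    The cheaper option, the extra validity check, can be pictured as
    follows. This is a sketch under assumptions rather than the patch's
    change: struct ring and ring_current_buffer are hypothetical, and the
    only convention borrowed from PostgreSQL is that buffer numbers are
    1-based with 0 meaning "no buffer".

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_BUFFER 0    /* buffer numbers are 1-based in this sketch */
#define RING_SLOTS 8        /* illustrative ring capacity */

struct ring
{
    int size;                 /* number of slots in use */
    int current;              /* slot to hand out next */
    int buffers[RING_SLOTS];  /* remembered buffer numbers */
};

/*
 * Return the buffer in the current slot if it still lies inside the pool.
 * A buffer beyond nbuffers is stale after a shrink: forget it and report
 * INVALID_BUFFER so that the caller picks a fresh one.
 */
static int
ring_current_buffer(struct ring *r, int nbuffers)
{
    int buf;

    assert(r != NULL && r->current >= 0 && r->current < r->size);
    buf = r->buffers[r->current];
    if (buf == INVALID_BUFFER)
        return INVALID_BUFFER;
    if (buf > nbuffers)
    {
        r->buffers[r->current] = INVALID_BUFFER;
        return INVALID_BUFFER;
    }
    return buf;
}
```

    The added cost is the single comparison against nbuffers on each
    lookup, which is the "additional boolean condition" mentioned above.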
    
    Ashutosh Bapat
    ashutosh-bapat authored and Commitfest Bot committed Jun 22, 2025
    bd061ef
  11. [CF 5319] v5 - Changing shared_buffers without restart

    This branch was automatically generated by a robot using patches from an
    email thread registered at:
    
    https://commitfest.postgresql.org/patch/5319
    
    The branch will be overwritten each time a new patch version is posted to
    the thread, and also periodically to check for bitrot caused by changes
    on the master branch.
    
    Patch(es): https://www.postgresql.org/message-id/my4hukmejato53ef465ev7lk3sqiqvneh7436rz64wmtc7rbfj@hmuxsf2ngov2
    Author(s): Dmitry Dolgov
    Commitfest Bot committed Jun 22, 2025
    227ea84