Postgres Structure and Query Deep Dive

In an effort to understand Postgres better I’ve worked my way through how a Postgres database is structured and executes queries.

Process Model

Postgres uses a process-per-worker model. That means every connection is handled by a separate operating system process forked off the initial process.

Subsystem Overview

Postgres consists of several components that all work together to process data and ensure it’s reliably written and can be looked up efficiently. The following are the most important components, but there more, other components that are beyond the scope of this post.

Postmaster

This is the first component started during server startup. It validates the configuration and data directory and then allocates shared memory that is shared among all the postgres processes, including the shared_buffers, WAL buffers, and lock tables. It loads shared libraries for extensions, starts startup tasks (checkpointer, bgwriter, wal writer, autovacuum launcher, and if needed, startup/recovery process), and then starts listening for connections. For each connection it will start a new backend via fork. It also launches other processes on demand via fork.

Postgres uses memory mapped virtual memory to share cache and other structures among all its processes. This memory is allocated on startup as a continuous slab of virtual memory of fixed size and contains the shared buffer pool, the WAL buffer, and a few other regions. Child processes automatically inherit these memory mapping after the fork.

Backends

Backends serve individual connections and are forked off the postmaster. They inherit the shared memory regions via copy-on-write pages. This is how all postgres processes share buffers and caches. Backends accept queries, parse them, plan them, and execute them. They also read pages that are not in the shared buffers, flush dirty pages to disk, and they write WAL buffers to disk.

Background Writer

The bgwriter is a background process that periodically flushes dirty pages from shared buffers to disk. By having a separate process that writes dirty pages Postgres relieves the backends from having to perform additional IO.

Autovacuum Launcher & Worker

The autovacuum launcher starts autovacuum workers that perform VACUUM and ANALYZE commands to clean up dead tuples and update statistics.

WAL Writer

https://www.postgresql.org/docs/current/runtime-config-wal.html#GUC-WAL-WRITER-DELAY

flushes (in-memory) WAL to disk every wal_writer_delay=200ms, mostly for synchronous_commit=off or large transactions that fill WAL buffers

Other Subsystems

There are other additional subsystems which I might explore another time.

  • logical replication worker: applies changes from a logical replication subscription
  • walsender: sends WAL entries to other machines
  • walreceiver: accepts WAL entries from other machines
  • archiver: moves WAL segments to archive; not sure yet what this is for
  • syslogger: manages and writes logs
  • walsummarizer: (pg17+) for incremental backup support?

A New Connection

Before we can submit a query a client has to establish a session with Postgres. After connecting, the postmaster, the component in Postgres that handles incoming connections, will handle authentication and then fork a new process called a backend to handle that new session. This new process exclusively handles the new session for its lifetime.

From the parent process the new backend inherits several memory ranges as copy-on-write pages:

  • the shared buffers
  • the WAL buffers
  • a few others, including lock tables

The backend worker locally allocates memory as needed to perform query related work. Most important here is likely work_mem which controls how much memory can be allocated for individual plan nodes while executing a query. That means a single query can allocate multiple times work_mem. If an operation exceeds work_mem extraneous bytes are spilled to disk which is slow.

Query Arrives

When we submit a query at a Postgres backend, it is processed in several stages.

  1. Parser Stage: the query is parsed into a raw parse tree
  2. Semantic Analysis: identifiers such as tables and columns are resolved
  3. Rewrites: rules are applied, which appears to be triggers and some other manually created rules.
  4. Planner operates in stages:
    1. flattens queries, simplifies constant expressions, canonicalize where expressions
    2. find all access paths to relations and how to access them (seq scan, index scan, bitmap scan), build join paths (nested loop, hash join, merge join), and estimates cost of each path using pg_statistic
    3. using the cheapest path and build plan by adding necessary sort/material (place all tuples from a relation in memory, e.g. in CTE)/gather nodes (collect results from parallel workers)
  5. Execute the plan recursively from top to bottom. At this phase necessary locks will be acquired, buffers read from shared pool or disk, and if needed, WAL entries added.
  6. Return results.

How the WAL works

Roughly:

  • change is written to WAL buffer in memory that describes the change
    • two phase process, first acquire one of NUM_XLOGINSERT_LOCKS spin lock, request an exclusive byte range in WAL buffers. By default 8 locks in total are available.
    • once byte range is acquired, copy new WAL record into WAL buffers
  • only then is the corresponding page (either/both heap/index) in memory updated and marked as dirty
  • on COMMIT the backend will write the WAL entry to disk and only then report success (unless synchronous_commit=off)
    • acquire lock WALWriteLock, then write all pending records, including records from other backends, to disk
    • if written successfully, report success to the client

The exact steps differ based on the type of query:

  • SELECT has to find free space via FSM lookup, write the new tuple to the WAL, add the new tuple in the shared buffer, and update indexes.
  • DELETE marks a page as deleted by setting xmax to the current XID, log the delete to the WAL. Cleanup of heap pages and index pages is done by vacuum.
  • UPDATE performs a delete and insert. If no index or page updates are needed XLOG_HEAP_HOT_UPDATE is written to the WAL, XLOG_HEAP_UPDATE otherwise.

Sources

I’ve drawn heavily on these excellent sources:

ilikeorangutans

Jakob Külzer’s personal blog


2026-05-02

Post Edits
  • 2026-05-04: Final version covering structure and query