# Architecture ## 1. Split of responsibilities ``` ┌─────────────────────────── node A (LEADER / primary) ──────────────────────────┐ │ Postgres │ │ ├─ pg_replica extension (.so) │ │ │ ├─ background worker: Raft node ◄────raft RPC────┐ │ │ │ │ • election, membership, cluster state │ │ │ │ │ • health checks, failover state machine │ │ │ │ │ • executes promote / rewind / reconfigure │ │ │ │ └─ SQL API: pg_replica.status() / .members() / .failover() / .add_node() │ │ └─ WAL ──────────────physical streaming replication──────────────┐ │ └──────────────────────────────────────────────────────────────────── │ ──────────┘ ▲ raft RPC │ WAL │ ▼ ┌──────────── node B (follower / standby) ──┐ ┌──────── node C (follower / standby) ──┐ │ Postgres + pg_replica bgworker (Raft) │ │ Postgres + pg_replica bgworker (Raft) │ │ recovery mode, primary_conninfo → A │ │ recovery mode, primary_conninfo → A │ └───────────────────────────────────────────┘ └───────────────────────────────────────┘ supervisor (systemd / docker restart=always) keeps each Postgres process alive; pg_replica decides its ROLE. clients reach the primary via libpq multi-host (target_session_attrs=read-write) or an HAProxy that reads pg_replica state. ``` **Two planes, deliberately separate:** - **Control plane (Raft, tiny):** elects the leader, tracks membership and the authoritative "current primary + replication topology," and drives failover. The Raft log only ever carries small control entries (membership changes, leader term, failover decisions, fencing tokens). Megabytes, not gigabytes. - **Data plane (Postgres WAL, large):** the actual rows/indexes/roles/DDL move over Postgres's own physical streaming replication. We configure it; we do not reimplement it. This is what makes "replicate *everything* including roles and DDL" true for free. See [DECISIONS.md D1](DECISIONS.md#d1-raft-replicates-control-state-not-data) for why Raft must **not** carry the data itself. ## 2. Components ### 2.1 `pg_replica` extension (Rust, via `pgrx`) Built with [`pgrx`](https://github.com/pgcentralfoundation/pgrx) so it stays in the same toolchain as ParadeDB-style extensions. Two parts: 1. **Background worker `pg_replica supervisor`** — registered via `RegisterBackgroundWorker` at `shared_preload_libraries`. Hosts: - the **Raft node** ([`openraft`](https://github.com/databendlabs/openraft), an async Raft) on a small embedded tokio runtime, with a peer transport (TCP, length-prefixed JSON RPC) that also multiplexes LSN gossip; - a durable **Raft log + snapshot store** in `$PGDATA/pg_replica/` (separate from Postgres WAL); - the **health monitor** (heartbeats + libpq `SELECT 1` probes of peers); - the **failover state machine** (detect → elect → choose → fence → promote → reconfigure → rejoin); - executors that call `pg_promote()`, edit `postgresql.auto.conf` (`primary_conninfo`), manage replication slots, and shell out to `pg_basebackup` / `pg_rewind`. 2. **SQL surface** (functions + views): - `pg_replica.status()` → this node's role, term, leader, LSNs, lag. - `pg_replica.members()` → cluster membership + health. - `pg_replica.failover([target])` → operator-initiated switchover. - `pg_replica.add_node(dsn)` / `pg_replica.remove_node(id)` → membership. - GUCs: `pg_replica.peers`, `pg_replica.raft_port`, `pg_replica.node_id`, `pg_replica.synchronous`, `pg_replica.failover_timeout_ms`, `pg_replica.bootstrap`. ### 2.2 The supervisor you already have `pg_replica` does **not** start/stop the Postgres *process* — a chicken/egg an in-Postgres worker can't solve. That job stays with **systemd** or the **Docker restart policy** (`restart: always`). pg_replica decides each running node's *role*; the OS supervisor guarantees the process keeps trying to run. This is the honest reason it can be "just an extension" and still do failover (see [DECISIONS.md D3](DECISIONS.md#d3-extension-bgworker--not-a-standalone-daemon)). ### 2.3 Client routing (not built, recommended) pg_replica only **publishes** who the primary is. Pick one: - **libpq multi-host:** `host=A,B,C ... target_session_attrs=read-write` — driver finds the writable node. Zero infra. - **HAProxy** with an httpchk against `pg_replica`'s tiny health endpoint (`/primary` → 200 only on the leader). - **Floating/virtual IP** moved by the leader on promotion. ## 3. Failover flow (primary dies) 1. **Detect.** Followers' health monitors miss `pg_replica.failover_timeout_ms` of heartbeats from the leader. Raft leadership lease expires. 2. **Elect (control plane).** Surviving nodes hold a Raft election; only the **majority partition** can elect — a minority (incl. an isolated old primary) cannot, which is what prevents split-brain. 3. **Self-fence the loser.** A primary that loses quorum **demotes itself** (drops to read-only / shuts down) via a loss-of-quorum watchdog, *before* a new one is promoted. A monotonic **fencing token** (Raft term) guards against a paused old primary resuming. 4. **Choose the most-advanced replica.** The new Raft leader queries each survivor's `pg_last_wal_replay_lsn()` / `pg_last_wal_receive_lsn()` and picks the highest, minimizing data loss. Decision is committed to the Raft log. 5. **Promote.** `SELECT pg_promote()` on the chosen standby. 6. **Reconfigure the rest.** Remaining standbys get `primary_conninfo` repointed at the new primary; slots recreated; reload. 7. **Rejoin the loser.** When the old primary returns, `pg_rewind` it against the new primary to discard diverged WAL, then start it as a standby. 8. **Publish.** New primary recorded in Raft state; routing layer (HAProxy/VIP) follows. ## 4. Data-loss posture (sync vs async) Configurable via `pg_replica.synchronous`: - **async (default):** lowest latency; failover may lose the last unreplicated commits (bounded by replica lag). - **quorum-sync:** sets `synchronous_standby_names = 'ANY 1 (nodeB,nodeC)'` so a commit is acked by ≥1 standby → failover loses **nothing**. Cost: write latency, and writes stall if too few sync standbys are up. pg_replica keeps `synchronous_standby_names` in step with live membership so a single failure never wedges writes. ## 5. The hard problems (and our stance) | Problem | Stance | |--------|--------| | **Split-brain** | Raft majority quorum + self-fence on quorum loss + fencing token (term). Never two writable primaries. | | **Promoting a stale replica** | Always pick highest LSN among survivors; `pg_rewind` the others. | | **Old primary resurrecting** | Loss-of-quorum watchdog demotes it; term-based fencing token rejects its stale writes/role. | | **Quorum math** | Odd cluster sizes (3 → tolerate 1, 5 → tolerate 2). 2-node HA is impossible safely; document it. | | **Raft node dies with Postgres** | Fine: a down node doesn't need to vote; survivors keep quorum. OS supervisor restarts Postgres → bgworker rejoins. | | **Hung (not dead) Postgres** | Watchdog timer; if the bgworker can't make progress, it stops accepting the leader role. | | **Bootstrap / new replica** | `pg_basebackup` from the leader, register in Raft, stream. | | **Network partition flapping** | Election timeouts + leader leases + openraft's pre-vote to avoid term churn. | ## 6. On-disk / network footprint - Raft state: `$PGDATA/pg_replica/{raft-log, snapshot, meta}` — small. - Peer transport: one TCP port per node (`pg_replica.raft_port`), mTLS optional. - No external store, no JVM, no Go control plane. Memory target: a few MB of resident bgworker overhead beyond Postgres itself.