# Design decisions (ADR-style)

Each entry: the decision, the rationale, and the alternative we rejected.

---

## D1 — Raft replicates control state, NOT data

**Decision.** The Raft log carries only small cluster-control entries (membership, leader term, failover decisions, fencing tokens). The actual database content is replicated by **Postgres physical (WAL) streaming replication**.

**Why.** "Use Raft to replicate the full database" means putting every write through a Raft log and rebuilding storage on top of it. That is what CockroachDB and TiKV do, and it is a *multi-year* engineering effort that throws away Postgres's WAL, MVCC, and on-disk format. It would also make us *not Postgres*. Physical WAL replication already copies **everything we want** (heap, indexes, and the shared catalog `pg_authid` with roles/users + SCRAM verifiers) correctly and fast. So Raft's job is consensus on *who leads*, not moving bytes.

**Rejected.** Data-over-Raft (reimplementing the storage engine). Too big, and it discards the entire reason to stay on Postgres.

**Consequence.** "Replicates roles and DDL" is satisfied by physical replication, not by us. We must never claim otherwise.

---

## D2 — Physical replication, not logical

**Decision.** Use streaming **physical** replication as the data plane.

**Why.** Only physical replication copies **global objects**: roles live in the cluster-wide `pg_authid`, which logical replication and pgactive explicitly do **not** carry. Since "replicate roles + DDL + everything" is the whole point, physical is the only fit. Replicas are read-only; that's acceptable (single-writer HA, like a Mongo replica set).

**Rejected.** Logical replication (no roles/DDL/globals) and active-active (pgactive: no global objects, conflict hell). Both fail the core requirement.

---

## D3 — Extension + bgworker, not a standalone daemon

**Decision.** Ship as a Postgres **extension** whose **background worker** hosts the Raft node and orchestration.
Rely on the existing OS supervisor (systemd / Docker `restart: always`) for the Postgres *process* lifecycle.

**Why.** The user wants "a Postgres plugin," and most of the work *can* live in a bgworker:

- a node whose Postgres is down doesn't need to vote (survivors hold quorum);
- standbys being promoted are *up*, so their bgworker can `pg_promote()` itself;
- a deposed primary that's up runs its own bgworker and self-demotes on quorum loss.

The one thing a bgworker genuinely cannot do is **start a Postgres that is down** (chicken/egg), so we delegate *only that* to systemd/Docker, which every deployment already has.

**Rejected.** A separate Go/Rust daemon à la Patroni/Stolon. It would work, but it's heavier and contradicts the "plugin" goal. We accept one honest limitation (process lifecycle is the supervisor's job) to keep the plugin form factor.

**Risk.** A *hung but not dead* Postgres can wedge its bgworker. Mitigation: a watchdog timer that makes a stuck node refuse/relinquish leadership.

---

## D4 — Embedded Raft, no external DCS

**Decision.** Embed Raft (`openraft`) inside the extension. No etcd, Consul, or k8s.

**Why.** The stated goal is fewer moving parts than CloudNativePG/Patroni-etcd. An embedded quorum removes an entire external system to deploy, secure, and operate. Patroni's `raft` (pysyncobj) mode proves the pattern is viable; we do it natively and lighter.

**Rejected.** External DCS (operational weight) and single-monitor designs like pg_auto_failover (the monitor is itself a SPOF and not a quorum).

---

## D5 — Rust + pgrx + openraft

**Decision.** Implement in Rust: the extension via **pgrx**, consensus via **openraft** (async, event-driven Raft) hosted on a small embedded tokio runtime inside the background worker.

**Why.** "Light on resources" rules out the JVM and argues against a Go control plane; Rust gives a small static `.so` with no GC pauses in the failover path. pgrx keeps us in the same toolchain as the ParadeDB-style stack already in use.
openraft leaves storage and transport to us (a single versioned `Decision` value plus a tiny TCP/JSON RPC; both are trivial here).

---

## D6 — Quorum-only, odd node counts

**Decision.** Support 3 and 5 nodes; refuse to pretend 2-node is safe.

**Why.** Raft needs a majority; 2 nodes can't form a safe majority on partition (both think they're right, or neither can proceed). 3 tolerates 1 failure, 5 tolerates 2. We document this loudly rather than offering a footgun.

---

## D7 — Safety over availability by default

**Decision.** Default to **never two writable primaries**, even at the cost of a brief write outage during failover. Synchronous (zero-loss) mode is opt-in.

**Why.** A search/cache can be rebuilt; a system of record that double-writes is corrupted. Fencing + quorum + most-advanced-replica selection prioritize correctness. Operators who want zero data loss enable quorum-sync and accept the latency.
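
The quorum arithmetic from D6 and the "quorum + most-advanced-replica" rule from D7 can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the extension's actual code: `NodeStatus`, `replay_lsn` (a WAL position flattened to a `u64`), and `pick_candidate` are hypothetical names introduced here, and fencing is left out entirely.

```rust
/// Hypothetical snapshot of one cluster member as seen by the Raft leader.
/// (Illustrative type; not taken from the design above.)
#[derive(Debug, Clone, PartialEq)]
struct NodeStatus {
    name: &'static str,
    reachable: bool,
    /// Assumption: last replayed WAL position, flattened to a byte offset.
    replay_lsn: u64,
}

/// Smallest number of nodes that forms a majority of `cluster_size`.
fn majority(cluster_size: usize) -> usize {
    cluster_size / 2 + 1
}

/// How many nodes may fail while a majority still survives.
fn tolerated_failures(cluster_size: usize) -> usize {
    cluster_size.saturating_sub(majority(cluster_size))
}

/// Choose a promotion candidate, preferring safety over availability:
/// without a reachable majority nobody is promoted; with one, the
/// reachable node with the highest replay LSN wins.
fn pick_candidate(nodes: &[NodeStatus]) -> Option<&NodeStatus> {
    let reachable = nodes.iter().filter(|n| n.reachable).count();
    if reachable < majority(nodes.len()) {
        return None; // no quorum: accept a write outage rather than risk two primaries
    }
    nodes
        .iter()
        .filter(|n| n.reachable)
        .max_by_key(|n| n.replay_lsn)
}
```

With these definitions, `tolerated_failures(2)` is 0 while `tolerated_failures(3)` is 1 and `tolerated_failures(5)` is 2, which is the reason 2-node clusters are refused. Returning `None` when quorum is missing mirrors the D7 default: a brief outage is preferred over promoting a node that a majority cannot vouch for.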