> **Plain-language companion:** [v0.39.0.md](v0.39.0.md) ## v0.39.0 — Operational Truthfulness & Distributed Hardening **Status: Planned.** Derived from [plans/PLAN_OVERALL_ASSESSMENT_8.md](plans/PLAN_OVERALL_ASSESSMENT_8.md) §4-§13; action items A02-A08, A10, A12-A15. > **Release Theme** > v0.38.0 closes EC-01 and proves join correctness. v0.39.0 makes the operational > and distributed story honest. It fixes misleading docs (backpressure, wake), > wires partially-implemented features (`enforce_backpressure`, `event_driven_wake`), > adds generated configuration and upgrade docs, proves behavior under failure > injection and scale (Citus chaos, CDC hold semantics), and adds richer > diagnostics to explain why DIFFERENTIAL refresh succeeds, falls back, or > collapses instead of leaving maintainers to debug manually after merge. --- ### Features | ID | Title | Effort | Priority | |----|-------|--------|----------| | O39-1 | Truthful backpressure semantics with hysteresis or explicit deprecation | M | P0 | | O39-2 | Wake truthfulness: docs, warnings, and test repair | S | P0 | | O39-3 | Generated `CONFIGURATION.md` plus corrected operator docs | M | P0 | | O39-4 | `UPGRADING.md` through v0.37 and explicit TRUNCATE/CDC semantics | S | P0 | | O39-5 | OpenTelemetry operator guide and live collector proof | M | P1 | | O39-6 | SQLSTATE-first SPI retry classification rollout | M | P1 | | O39-7 | Citus coordinator/worker E2E and chaos harness | L | P1 | | O39-8 | Durable capture-and-hold vs discard-and-reinit semantics | M | P1 | | O39-9 | Generated-SQL artifact API and richer `explain_stream_table()` visualizer | M | P1 | | O39-10 | TPC-H per-query `EXPLAIN` artifacts and p99/temp-spill regression gate | M | P1 | | O39-11 | SQLancer light PR mode for DIFF-vs-FULL oracle coverage | M | P2 | | O39-12 | WAL decoder, merge SQL, DAG, and snapshot fuzz expansion | M/L | P2 | | O39-13 | Inbox/outbox reliability helpers plus property-test coverage | M | P2 | | O39-14 | PR-scoped upgrade E2E slice for SQL-facing changes | M | P2 | **O39-1 — Truthful backpressure semantics with hysteresis or explicit deprecation.** `pg_trickle.enforce_backpressure` currently looks operator-ready without an actual control path that suppresses or redirects writes safely. v0.39.0 either wires it into a real stateful path with explicit hysteresis and observable status (capture-and-hold mode keeps durable CDC while pausing refreshes), or demotes the GUC as experimental/reserved until the durable hold mode is properly designed. Either way, document the current `cdc_paused` discard semantics so operators know whether changes are captured, deferred, or dropped, and make the active mode visible in status output. **O39-2 — Wake truthfulness: docs, warnings, and test repair.** `event_driven_wake` cannot rely on `LISTEN` in a PostgreSQL background worker, yet docs and tests still imply low-latency event delivery. v0.39.0 makes the warning path explicit, rewrites the wake test so poll-only behavior cannot pass as event-driven, and documents a future latch-based wake path as separate work rather than as an implied current guarantee. **O39-3 — Generated `CONFIGURATION.md` plus corrected operator docs.** Configuration docs must be generated from source metadata so defaults and new GUCs stop drifting. The generated reference covers all known GUCs, corrects the `event_driven_wake` default and behavior, includes the v0.37 vector/trace GUCs, and aligns `docs/SCALING.md` and `docs/TROUBLESHOOTING.md` with the real code. **O39-4 — `UPGRADING.md` through v0.37 and explicit TRUNCATE/CDC semantics.** `docs/UPGRADING.md` stops at v0.34 even though the codebase is at v0.37. Add v0.34 → v0.35, v0.35 → v0.36, and v0.36 → v0.37 guidance, then document how TRUNCATE, CDC pause, and FULL invalidation behave so operators are not left to infer semantics from trigger code. **O39-5 — OpenTelemetry operator guide and live collector proof.** Trace context capture exists, but the roadmap must include proof that spans can be exported to a live collector without surprising failure behavior. Add an operator guide for OTLP/Jaeger/Tempo configuration plus a dockerized collector test that verifies success, timeout, and endpoint-failure handling. **O39-6 — SQLSTATE-first SPI retry classification rollout.** The SQLSTATE classifier exists, but hot paths still classify retryability from English message text. Introduce a single helper that constructs `PgTrickleError::SpiErrorCode` from SPI failures on scheduler, refresh, and catalog paths so retry behavior is locale-independent and stable across message wording changes. **O39-7 — Citus coordinator/worker E2E and chaos harness.** The assessment found a significant Citus surface with no real topology or chaos coverage. v0.39.0 adds a coordinator-plus-workers rig that verifies lease acquisition, worker death and restart, coordinator restart mid-lease, shard rebalance churn, and stale `pgt_worker_slots` cleanup. Distributed correctness stops being an assumption and becomes a tested contract. **O39-8 — Durable capture-and-hold vs discard-and-reinit semantics.** The current `cdc_paused` behavior is too easy to misread as a harmless throttle when it can actually drop change capture. Split the operator surface into an explicit capture-and-hold mode that keeps durable CDC while pausing refreshes, and a separate discard-and-reinitialize mode for emergencies. The active mode must be visible in status output so operators know which stream tables require catch-up or reinit. **O39-9 — Generated-SQL artifact API and richer `explain_stream_table()` visualizer.** Maintainers need to see why a stream table is DIFF-safe, why it fell back to FULL, which operators are row-id sensitive, and which generated SQL fragments exist for the current definition. Extend `explain_stream_table()` or add a dedicated generated-SQL artifact API that exposes operator-tree summary, SQL hashes, fallback reasons, and phase-level SQL snippets without turning on broad debug logging. **O39-10 — TPC-H per-query `EXPLAIN` artifacts and p99/temp-spill regression gate.** Aggregate benchmark dashboards are not enough for Q04/Q05/Q07/Q08/Q09/Q20/Q22 style collapse queries. v0.39.0 records generated SQL hashes, `EXPLAIN ANALYZE BUFFERS`, temp blocks, phase timings, and measured p50/p99 so CI can track query-shape regressions instead of only overall throughput. The gate must be baseline-driven, not hard-coded from wishful numbers. **O39-11 — SQLancer light PR mode for DIFF-vs-FULL oracle coverage.** Weekly SQLancer runs are valuable but too slow to protect every PR. Add a small oracle mode to PR CI that exercises a measured subset of SQLancer-generated workloads and compares DIFF against FULL so obvious query-engine regressions are caught before merge. **O39-12 — WAL decoder, merge SQL, DAG, and snapshot fuzz expansion.** Current fuzzing is concentrated in parser/config helpers. Expand it into the state machines and code generators most likely to fail under adversarial input: logical decoding payloads, merge-template inputs, DAG graph mutations, and snapshot/restore identifier and column-list generation. **O39-13 — Inbox/outbox reliability helpers plus property-test coverage.** Inbox/outbox helper tests exist, but they do not prove the actual delivery contract. Extract pure helpers for envelope encoding/decoding, dedup-key validation, lease math, offset transitions, replay behavior, and NULL payload handling, then cover them with property tests in addition to end-to-end flows. **O39-14 — PR-scoped upgrade E2E slice for SQL-facing changes.** Full upgrade E2E is currently daily/manual only. Add a smaller PR matrix that triggers when SQL scripts, `src/config.rs`, `src/cdc.rs`, or SQL-facing API files change so upgrade regressions are more likely to fail before merge. ### Test Coverage | ID | Title | Effort | Priority | |----|-------|--------|----------| | T-O39-1 | Citus chaos suite with worker and coordinator failure injection | L | P0 | | T-O39-2 | Durable hold vs discard semantics E2E | M | P0 | | T-O39-3 | Backpressure and wake truthfulness E2E coverage | M | P0 | | T-O39-4 | SQLancer light PR gate | M | P1 | | T-O39-5 | TPC-H artifact and p99/temp-spill regression workflow | M | P1 | | T-O39-6 | Inbox/outbox reliability property suite | M | P1 | | T-O39-7 | Live OpenTelemetry collector integration test | M | P1 | | T-O39-8 | Docs drift and upgrade-reference CI checks | S | P1 | **T-O39-1.** Create a topology-aware test harness with at least one coordinator and two workers. Kill workers mid-refresh, restart the coordinator during lease ownership, trigger rebalance churn, and verify the system neither loses change capture nor duplicates refresh ownership. **T-O39-2.** Exercise both control-plane modes (capture-and-hold vs discard). In capture-and-hold mode, writes continue to be captured while refreshes stop; in discard mode, status output and docs must clearly mark the need for reinitialization. **T-O39-3.** The wake test must assert a bound below the poll interval or explicitly assert poll-only warning behavior. Backpressure tests must create synthetic slot-lag conditions, validate hysteresis, and verify the operator-visible state matches whether writes are captured, held, or discarded. **T-O39-4.** Run a bounded SQLancer subset on PRs and compare DIFF vs FULL outputs. Keep the runtime small enough for PR CI, but publish seed values and failing queries so maintainers can reproduce any divergence quickly. **T-O39-5.** Publish per-query TPC-H artifacts for collapse-prone workloads, including SQL hash, `EXPLAIN`, temp blocks, and measured p99. Use baselines captured from the target CI hardware instead of invented thresholds. **T-O39-6.** Property tests verify envelope round-trips, dedup semantics, lease arithmetic, offset transitions, and replay rules. Keep end-to-end tests for transactional integration, but move the contract itself into fast deterministic coverage. **T-O39-7.** Run an integration test against a dockerized OTEL Collector or Jaeger instance. **T-O39-8.** Check for docs drift and confirm upgrade references are current for all versions since v0.34. ### Conflicts & Risks - **O39-1** and **O39-7** require substantial test infrastructure, but the assessment makes it clear this is no longer optional if Citus and distributed support are to remain roadmap claims. - **O39-2** and **O39-8** can increase retained change-buffer volume if operators hold for too long. The feature needs clear status visibility, limits, and runbook guidance rather than a hidden background state. - **O39-3/O39-4/O39-5** must use measured budgets. Operational docs only help if they match the actual code, and diagnostic gates only help if they are reliable enough not to be immediately disabled. ### Exit Criteria - [ ] O39-1: Truthful backpressure semantics or explicit deprecation; `cdc_paused` semantics documented - [ ] O39-2: Wake truthfulness corrected; test updated; poll-only warning present - [ ] O39-3: `CONFIGURATION.md` is auto-generated from source; includes all GUCs; defaults correct - [ ] O39-4: `UPGRADING.md` covers v0.34–v0.37; TRUNCATE/CDC semantics explicit - [ ] O39-5: OpenTelemetry operator guide and live collector test; propagation proven - [ ] O39-6: SQLSTATE-first retry classification on scheduler, refresh, and catalog paths - [ ] O39-7: Citus topology and chaos tests pass for lease acquisition, worker death/restart, coordinator restart, and rebalance churn - [ ] O39-8: Capture-and-hold and discard-and-reinit semantics are implemented and visible through status APIs - [ ] O39-9: Generated-SQL artifact export and explain visualizer expose DIFF/FULL reasoning without broad debug logging - [ ] O39-10: TPC-H artifact workflow records SQL hash, `EXPLAIN`, temp blocks, and measured p99 for collapse-prone queries - [ ] O39-11: SQLancer light PR mode runs on pull requests and publishes reproducible seeds for failures - [ ] O39-12: New fuzz targets exist for WAL decoding, merge SQL, DAG mutation, and snapshot column-list generation - [ ] O39-13: Inbox/outbox reliability contract covered by property tests - [ ] O39-14: SQL-facing PRs trigger a smaller upgrade E2E matrix - [ ] Extension upgrade path tested (`0.38.0 → 0.39.0`) - [ ] `just check-version-sync` passes