# v0.51.0 — Citus Chaos Resilience & Documentation Truth

> **Theme:** Prove that the Citus distributed integration survives real failure
> scenarios; remove the deprecated event_driven_wake dead weight; bring every
> piece of documentation into full alignment with the implemented code.

## Why This Release

Citus distributed support has been shipping since v0.32.0 but has never had a chaos or resilience test suite. There are zero tests validating behaviour under node failure, shard rebalance, or network partition — scenarios that production deployments encounter routinely. This is the highest remaining test-coverage risk in the project.

Simultaneously, several documentation items have drifted: the pg_tide extraction boundary is not clearly described in the architecture doc, the deprecated `event_driven_wake` GUC still clutters configuration docs and emits runtime warnings, and the recursive CTE strategy selection logic is a complete black box for operators debugging refresh behaviour. This release closes all of these together as the final gate before v1.0.

## Deliverables

### FEAT-10-01 — Citus Chaos Test Rig

Build a docker-compose-based chaos test rig for Citus distributed scenarios in `tests/e2e_citus_chaos_tests.rs`. The rig requires:

- `docker/docker-compose.citus.yml` — 1 coordinator + 3 worker nodes
- Utility functions in `tests/common/citus_chaos.rs` for injecting failures

**Three chaos scenarios (all mandatory for release):**

**Scenario 1: Coordinator restart during active refresh**

1. Create a distributed stream table across 3 workers.
2. Start a refresh cycle.
3. Restart the coordinator container mid-refresh.
4. Verify the refresh retries and completes correctly on reconnect.
5. Run 5 subsequent refresh cycles; assert no phantom rows or missing rows.

**Scenario 2: Worker node kill with shard redistribution**

1. Create distributed source tables with data distributed across 3 workers.
2. Kill one worker container.
3. Trigger a shard rebalance (`SELECT rebalance_table_shards()`).
4. Verify the stream table refreshes correctly after rebalance completes.
5. Assert CDC change buffers are consistent post-recovery.

**Scenario 3: Network partition simulation**

1. Use `docker network disconnect` to isolate one worker.
2. Insert rows on the remaining workers.
3. Reconnect the isolated worker.
4. Verify the stream table converges to the correct state within 3 refresh cycles with no data loss.

The chaos tests are added to `stability-tests.yml` (not to the `ci.yml` PR gate) and run nightly alongside the existing G17-SOAK and G17-MDB tests.

### CQ-10-02 — Remove Deprecated event_driven_wake GUC

The `pg_trickle.event_driven_wake` GUC has been non-functional because PostgreSQL background workers cannot use `LISTEN`/`NOTIFY`. It has been deprecated with default `false` since v0.39.0, yet it still emits a runtime WARNING when set to `true` and keeps dead code paths alive. Remove it entirely:

- GUC registration in `src/config.rs` (lines 375–419)
- Runtime warning emission in `src/scheduler/mod.rs` (lines 2557–2563)
- `event_driven = false` stub in scheduling logic
- `docs/CONFIGURATION.md` entry (replace with a migration note pointing to the `scheduler_interval_ms` + `wake_debounce_ms` alternatives)

Add to `CHANGELOG.md` under a **Breaking Changes** heading for v0.51.0:

> `pg_trickle.event_driven_wake` has been removed. This GUC had no effect
> since v0.39.0. Remove it from `postgresql.conf` to avoid an unknown GUC
> warning on upgrade.

Add a SQL migration step that ignores the removed GUC gracefully:

```sql
-- v0.51: event_driven_wake removed; no data change needed
```

### DOC-10-01 — ARCHITECTURE.md pg_tide Boundary

`docs/ARCHITECTURE.md` still describes outbox, inbox, and the relay binary as pg_trickle subsystems. Since v0.46.0 these have been extracted to `trickle-labs/pg-tide`.
Add a new **§ pg_tide Integration** section to ARCHITECTURE.md that:

- Explains the v0.46.0 extraction decision (focused extension boundary, separate release cadence for event messaging).
- Describes what remains in pg_trickle: the `attach_outbox()` integration hook and change buffer subscription for pg_tide consumers.
- Describes what lives in pg_tide: `enable_outbox()`, `poll_outbox()`, consumer groups, claim-check mode, the relay binary.
- Links to the pg_tide repository for full API documentation.

Remove the `src/api/outbox.rs` and `src/api/inbox.rs` references from the module layout diagram and replace them with the integration boundary description.

### DOC-10-02 — Configuration Deprecation and Removal Banners

`docs/CONFIGURATION.md` lists deprecated GUCs without clear visual markers. After the removal of `event_driven_wake`, audit all remaining deprecated GUC entries and add consistent formatting:

- **Removed in v0.51.0**: `event_driven_wake` — add a migration note.
- **Deprecated (accepted, ignored)**: `merge_planner_hints`, `user_triggers='on'` — add a `> ⚠️ **Deprecated** — accepted for backwards compatibility but has no effect. Will be removed in a future major version.` callout.

### DOC-10-03 — Recursive CTE Strategy Selection Heuristic

`docs/ARCHITECTURE.md` mentions three recursive CTE strategies (semi-naive, DRed, recomputation fallback) but does not document the selection heuristic. This is a black box for operators debugging slow or incorrect recursive-view refreshes.

Add a new **§ Recursive CTE Strategy Selection** subsection under the DVM Engine section:

1. **Selection criteria**:
   - Tier 1 (inline expansion): CTE referenced once, non-recursive → expand inline into the query; no differential overhead.
   - Tier 2 (shared delta): CTE referenced 2+ times, non-recursive → single delta computation shared across all reference sites.
   - Tier 3a (semi-naive): CTE is recursive with monotone operators only (UNION ALL, no NOT EXISTS / aggregation) → semi-naive evaluation with frontier-bounded delta.
   - Tier 3b (DRed): CTE is recursive with non-monotone operators and base tables have primary keys → DRed (delete-and-rederive) delta maintenance.
   - Tier 3c (recomputation): CTE is recursive with non-monotone operators and no primary keys, or there is a cycle in the dependency graph → full recomputation.
2. **Observability**: Document that `explain_stream_table(st_name)` returns a `recursive_cte_strategy` field showing which tier was selected and why.
3. **Example EXPLAIN output**: Show a concrete example for a recursive hierarchical closure query demonstrating Tier 3a (semi-naive) selection.

### COR-10-02 — Document CDC-Fires-When-Disabled Behavior

`pg_trickle.enabled = false` stops the scheduler from dispatching refreshes, but CDC triggers continue to fire and write to change buffers. This is by design (it keeps buffers ready for immediate use when the extension is re-enabled) but is undocumented and surprises operators who expect `enabled = false` to be a complete quiet mode.

Add to `docs/CONFIGURATION.md` under `pg_trickle.enabled`:

> **Note on CDC triggers**: Setting `enabled = false` stops the scheduler from
> refreshing stream tables but does **not** disable CDC trigger execution.
> Change buffers continue to accumulate. This is intentional: when the extension
> is re-enabled, stream tables can refresh immediately from the buffered changes
> rather than performing a full table scan.
>
> To fully quiesce CDC overhead during extended maintenance, use
> `pgtrickle.drain()` before disabling, then `DROP TRIGGER` the CDC triggers
> manually and recreate them via `pgtrickle.repair_stream_table()` when
> re-enabling.
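The DOC-10-03 selection criteria reduce to a pure decision function over a handful of CTE properties. A minimal sketch of that decision tree, assuming illustrative type and field names that do not mirror pg_trickle's actual planner internals:

```rust
// Hypothetical model of the DOC-10-03 tier-selection criteria.
// `CteProfile` and its fields are illustrative, not pg_trickle types.

#[derive(Debug, PartialEq)]
enum CteStrategy {
    InlineExpansion, // Tier 1
    SharedDelta,     // Tier 2
    SemiNaive,       // Tier 3a
    DRed,            // Tier 3b
    Recomputation,   // Tier 3c
}

struct CteProfile {
    reference_count: u32,
    is_recursive: bool,
    monotone_only: bool, // UNION ALL only; no NOT EXISTS / aggregation
    base_tables_have_pks: bool,
    dependency_cycle: bool,
}

fn select_strategy(cte: &CteProfile) -> CteStrategy {
    if !cte.is_recursive {
        // Tiers 1 and 2 split on how many sites reference the CTE.
        return if cte.reference_count <= 1 {
            CteStrategy::InlineExpansion
        } else {
            CteStrategy::SharedDelta
        };
    }
    if cte.dependency_cycle {
        // A cycle in the dependency graph forces full recomputation (Tier 3c).
        return CteStrategy::Recomputation;
    }
    if cte.monotone_only {
        CteStrategy::SemiNaive
    } else if cte.base_tables_have_pks {
        CteStrategy::DRed
    } else {
        CteStrategy::Recomputation
    }
}

fn main() {
    // A recursive hierarchy-closure CTE with UNION ALL only → Tier 3a,
    // matching the example EXPLAIN output DOC-10-03 calls for.
    let closure = CteProfile {
        reference_count: 1,
        is_recursive: true,
        monotone_only: true,
        base_tables_have_pks: true,
        dependency_cycle: false,
    };
    assert_eq!(select_strategy(&closure), CteStrategy::SemiNaive);
}
```

Writing the heuristic this way also suggests what the `recursive_cte_strategy` EXPLAIN field needs to report: the chosen tier plus which predicate (recursion, monotonicity, keys, cycle) decided it.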
## Testing

- `just test-unit` — passes; deprecated GUC code paths fully removed
- `just test-integration` — passes
- `just test-light-e2e` — passes
- `just test-soak` — G17-SOAK still green after `event_driven_wake` removal
- Citus chaos rig: `just test-citus-chaos` (new recipe) — all three scenarios pass: coordinator restart, worker kill + rebalance, network partition
- Documentation: `scripts/gen_catalogs.py --check` passes with updated GUC list
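The FEAT-10-01 assertions ("verify the refresh retries and completes on reconnect", "converges within 3 refresh cycles") all reduce to bounded polling of a condition. A sketch of the kind of helper `tests/common/citus_chaos.rs` could expose — the name and signature are assumptions for illustration, not an existing project API:

```rust
use std::thread::sleep;
use std::time::Duration;

/// Poll `check` up to `max_attempts` times, sleeping `interval` between
/// attempts. Returns Some(attempts used) on success, or None if the
/// condition never held (e.g. the stream table never converged).
fn wait_until<F: FnMut() -> bool>(
    mut check: F,
    max_attempts: u32,
    interval: Duration,
) -> Option<u32> {
    for attempt in 1..=max_attempts {
        if check() {
            return Some(attempt);
        }
        sleep(interval);
    }
    None
}

fn main() {
    // Simulate a refresh that only succeeds on the third poll, as it
    // might after a coordinator restart (Scenario 1).
    let mut polls = 0;
    let result = wait_until(
        || {
            polls += 1;
            polls >= 3
        },
        5,
        Duration::from_millis(1),
    );
    assert_eq!(result, Some(3));
}
```

In the chaos tests, the closure would run a real row-count or checksum query against the stream table; Scenario 3's "within 3 refresh cycles" bound maps directly to `max_attempts = 3` with `interval` set to one scheduler period.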