> **Plain-language companion:** [v0.21.0.md](v0.21.0.md) ## v0.21.0 — Correctness, Safety & Test Hardening **Status: ✅ Released (2026-07-16).** Driven by findings in [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md). > **Release Theme** > This release closes the last known data-correctness gap (EC-01 JOIN delta > phantom rows), reduces the `unsafe` code surface, expands unit-test coverage > in three large untested modules, adds a parser fuzz target, and adds a > crash-recovery test for the bgworker. A shadow/canary mode for > `alter_stream_table` makes migrations of critical stream tables safer, and a > `refresh.rs` module split into focused sub-modules reduces change risk. > A Performance Tuning Cookbook consolidates scattered advice into an operator > reference. ### EC-01 Fix — JOIN Delta Phantom Rows | Item | Description | Effort | Ref | |------|-------------|--------|-----| | EC01-0 | **Q15 IMMEDIATE-mode stop-gap.** Add Q15 to `IMMEDIATE_SKIP_ALLOWLIST` pending the EC-01 fix; superseded by EC01-3 which removes it once the fix lands. | XS (1h) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.4 | | EC01-1 | **Fix Part 1 row-id hash collision.** When EC-01 splits Part 1 into 1a (ΔQ⋈R₁) and 1b (ΔQ⋈R₀), hash only the left-side PK on Part 1b so both halves produce the same `__pgt_row_id` and weight aggregation cancels them correctly. | 3–5d | [PLAN_EDGE_CASES.md](plans/PLAN_EDGE_CASES.md) §EC-01; `src/dvm/operators/join.rs` L234–245 | | EC01-2 | **PH-D1 phantom cleanup.** Verify PH-D1 DELETE+INSERT handles converged row ids from EC01-1; extend to cover prior-cycle phantom rows already in the stream table. | 1–2d | `src/refresh.rs` L4991–5005 | | EC01-3 | **TPC-H Q07 + Q15 regression gate.** Remove Q07 from DIFFERENTIAL skip list; remove Q15 from IMMEDIATE skip list. Add `test_tpch_q07_ec01b_combined_delete` deterministic pass assertion. | 1d | `tests/e2e_tpch_tests.rs` L92–104 | | EC01-4 | **Multi-cycle phantom property test.** 5,000-iteration proptest: delete right-side row while left changes; verify zero phantom accumulation after N cycles. | 1d | `src/dvm/operators/join.rs` | ### Safety & Code Quality | Item | Description | Effort | Ref | |------|-------------|--------|-----| | SAF-1 | **Convert production `.unwrap()` in `sublinks.rs` to `?`.** 28 sites in `src/dvm/parser/sublinks.rs` (e.g. `from_list.head().unwrap()`, `get_ptr(i).unwrap()`) converted to `ok_or(PgTrickleError::UnsupportedPattern)`. | 2d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.2 | | SAF-2 | **Unsafe reduction half-pass.** Add `list_nth_safe()` helper returning `Option>`; group repeated `pg_sys::*` FFI calls into safe façades in `src/dvm/parser/types.rs`. Target: ≥40% reduction in `unsafe` block count. | 1wk | [plans/safety/PLAN_REDUCED_UNSAFE.md](plans/safety/PLAN_REDUCED_UNSAFE.md) | | SAF-3 | **`clippy::unwrap_used` lint gate.** Add `#![deny(clippy::unwrap_used)]` in `lib.rs` outside `#[cfg(test)]`, with `#[allow]` on justified invariant sites in `dag.rs`. | 1d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.2 | | OP-6 | **Non-deterministic function warning / rejection.** Reject or warn at `create_stream_table` time if query uses `now()`, `random()`, volatile UDFs without explicit `non_deterministic => true`. Pre-v1.0 safety gate. | S (2d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.6 | ### Test Coverage | Item | Description | Effort | Ref | |------|-------------|--------|-----| | TEST-1 | **Unit tests for `src/api/helpers.rs` (2.5k LOC).** 25+ unit tests covering query validation, schema helpers, and CDC orchestration utilities. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 | | TEST-2 | **Unit tests for `src/api/diagnostics.rs` (1.5k LOC).** 15+ unit tests covering `explain_st`, `health_summary`, and `cache_stats` formatting logic. | 2d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 | | TEST-3 | **Unit tests for `src/dvm/parser/rewrites.rs` (5.9k LOC).** 30+ unit tests covering each of the 7 rewrite passes: view inlining, DISTINCT ON, GROUPING SETS, scalar SSQ in WHERE, correlated SSQ in SELECT, SubLinks in OR, multi-PARTITION BY windows. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 | | TEST-4 | **Parser fuzz target (`cargo-fuzz`).** Differential fuzz: feed random SQL to the pg_trickle parser and verify it never panics; compare accepted/rejected decisions against plain `SELECT`. Target: 1h of fuzzing with zero panics. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.2 | | TEST-5 | **Crash-recovery bgworker resilience test.** `pg_ctl stop -m immediate` mid-refresh; verify: no unfinalised `pgt_refresh_history` entries, WAL decoder resumes from `confirmed_lsn`, change buffer is consistent. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.3 | ### Architecture | Item | Description | Effort | Ref | |------|-------------|--------|-----| | ARCH-1 | **Split `src/refresh.rs` (8.4k LOC) into 4 sub-modules.** `refresh/orchestrator.rs` (dispatch, status), `refresh/codegen.rs` (delta SQL generation), `refresh/phd1.rs` (PH-D1 phantom delete), `refresh/merge.rs` (MERGE strategy). Zero behaviour change — pure reorganisation. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.4 | | ARCH-2 | **Recursive CTE fallback observability.** Log `NOTICE: falling back to FULL refresh — defining query contains WITH RECURSIVE`; expose `refresh_reason = 'recursive_cte_fallback'` tag in Prometheus metrics and `pgt_refresh_history`. | 1d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.5 | ### Operational Features | Item | Description | Effort | Ref | |------|-------------|--------|-----| | OPS-1 | **Shadow/canary mode for `alter_stream_table`.** Optional `dry_run_shadow => true` parameter: materialises new query into `pgt_shadow_` on the same schedule; `pgtrickle.canary_diff(name)` diffs against the live table. `pgtrickle.canary_promote(name)` atomically swaps. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.5 | | OP-2 | **Prometheus HTTP endpoint in bgworker.** Tiny HTTP server (port configurable via `pg_trickle.metrics_port`) emitting all monitoring metrics in OpenMetrics format. Removes "bring your own exporter" hurdle. | S (1w) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.6 | | OP-3 | **`pgtrickle.pause_all()` / `resume_all()` helpers.** Idempotent SQL wrappers for suspending all stream tables during maintenance (e.g. `pg_dump` of source tables). | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 | | OP-4 | **`pgtrickle.refresh_if_stale(name, max_age)` convenience wrapper.** Application-level staleness gating without custom procedural code. | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 | | OP-5 | **`pgtrickle.stream_table_definition(name)` helper.** Single-row fetch of original query, refresh mode, schedule, and status for auditing / blue-green migrations. | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 | ### Documentation | Item | Description | Effort | Ref | |------|-------------|--------|-----| | DOC-1 | **Performance Tuning Cookbook.** New `docs/PERFORMANCE_COOKBOOK.md`: symptom → likely cause → GUC to tune → measurement rows. Consolidates advice from FAQ, TROUBLESHOOTING, SCALING, and BENCHMARK. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §7.2 | ### Implementation Phases | Phase | Description | Duration | |-------|-------------|----------| | EC01 | EC-01 fix: Q15 stop-gap + `join.rs` hash + `refresh.rs` PH-D1 + TPC-H validation | Days 1–8 | | SAF | Safety pass: `unwrap`→`?`, unsafe reduction, lint gate, volatile-fn warning | Days 9–15 | | TEST | Unit test campaign (3 files) + fuzz target + resilience test | Days 16–24 | | ARCH | `refresh.rs` split + recursive CTE observability | Days 25–29 | | OPS+DOC | Shadow/canary mode + Prometheus endpoint + API helpers + Performance Cookbook | Days 30–40 | > **v0.21.0 total: ~6–8 weeks** (EC-01 fix + safety hardening + API ergonomics + Prometheus endpoint + test coverage + module refactor + shadow mode + docs) **Exit criteria:** - [x] EC01-0: Q15 added to `IMMEDIATE_SKIP_ALLOWLIST` as stop-gap - [x] EC01-1/EC01-2: `test_tpch_q07_ec01b_combined_delete` passes deterministically - [x] EC01-3: Q07 and Q15 removed from IMMEDIATE/DIFFERENTIAL skip allowlists - [x] EC01-4: Multi-cycle phantom proptest passes 5,000 iterations - [x] SAF-1: All 28 production `.unwrap()` sites in `sublinks.rs` converted to `?` - [x] SAF-2: `unsafe` block count reduced by ≥40% - [x] SAF-3: `clippy::unwrap_used` lint gate passes with zero violations in non-test code - [x] OP-6: `create_stream_table` warns or rejects queries using `now()`, `random()`, volatile UDFs without `non_deterministic => true` - [x] TEST-1/2/3: ≥70 new unit tests across 3 previously-untested files - [x] TEST-4: Fuzz target runs 1h with zero panics - [x] TEST-5: Crash-recovery test passes deterministically - [x] ARCH-1: `refresh.rs` split into 4 sub-modules; all existing tests pass unchanged - [x] ARCH-2: `refresh_reason = 'recursive_cte_fallback'` visible in Prometheus/NOTIFY - [x] OPS-1: `canary_diff()` / `canary_promote()` API functional with E2E tests - [x] OP-2: Prometheus HTTP endpoint accessible at `pg_trickle.metrics_port`; all monitoring metrics present - [x] OP-3: `pgtrickle.pause_all()` / `resume_all()` work idempotently; E2E test passes - [x] OP-4: `pgtrickle.refresh_if_stale(name, max_age)` correctly gates refresh by age - [x] OP-5: `pgtrickle.stream_table_definition(name)` returns accurate single-row result - [x] DOC-1: `docs/PERFORMANCE_COOKBOOK.md` published - [x] Extension upgrade path tested (`0.20.0 → 0.21.0`) - [x] `just check-version-sync` passes ---