> **Plain-language companion:** [v0.21.0.md](v0.21.0.md)

## v0.21.0 — Correctness, Safety & Test Hardening

**Status: ✅ Released (2026-07-16).** Driven by findings in [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md).

> **Release Theme**
> This release closes the last known data-correctness gap (EC-01 JOIN delta
> phantom rows), reduces the `unsafe` code surface, expands unit-test coverage
> in three large untested modules, adds a parser fuzz target, and adds a
> crash-recovery test for the bgworker. A shadow/canary mode for
> `alter_stream_table` makes migrations of critical stream tables safer, and
> splitting `refresh.rs` into focused sub-modules reduces change risk.
> A Performance Tuning Cookbook consolidates scattered advice into an operator
> reference.

### EC-01 Fix — JOIN Delta Phantom Rows

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| EC01-0 | **Q15 IMMEDIATE-mode stop-gap.** Add Q15 to `IMMEDIATE_SKIP_ALLOWLIST` pending the EC-01 fix; superseded by EC01-3, which removes it once the fix lands. | XS (1h) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.4 |
| EC01-1 | **Fix Part 1 row-id hash collision.** When EC-01 splits Part 1 into 1a (ΔQ⋈R₁) and 1b (ΔQ⋈R₀), hash only the left-side PK on Part 1b so both halves produce the same `__pgt_row_id` and weight aggregation cancels them correctly. | 3–5d | [PLAN_EDGE_CASES.md](plans/PLAN_EDGE_CASES.md) §EC-01; `src/dvm/operators/join.rs` L234–245 |
| EC01-2 | **PH-D1 phantom cleanup.** Verify PH-D1 DELETE+INSERT handles converged row ids from EC01-1; extend to cover prior-cycle phantom rows already in the stream table. | 1–2d | `src/refresh.rs` L4991–5005 |
| EC01-3 | **TPC-H Q07 + Q15 regression gate.** Remove Q07 from the DIFFERENTIAL skip list; remove Q15 from the IMMEDIATE skip list. Add `test_tpch_q07_ec01b_combined_delete` as a deterministic pass assertion. | 1d | `tests/e2e_tpch_tests.rs` L92–104 |
| EC01-4 | **Multi-cycle phantom property test.** 5,000-iteration proptest: delete a right-side row while the left side changes; verify zero phantom accumulation after N cycles. | 1d | `src/dvm/operators/join.rs` |
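
EC01-1 depends on both halves of Part 1 sharing a `__pgt_row_id` so that weight aggregation can cancel them. The sketch below is a toy model of that idea, not the code in `join.rs`: `DeltaRow`, `row_id_for`, and `consolidate` are hypothetical names, and `DefaultHasher` stands in for the real hash function, which the source does not show.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a delta row: a row id plus a signed weight
/// (+1 for an insert, -1 for a delete).
#[derive(Debug, Clone, PartialEq)]
pub struct DeltaRow {
    pub row_id: u64,
    pub weight: i64,
}

/// Illustrative row-id hash over a list of key columns.
/// `DefaultHasher` is an assumption made for this sketch only.
pub fn row_id_for(key_columns: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    key_columns.hash(&mut hasher);
    hasher.finish()
}

/// Sum the weights per row id and drop ids whose net weight is zero.
pub fn consolidate(rows: &[DeltaRow]) -> HashMap<u64, i64> {
    let mut net: HashMap<u64, i64> = HashMap::new();
    for row in rows {
        *net.entry(row.row_id).or_insert(0) += row.weight;
    }
    // A +1/-1 pair only disappears if both rows carried the same id.
    net.retain(|_, weight| *weight != 0);
    net
}
```

In this model, an insert and a delete with the same id consolidate to nothing, while the same pair under two different ids survives as two entries. That surviving pair is the kind of residue the phantom-row item describes.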

### Safety & Code Quality

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| SAF-1 | **Convert production `.unwrap()` in `sublinks.rs` to `?`.** 28 sites in `src/dvm/parser/sublinks.rs` (e.g. `from_list.head().unwrap()`, `get_ptr(i).unwrap()`) converted to `ok_or(PgTrickleError::UnsupportedPattern)` followed by `?`, so the failure propagates as an error instead of panicking. | 2d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.2 |
| SAF-2 | **Unsafe reduction half-pass.** Add `list_nth_safe<T>()` helper returning `Option<PgBox<T>>`; group repeated `pg_sys::*` FFI calls into safe façades in `src/dvm/parser/types.rs`. Target: ≥40% reduction in `unsafe` block count. | 1wk | [plans/safety/PLAN_REDUCED_UNSAFE.md](plans/safety/PLAN_REDUCED_UNSAFE.md) |
| SAF-3 | **`clippy::unwrap_used` lint gate.** Add `#![deny(clippy::unwrap_used)]` in `lib.rs` outside `#[cfg(test)]`, with `#[allow]` on justified invariant sites in `dag.rs`. | 1d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.2 |
| OP-6 | **Non-deterministic function warning / rejection.** Reject or warn at `create_stream_table` time if the query uses `now()`, `random()`, or volatile UDFs without an explicit `non_deterministic => true`. Pre-v1.0 safety gate. | S (2d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.6 |
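
The SAF-1 conversion follows a common Rust pattern: turn an `Option` into a `Result` with `ok_or`/`ok_or_else` and propagate it with `?`. The sketch below shows the shape of that change using plain slices. `SketchError` and `first_from_item` are hypothetical; the real `PgTrickleError::UnsupportedPattern` variant may carry a different payload, and the real call sites operate on PostgreSQL list nodes.

```rust
use std::fmt;

/// Hypothetical stand-in for the crate's error type.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    UnsupportedPattern(String),
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::UnsupportedPattern(msg) => write!(f, "unsupported pattern: {}", msg),
        }
    }
}

impl std::error::Error for SketchError {}

/// Before: `from_list.first().unwrap()` would panic on an empty list.
/// After: the empty case becomes an error the caller can report.
pub fn first_from_item<'a>(from_list: &[&'a str]) -> Result<&'a str, SketchError> {
    let head = from_list
        .first()
        .copied()
        .ok_or_else(|| SketchError::UnsupportedPattern("empty FROM list".to_string()))?;
    Ok(head)
}
```

With a non-empty list the function returns the first item. With an empty list it returns `Err(SketchError::UnsupportedPattern(..))` instead of panicking.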

### Test Coverage

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| TEST-1 | **Unit tests for `src/api/helpers.rs` (2.5k LOC).** 25+ unit tests covering query validation, schema helpers, and CDC orchestration utilities. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 |
| TEST-2 | **Unit tests for `src/api/diagnostics.rs` (1.5k LOC).** 15+ unit tests covering `explain_st`, `health_summary`, and `cache_stats` formatting logic. | 2d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 |
| TEST-3 | **Unit tests for `src/dvm/parser/rewrites.rs` (5.9k LOC).** 30+ unit tests covering each of the 7 rewrite passes: view inlining, DISTINCT ON, GROUPING SETS, scalar SSQ in WHERE, correlated SSQ in SELECT, SubLinks in OR, multi-PARTITION BY windows. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.1 |
| TEST-4 | **Parser fuzz target (`cargo-fuzz`).** Differential fuzz: feed random SQL to the pg_trickle parser and verify it never panics; compare accepted/rejected decisions against plain `SELECT`. Target: 1h of fuzzing with zero panics. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.2 |
| TEST-5 | **Crash-recovery bgworker resilience test.** `pg_ctl stop -m immediate` mid-refresh; verify: no unfinalised `pgt_refresh_history` entries, WAL decoder resumes from `confirmed_lsn`, change buffer is consistent. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §6.3 |
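
The "never panics" property behind TEST-4 can be approximated without `cargo-fuzz`, using only the standard library. The sketch below is not a fuzz target and is not coverage-guided. `parse_stub` is a hypothetical stand-in for the pg_trickle parser, and `XorShift` is a minimal pseudo-random generator included so the example needs no external crates.

```rust
use std::panic;

/// Hypothetical stand-in for the parser entry point: accepts input that
/// starts with SELECT (case-insensitive) and rejects everything else.
pub fn parse_stub(sql: &str) -> Result<(), String> {
    let trimmed = sql.trim_start();
    // `get` returns None instead of panicking on short input or on a
    // range that would split a multi-byte character.
    match trimmed.get(..6) {
        Some(prefix) if prefix.eq_ignore_ascii_case("select") => Ok(()),
        _ => Err(format!("unsupported input: {:?}", sql)),
    }
}

/// Tiny xorshift generator (shift triple 13, 7, 17).
pub struct XorShift(pub u64);

impl XorShift {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Feed `iterations` pseudo-random inputs to the parser and count panics.
pub fn count_panics(seed: u64, iterations: usize) -> usize {
    let mut rng = XorShift(seed.max(1)); // xorshift must not start at zero
    let mut panics = 0;
    for _ in 0..iterations {
        let len = (rng.next_u64() % 32) as usize;
        let bytes: Vec<u8> = (0..len).map(|_| (rng.next_u64() & 0xff) as u8).collect();
        // Invalid UTF-8 is replaced rather than rejected, so every input is a valid &str.
        let input = String::from_utf8_lossy(&bytes).into_owned();
        if panic::catch_unwind(|| parse_stub(&input)).is_err() {
            panics += 1;
        }
    }
    panics
}
```

A real `cargo-fuzz` target would let libFuzzer drive input generation, so it explores far more of the parser than uniform random bytes can.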

### Architecture

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| ARCH-1 | **Split `src/refresh.rs` (8.4k LOC) into 4 sub-modules.** `refresh/orchestrator.rs` (dispatch, status), `refresh/codegen.rs` (delta SQL generation), `refresh/phd1.rs` (PH-D1 phantom delete), `refresh/merge.rs` (MERGE strategy). Zero behaviour change — pure reorganisation. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.4 |
| ARCH-2 | **Recursive CTE fallback observability.** Log `NOTICE: falling back to FULL refresh — defining query contains WITH RECURSIVE`; expose `refresh_reason = 'recursive_cte_fallback'` tag in Prometheus metrics and `pgt_refresh_history`. | 1d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §2.5 |
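
The fallback tagging in ARCH-2 can be pictured as a small decision function. The sketch below is illustrative only: `RefreshMode`, `RefreshReason`, `plan_refresh`, and the `as_requested` tag are hypothetical. Only the `recursive_cte_fallback` tag and the fall-back-to-FULL behaviour come from the item above, and the sketch assumes that recursion detection has already happened elsewhere.

```rust
/// Refresh strategies referred to in this roadmap.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RefreshMode {
    Differential,
    Full,
}

/// Hypothetical reason tags; only `recursive_cte_fallback` appears in the source.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RefreshReason {
    AsRequested,
    RecursiveCteFallback,
}

impl RefreshReason {
    /// String form suitable for a metrics label or a history column.
    pub fn as_tag(self) -> &'static str {
        match self {
            RefreshReason::AsRequested => "as_requested", // placeholder tag
            RefreshReason::RecursiveCteFallback => "recursive_cte_fallback",
        }
    }
}

/// Choose the effective mode and record why it was chosen.
/// `has_recursive_cte` is assumed to be computed by the query analyser.
pub fn plan_refresh(requested: RefreshMode, has_recursive_cte: bool) -> (RefreshMode, RefreshReason) {
    match (requested, has_recursive_cte) {
        (RefreshMode::Differential, true) => {
            (RefreshMode::Full, RefreshReason::RecursiveCteFallback)
        }
        (mode, _) => (mode, RefreshReason::AsRequested),
    }
}
```

Returning the reason alongside the mode means one decision can feed both the log message and the metrics tag, so the two cannot drift apart.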

### Operational Features

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| OPS-1 | **Shadow/canary mode for `alter_stream_table`.** Optional `dry_run_shadow => true` parameter: materialises new query into `pgt_shadow_<name>` on the same schedule; `pgtrickle.canary_diff(name)` diffs against the live table. `pgtrickle.canary_promote(name)` atomically swaps. | 1wk | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.5 |
| OP-2 | **Prometheus HTTP endpoint in bgworker.** Tiny HTTP server (port configurable via `pg_trickle.metrics_port`) emitting all monitoring metrics in OpenMetrics format. Removes the "bring your own exporter" hurdle. | S (1wk) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.6 |
| OP-3 | **`pgtrickle.pause_all()` / `resume_all()` helpers.** Idempotent SQL wrappers for suspending all stream tables during maintenance (e.g. `pg_dump` of source tables). | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 |
| OP-4 | **`pgtrickle.refresh_if_stale(name, max_age)` convenience wrapper.** Application-level staleness gating without custom procedural code. | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 |
| OP-5 | **`pgtrickle.stream_table_definition(name)` helper.** Single-row fetch of original query, refresh mode, schedule, and status for auditing / blue-green migrations. | XS (1d) | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §9.1 |
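
For readers unfamiliar with the OpenMetrics text format mentioned in OP-2, the sketch below renders a single gauge family followed by the mandatory `# EOF` terminator. It is not the endpoint's implementation. The function names, the metric name `pgt_staleness_seconds`, and the `stream_table` label are all hypothetical, and the real metric set is not listed in this document.

```rust
/// Escape a label value for the OpenMetrics text format:
/// backslash, double quote, and newline are backslash-escaped.
pub fn escape_label_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            other => out.push(other),
        }
    }
    out
}

/// Render one gauge family with a single (hypothetical) `stream_table` label.
pub fn render_gauge(name: &str, samples: &[(&str, f64)]) -> String {
    let mut out = format!("# TYPE {} gauge\n", name);
    for (table, value) in samples {
        out.push_str(&format!(
            "{}{{stream_table=\"{}\"}} {}\n",
            name,
            escape_label_value(table),
            value
        ));
    }
    out
}

/// Concatenate metric families and append the `# EOF` terminator.
pub fn render_exposition(families: &[String]) -> String {
    let mut out = families.concat();
    out.push_str("# EOF\n");
    out
}
```

Calling `render_exposition(&[render_gauge("pgt_staleness_seconds", &[("orders", 2.5)])])` yields a `# TYPE` line, one sample line, and `# EOF`, each terminated by a newline.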

### Documentation

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DOC-1 | **Performance Tuning Cookbook.** New `docs/PERFORMANCE_COOKBOOK.md`: symptom → likely cause → GUC to tune → measurement rows. Consolidates advice from FAQ, TROUBLESHOOTING, SCALING, and BENCHMARK. | 3d | [PLAN_OVERALL_ASSESSMENT.md](plans/PLAN_OVERALL_ASSESSMENT.md) §7.2 |

### Implementation Phases

| Phase | Description | Duration |
|-------|-------------|----------|
| EC01 | EC-01 fix: Q15 stop-gap + `join.rs` hash + `refresh.rs` PH-D1 + TPC-H validation | Days 1–8 |
| SAF | Safety pass: `unwrap`→`?`, unsafe reduction, lint gate, volatile-fn warning | Days 9–15 |
| TEST | Unit test campaign (3 files) + fuzz target + resilience test | Days 16–24 |
| ARCH | `refresh.rs` split + recursive CTE observability | Days 25–29 |
| OPS+DOC | Shadow/canary mode + Prometheus endpoint + API helpers + Performance Cookbook | Days 30–40 |

> **v0.21.0 total: ~6–8 weeks** (EC-01 fix + safety hardening + API ergonomics + Prometheus endpoint + test coverage + module refactor + shadow mode + docs)

**Exit criteria:**
- [x] EC01-0: Q15 added to `IMMEDIATE_SKIP_ALLOWLIST` as stop-gap
- [x] EC01-1/EC01-2: `test_tpch_q07_ec01b_combined_delete` passes deterministically
- [x] EC01-3: Q07 removed from the DIFFERENTIAL skip list and Q15 removed from the IMMEDIATE skip allowlist
- [x] EC01-4: Multi-cycle phantom proptest passes 5,000 iterations
- [x] SAF-1: All 28 production `.unwrap()` sites in `sublinks.rs` converted to `?`
- [x] SAF-2: `unsafe` block count reduced by ≥40%
- [x] SAF-3: `clippy::unwrap_used` lint gate passes with zero violations in non-test code
- [x] OP-6: `create_stream_table` warns or rejects queries using `now()`, `random()`, or volatile UDFs without `non_deterministic => true`
- [x] TEST-1/2/3: ≥70 new unit tests across 3 previously-untested files
- [x] TEST-4: Fuzz target runs 1h with zero panics
- [x] TEST-5: Crash-recovery test passes deterministically
- [x] ARCH-1: `refresh.rs` split into 4 sub-modules; all existing tests pass unchanged
- [x] ARCH-2: `refresh_reason = 'recursive_cte_fallback'` visible in Prometheus metrics and `pgt_refresh_history`
- [x] OPS-1: `canary_diff()` / `canary_promote()` API functional with E2E tests
- [x] OP-2: Prometheus HTTP endpoint accessible at `pg_trickle.metrics_port`; all monitoring metrics present
- [x] OP-3: `pgtrickle.pause_all()` / `resume_all()` work idempotently; E2E test passes
- [x] OP-4: `pgtrickle.refresh_if_stale(name, max_age)` correctly gates refresh by age
- [x] OP-5: `pgtrickle.stream_table_definition(name)` returns accurate single-row result
- [x] DOC-1: `docs/PERFORMANCE_COOKBOOK.md` published
- [x] Extension upgrade path tested (`0.20.0 → 0.21.0`)
- [x] `just check-version-sync` passes


---