> **Plain-language companion:** [v0.43.0.md](v0.43.0.md)

## v0.43.0 — Performance Tunability, Diagnostics & Refresh Intelligence

**Status: Planned.** Derived from [plans/PLAN_OVERALL_ASSESSMENT_9.md](../plans/PLAN_OVERALL_ASSESSMENT_9.md) §Dimension 3 (Performance), §Dimension 5 (Feature Gaps), §Dimension 10 (CI).

> **Release Theme**
> Turn invisible performance cliffs into tunable, observable, benchmarked
> operations. Promote thresholds to GUCs, improve GROUP_RESCAN, add WAL CDC
> tunability, implement explain diagnostics, and gate performance in CI.

---

### Features

| ID | Title | Effort | Priority | Assessment ref |
|----|-------|--------|----------|----------------|
| A44-1 | Deep-join thresholds promoted to GUCs with diagnostics | M | P0 | PERF-01 |
| A44-2 | GROUP_RESCAN improvement: old/new UPDATE split | L | P0 | PERF-02 |
| A44-3 | WAL poll batch size and lag GUCs + metrics | M | P1 | PERF-03 |
| A44-4 | Cost-cache configurable capacity + hit/miss metrics | M | P1 | PERF-04 |
| A44-5 | Pool-vs-spawn benchmark comparison and auto-enable | M | P2 | PERF-05 |
| A44-6 | Write-amplification benchmarks and docs | M | P2 | PERF-06 |
| A44-7 | Mandatory microbenchmarks in CI | M | P1 | PERF-07, CI-04 |
| A44-8 | `pgtrickle.explain_stream_table(name)` diagnostics | L | P1 | FEAT-04 |
| A44-9 | WAL CDC per-source status and tunable GUCs | M | P1 | FEAT-05 |
| A44-10 | D+I change buffer schema refactor (breaking) | XL | P0 | #728, discussion #726 |
| A44-11 | D+I benchmark suite (write path, CB scan, multi-change, UNION ALL elimination) | M | P0 | #728 |

**A44-1 — Deep-join thresholds as GUCs.** Promote `PART3_MAX_SCAN_COUNT` and `DEEP_JOIN_L0_SCAN_THRESHOLD` from `src/dvm/operators/join.rs` to documented GUCs with sane defaults. When a query crosses these thresholds, emit a diagnostic event that includes the stream table name, the threshold crossed, the actual value, and a suggested action. Include EXPLAIN snippets in validation output for deep-join queries.
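As a sketch of the diagnostic shape A44-1 calls for: the GUC names, struct fields, and suggestion strings below are illustrative assumptions for this sketch, not the actual pg_trickle API.

```rust
// Hypothetical sketch of the A44-1 threshold diagnostic. The GUC names and
// the ThresholdDiagnostic shape are assumptions, not the pg_trickle API.

#[derive(Debug)]
struct ThresholdDiagnostic {
    stream_table: String,
    threshold: &'static str, // which GUC was crossed
    limit: u32,              // configured GUC value
    actual: u32,             // observed value
    suggestion: &'static str,
}

/// Compare observed join statistics against the (now GUC-backed) thresholds
/// and collect one diagnostic per crossing.
fn check_deep_join_thresholds(
    stream_table: &str,
    scan_count: u32,
    l0_scan_count: u32,
    max_scan_count: u32,    // assumed GUC: pg_trickle.part3_max_scan_count
    l0_scan_threshold: u32, // assumed GUC: pg_trickle.deep_join_l0_scan_threshold
) -> Vec<ThresholdDiagnostic> {
    let mut out = Vec::new();
    if scan_count > max_scan_count {
        out.push(ThresholdDiagnostic {
            stream_table: stream_table.to_string(),
            threshold: "pg_trickle.part3_max_scan_count",
            limit: max_scan_count,
            actual: scan_count,
            suggestion: "raise the GUC or reduce join depth",
        });
    }
    if l0_scan_count > l0_scan_threshold {
        out.push(ThresholdDiagnostic {
            stream_table: stream_table.to_string(),
            threshold: "pg_trickle.deep_join_l0_scan_threshold",
            limit: l0_scan_threshold,
            actual: l0_scan_count,
            suggestion: "raise the GUC or materialize an intermediate join",
        });
    }
    out
}

fn main() {
    let diags = check_deep_join_thresholds("orders_by_region", 12, 2, 8, 4);
    assert_eq!(diags.len(), 1); // only the scan-count threshold was crossed
    println!("{:?}", diags[0]);
}
```

The point of the sketch is that every diagnostic carries enough context (table, threshold, limit, actual, action) to be self-explanatory in logs.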
**A44-2 — GROUP_RESCAN improvement.** Implement the DI-2-style old/new UPDATE split for mutable aggregate arguments in `src/dvm/operators/aggregate.rs`. For aggregates that currently fall back to GROUP_RESCAN on UPDATE (SUM with a mutable CASE, MIN/MAX candidates), split the UPDATE into a retraction of the old value and an insertion of the new value, keeping delta processing O(affected rows) rather than O(group size). Track per-aggregate fallback mode in diagnostics and benchmarks.

**A44-3 — WAL poll batch and lag GUCs.** Expose `pg_trickle.wal_max_changes_per_poll` (default 10,000) and `pg_trickle.wal_max_lag_bytes` (default 65,536) as GUCs with bounds validation. Add metrics for poll duration, rows decoded per poll, lag bytes, and fallback reason to the Prometheus exposition.

**A44-4 — Cost-cache capacity.** Make `COST_CACHE_CAPACITY` configurable at `shared_preload_libraries` time via `pg_trickle.cost_cache_capacity` (default 256). Add hit/miss/collision counters exposed through `pgtrickle.scheduler_diagnostics()` or Prometheus metrics.

**A44-5 — Pool-vs-spawn benchmark.** Add a Criterion benchmark comparing spawn-per-task dynamic workers with persistent pool workers across a range of refresh frequencies. Document the crossover point and consider auto-enabling the pool above a DAG-size or refresh-rate threshold.

**A44-6 — Write-amplification benchmarks.** Add benchmark dimensions for UPDATE-heavy and keyless workloads that measure WAL volume, index writes, and vacuum pressure. Document expected buffer growth and recommended autovacuum settings for high-churn source tables.

**A44-7 — Mandatory microbenchmarks.** Add focused Criterion benchmarks for:

- Join codegen (template construction time by join depth)
- Placeholder resolution (substitution and validation time)
- Scan/aggregate delta SQL generation
- Scheduler DAG rebuild (incremental vs full)

Gate these in CI with the existing Criterion regression detection. Fail PRs that regress beyond the threshold.
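The old/new UPDATE split described under A44-2 can be sketched as follows. This is an illustration of the delta arithmetic only; the real change generates SQL in `src/dvm/operators/aggregate.rs`, and the `Update` struct and `case_arg` function here are invented for the example.

```rust
// Sketch of the A44-2 old/new UPDATE split for SUM over a mutable CASE
// argument. Names are illustrative assumptions; the real implementation
// emits SQL rather than computing deltas in Rust.

/// The mutable CASE argument: an order contributes only while it is 'open'.
fn case_arg(status: &str, amount: i64) -> i64 {
    if status == "open" { amount } else { 0 }
}

/// One UPDATE affecting a row inside the aggregate's group.
struct Update {
    old_status: &'static str,
    old_amount: i64,
    new_status: &'static str,
    new_amount: i64,
}

/// O(affected rows): each UPDATE is split into a retraction of the old value
/// and an insertion of the new value, instead of rescanning the whole group.
fn sum_delta(updates: &[Update]) -> i64 {
    updates
        .iter()
        .map(|u| case_arg(u.new_status, u.new_amount) - case_arg(u.old_status, u.old_amount))
        .sum()
}

fn main() {
    let ups = [
        // Crosses the CASE boundary: 'open' -> 'closed' retracts 100.
        Update { old_status: "open", old_amount: 100, new_status: "closed", new_amount: 100 },
        // Stays 'open': net +15.
        Update { old_status: "open", old_amount: 40, new_status: "open", new_amount: 55 },
    ];
    assert_eq!(sum_delta(&ups), -85);
    println!("group delta = {}", sum_delta(&ups));
}
```

The group's stored aggregate is then adjusted by the delta; no row outside the affected set is touched, which is exactly what GROUP_RESCAN fails to guarantee.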
**A44-8 — `explain_stream_table()` diagnostics.** Add `#[pg_extern(schema = "pgtrickle")] fn explain_stream_table(name: &str)` returning a JSON/table report that includes:

- Chosen DVM plan (operators, join strategy, aggregate mode)
- Fallback reasons (GROUP_RESCAN, deep-join snapshot, keyless explicit DML)
- Thresholds crossed and current values
- Estimated refresh complexity class
- Suggested tuning actions (GUC adjustments, query rewrites)

**A44-9 — WAL CDC per-source status.** Extend `pgtrickle.source_status()` or add a dedicated `pgtrickle.wal_source_status()` view that exposes, per source:

- CDC mode (trigger/WAL/transitioning)
- `blocked_reason` if not WAL-eligible
- Slot name and lag bytes
- Publication state
- Last decoder error (if any)
- Transition attempt history

**A44-10 — D+I change buffer schema refactor.** Replace the wide-model CB table schema with a native D+I decomposition.

Current schema: every source column is mirrored as `new_col` / `old_col`; an UPDATE is stored as one row carrying both halves; `diff_scan_change_buffer()` decomposes to D+I at read time via a 5-CTE UNION ALL pipeline (~600 lines).

Target schema: flat base columns (no prefix), `action ∈ {'I','D'}` only, with the UPDATE decomposed at write time in the CDC trigger. `diff_scan_change_buffer()` becomes a single-query scan with no UNION ALL, no column-prefix aliasing, no `old_pk_hash_expr` recomputation, and no ST-source special-case branch.

Changes required:

- `src/cdc.rs`: CB table DDL, `build_change_buffer_columns()`, `alter_change_buffer_add_columns()`, `sync_change_buffer_columns()`, row-level trigger body, statement-level trigger body, WAL decoder write path.

  **Statement-level UPDATE trigger** (ADR): use a single `INSERT INTO changes_ … SELECT … UNION ALL SELECT …` to emit both the D-row (from `__pgt_old JOIN __pgt_new`) and the I-row (from `__pgt_new JOIN __pgt_old`) in one executor pass, opening the CB heap relation once per UPDATE statement.
  This is strictly better for large batch UPDATEs. *Rejected alternative:* two separate `INSERT … SELECT` statements, which open the CB heap relation twice per UPDATE statement — acceptable for the row-level trigger (which cannot batch) but avoidable on the statement-level path. The row-level trigger itself emits two sequential `VALUES` inserts (one D, one I); no batching is possible there.

  **`change_id` ordering invariant:** the D-row must be emitted before the I-row so that `ORDER BY change_id` in the scan pipeline correctly identifies the D-row as `FIRST_VALUE` and the I-row as `LAST_VALUE` for a lone UPDATE. PL/pgSQL executes sequential INSERT statements within a single trigger invocation in order, and BIGSERIAL with CACHE 1 issues values monotonically within the transaction; the invariant holds by construction but must be preserved if the trigger body is ever restructured. Add an inline comment (`-- D-row must be emitted before I-row`) to the generated trigger body.

- `src/dvm/operators/scan.rs`: `diff_scan_change_buffer()` — collapse to a single-query scan; remove the `is_st_source` branch, `old_col_prefix`, `old_pk_hash_expr`, and the 5-CTE pipeline.

- `src/dvm/operators/aggregate.rs`: the **P5 direct aggregate delta** (`generate_direct_agg_delta()`) bypasses `diff_scan_change_buffer()` and reads directly from the CB with `c."new_{col}"` / `c."old_{col}"` references and `c.action = 'U'`. It must be rewritten to reference flat `c."{col}"` via the D+I rows. The 'V' (value-only) net-correction optimization (A-2) is **dropped**: in D+I, the D-row and I-row naturally self-cancel on counts, and the aggregate SUM correctly computes the net delta from the two separate rows. The code simplification outweighs the ~2× row volume entering the GROUP BY for value-only UPDATEs (which is already O(delta) and small). If benchmarks prove otherwise, re-introduce it as a post-D+I optimization pass.
- `src/dvm/operators/recursive_cte.rs`: `generate_change_buffer_from()` and `generate_old_seed_from_existing()` directly reference CB columns using `c."new_{col}"` / `c.action IN ('I','U')` patterns. Rewrite to flat column names and `c.action = 'I'` / `c.action = 'D'` respectively.

- `src/wal_decoder.rs`: UPDATE → D+I decomposition at decode time. `build_pk_hash_from_values()` must be called with `old_parsed` for the D-row (OLD pk_hash) and `parsed` for the I-row (NEW pk_hash). With REPLICA IDENTITY DEFAULT, `old_parsed` contains only PK columns, so non-PK columns in the D-row will be NULL. This is acceptable (scan operators only need pk_hash from D-rows for identity) **but only when the P5 direct aggregate path is not used** for WAL-sourced tables. **REPLICA IDENTITY FULL is a hard requirement** when a WAL-decoded source feeds a P5-eligible aggregate ST. Document this explicitly and emit a WARNING during `create_stream_table()` if the source uses WAL CDC and has REPLICA IDENTITY DEFAULT with a P5-eligible defining query.

  **Atomicity risk (new in D+I):** `write_decoded_change()` currently emits one `Spi::run()` per UPDATE. In D+I it must emit two (the D-row, then the I-row). These are separate SPI calls and are **not atomic**: if the background worker crashes or is killed between them, the D-row is committed without its I-row, leaving a dangling DELETE in the CB that permanently removes the pre-update row from every subsequent differential refresh without a compensating INSERT. The wide model has no equivalent risk (one call = one complete UPDATE row). **Mitigation:** use a single multi-row VALUES insert, `INSERT INTO changes_ (...) VALUES (...), (...)`, so both rows are written in one SPI call and one heap operation. This is the WAL decoder's equivalent of the statement-level trigger's UNION ALL choice.

- `src/cdc.rs` `sync_change_buffer_columns()`: the orphan-column cleanup builds `expected_data_cols` as `{"new_col1", "old_col1", ...}`.
  Post-D+I it must produce `{"col1", "col2", ...}`. **Migration safety:** the function must detect the current CB schema version before computing the expected set. Running the D+I-era logic against a pre-migration wide-schema buffer would silently drop ALL user data columns. Guard with a schema-version check (presence of `new_*`-prefixed columns → old schema, flat columns → new schema). During the feature-flag shadow period, both formats may coexist per source.

- `changed_cols` WB-1 (ADR): both the D-row and the I-row of an UPDATE pair carry the **same** VARBIT bitmask (`NULL` = genuine INSERT or DELETE, not an UPDATE half). The scan filter remains at read time, unchanged in structure — it tests `changed_cols IS NULL OR (changed_cols & mask) != zero`. This preserves full decoupling from view definitions (a single CB table can serve many STs, each with a different column mask), avoids write-time coupling to the query definition, and requires no correlation between D and I rows. The CDC trigger computes the bitmask once from OLD/NEW and writes it to both rows.

  *Rejected alternatives:* (1) write-time filtering — requires the trigger to know the union of all downstream referenced columns, coupling trigger rebuilds to view DDL changes; (2) bitmask on the I-row only — requires D/I correlation at read time to suppress phantom deletes, reintroducing the complexity D+I eliminates.

- Migration: rebuild all `changes_` tables and regenerate all CDC triggers during upgrade; discard in-flight buffer rows.

This is a **breaking schema change** and must be treated as a major version bump. Upgrade tests must cover the `changes_` rebuild path end-to-end.

**A44-11 — D+I benchmark suite.** The existing `bench_diff_scan` in `benches/diff_operators.rs` measures only Rust SQL-generation time (template construction), not PostgreSQL query execution time. A44-11 adds execution benchmarks to `benches/refresh_bench.rs` that measure the actual impact of the schema change:

1.
**Write path — UPDATE throughput.** Criterion benchmark: `INSERT INTO changes_` rate (rows/sec) under the row-level trigger, at row counts 1K / 10K / 100K and column widths 5 / 20 / 50. Expected result: a 5–10% regression per UPDATE row (one extra heap insert plus index maintenance). This establishes the precise cost of the trade-off.
2. **CB scan time — delta refresh latency.** E2E bench: `refresh_stream_table()` wall time with N rows in the change buffer, on a 50-column wide table, under the old wide schema vs the D+I schema. Row counts: 1K / 10K / 100K. Expected: a 20–40% improvement from the halved table width and single-pass scan.
3. **Multi-change collapse — window function elimination.** E2E bench: a workload where ≥50% of PKs have 2+ changes per refresh window (INSERT+UPDATE, UPDATE+DELETE, multiple UPDATEs). This stresses the `FIRST_VALUE`/`LAST_VALUE` path that D+I replaces. Expected: the largest improvement of all scenarios.
4. **UNION ALL elimination — buffer hit reduction.** Capture `EXPLAIN (ANALYZE, BUFFERS)` output for `diff_scan_change_buffer()` before and after migration; assert that the `shared_hit` count drops by ≥40% (reflecting elimination of the double CB scan).

All four benchmarks must run in CI as part of the existing Criterion regression gate. The write-path regression (benchmark 1) is expected and documented; the CI gate must be adjusted so it does not block PRs on that specific benchmark while still catching unexpected regressions on benchmarks 2–4.
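The A44-10 write-time decomposition and its `change_id` ordering invariant can be illustrated with a small sketch. The types and the in-Rust sequence counter are assumptions for the example; the real write paths are the generated PL/pgSQL trigger body and `src/wal_decoder.rs`.

```rust
// Sketch of UPDATE -> D+I decomposition at write time. CbRow and the u64
// sequence stand in for the real CB table and its BIGSERIAL change_id.

#[derive(Debug, Clone, PartialEq)]
enum Action { I, D }

#[derive(Debug, Clone, PartialEq)]
struct CbRow {
    change_id: u64,
    action: Action,
    pk_hash: u64,
    // None models NULL non-PK columns on a D-row under REPLICA IDENTITY DEFAULT.
    cols: Vec<Option<i64>>,
}

/// Decompose one UPDATE into a D-row (OLD image) followed by an I-row (NEW
/// image). The D-row MUST receive the smaller change_id so that ORDER BY
/// change_id sees it as FIRST_VALUE and the I-row as LAST_VALUE.
fn decompose_update(
    next_change_id: &mut u64,
    old_pk_hash: u64, old_cols: Vec<Option<i64>>,
    new_pk_hash: u64, new_cols: Vec<Option<i64>>,
) -> [CbRow; 2] {
    *next_change_id += 1;
    let d = CbRow { change_id: *next_change_id, action: Action::D, pk_hash: old_pk_hash, cols: old_cols };
    *next_change_id += 1;
    let i = CbRow { change_id: *next_change_id, action: Action::I, pk_hash: new_pk_hash, cols: new_cols };
    [d, i] // D-row first: the ordering invariant
}

fn main() {
    let mut seq = 0u64;
    // PK-changing UPDATE: the D-row carries the OLD pk_hash, the I-row the NEW one.
    let [d, i] = decompose_update(&mut seq, 11, vec![Some(1), Some(2)], 42, vec![Some(1), Some(3)]);
    assert!(d.change_id < i.change_id); // ORDER BY change_id sees D first
    assert_eq!(d.action, Action::D);
    assert_eq!(d.pk_hash, 11);
    assert_eq!(i.action, Action::I);
    assert_eq!(i.pk_hash, 42);
}
```

Both rows should reach the CB in one write (the UNION ALL insert in the trigger, the multi-row VALUES insert in the WAL decoder) so the pair is atomic.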
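The multi-change collapse that benchmark 3 stresses reduces, under D+I, to a simple net-effect rule per `pk_hash`. A hedged sketch follows; the `Action` enum and the chain-of-actions representation are illustrative, and in SQL this corresponds to the `FIRST_VALUE`/`LAST_VALUE` logic over `ORDER BY change_id`.

```rust
// Net-effect collapse per pk_hash under D+I: only the boundary rows of an
// ordered change chain matter; every intermediate row cancels.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action { I, D }

/// Net effect of one pk_hash's ordered change chain within a refresh window:
/// keep a leading D-row (the row existed before the window) and a trailing
/// I-row (the row exists after it).
fn net_effect(chain: &[Action]) -> Vec<Action> {
    let mut out = Vec::new();
    if chain.first() == Some(&Action::D) {
        out.push(Action::D);
    }
    if chain.last() == Some(&Action::I) {
        out.push(Action::I);
    }
    out
}

fn main() {
    use Action::*;
    // UPDATE then UPDATE: D I D I -> one net retraction plus one net insert.
    assert_eq!(net_effect(&[D, I, D, I]), vec![D, I]);
    // INSERT, UPDATE, DELETE: I D I D -> no net change at all.
    assert!(net_effect(&[I, D, I, D]).is_empty());
    // Lone INSERT survives unchanged.
    assert_eq!(net_effect(&[I]), vec![I]);
}
```

This is also the shape `compact_change_buffer()` must preserve under D+I: middle-row elimination leaves exactly the leading-D/trailing-I net pair.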
### Test Coverage

| ID | Title | Effort | Priority | Assessment ref |
|----|-------|--------|----------|----------------|
| T-A44-1 | Deep-join threshold GUC behavior tests | M | P1 | PERF-01 |
| T-A44-2 | GROUP_RESCAN improvement correctness tests | L | P0 | PERF-02 |
| T-A44-3 | explain_stream_table output accuracy | M | P1 | FEAT-04 |
| T-A44-4 | WAL per-source status accuracy | M | P1 | FEAT-05 |
| T-A44-5 | Microbenchmark CI gate validation | S | P1 | PERF-07 |
| T-A44-10 | D+I CB schema: CDC correctness (I/U/D), delta scan accuracy, upgrade path | XL | P0 | #728 |
| T-A44-11 | D+I benchmark suite: write path, CB scan, multi-change, UNION ALL | M | P0 | #728 |

**T-A44-1.** E2E tests that set the deep-join threshold GUCs to low values, create deep-join stream tables, and verify diagnostics are emitted when the thresholds are crossed.

**T-A44-2.** E2E tests for the old/new UPDATE split: create a stream table with `SUM(CASE ...)`, perform UPDATEs that cross the CASE boundary, and verify the incremental results match a full query result. Benchmark the improvement against the GROUP_RESCAN baseline.

**T-A44-3.** Test that `explain_stream_table` returns accurate plan information for various query patterns: simple scan, multi-join, aggregates with fallback, deep joins, and keyless sources.

**T-A44-4.** Test that WAL per-source status correctly reflects trigger mode, WAL mode, transition in progress, blocked reasons, and slot lag after various CDC lifecycle events.

**T-A44-5.** Verify that the microbenchmark Criterion baseline is established on push to main and that regression detection works on PRs.
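The WB-1 `changed_cols` filter exercised by T-A44-10 can be sketched with integer bitmasks standing in for VARBIT (an assumption made for brevity; column indices and ST names are invented).

```rust
// Sketch of the changed_cols read-time filter (WB-1), with u64 bitmasks
// standing in for VARBIT. Both halves of an UPDATE carry the same mask;
// None models the NULL that marks a genuine INSERT or DELETE.

/// A CB row passes an ST's filter when it is a genuine I/D row, or when the
/// UPDATE touched at least one column that the ST's defining query reads.
fn passes_filter(changed_cols: Option<u64>, st_mask: u64) -> bool {
    match changed_cols {
        None => true,                      // genuine INSERT or DELETE
        Some(bits) => bits & st_mask != 0, // UPDATE half: column overlap test
    }
}

fn main() {
    let update_mask = 0b0010; // the UPDATE changed only column 1
    let st_a = 0b0011;        // ST A reads columns 0 and 1
    let st_b = 0b0100;        // ST B reads only column 2

    // Multi-ST fan-out from one CB table: same row, different verdicts.
    assert!(passes_filter(Some(update_mask), st_a));
    assert!(!passes_filter(Some(update_mask), st_b));
    // DELETEs and INSERTs always pass, for every ST.
    assert!(passes_filter(None, st_b));
}
```

Because the mask is written once and tested per ST at read time, one CB table serves any number of STs without the trigger knowing their view definitions.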
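The `sync_change_buffer_columns()` migration guard that T-A44-10 verifies can be sketched as follows. The enum and inference rule mirror the guard described under A44-10 (presence of `new_*`/`old_*` prefix columns → wide schema); the function names are otherwise illustrative.

```rust
// Sketch of the schema-version guard: infer WIDE vs D+I from column names
// before computing the expected data-column set for orphan cleanup.
// (A44-10 prefers an explicit cb_schema_version flag; this inference is the
// fallback described for the migration guard.)

#[derive(Debug, PartialEq)]
enum CbSchema { Wide, Di }

fn detect_cb_schema(cols: &[&str]) -> CbSchema {
    if cols.iter().any(|c| c.starts_with("new_") || c.starts_with("old_")) {
        CbSchema::Wide
    } else {
        CbSchema::Di
    }
}

/// Expected data columns per schema version. Running the D+I expectation
/// against a wide buffer would mark every new_*/old_* column as an orphan
/// and silently drop all user data columns.
fn expected_data_cols(schema: &CbSchema, source_cols: &[&str]) -> Vec<String> {
    match schema {
        CbSchema::Wide => source_cols
            .iter()
            .flat_map(|c| [format!("new_{c}"), format!("old_{c}")])
            .collect(),
        CbSchema::Di => source_cols.iter().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    let wide_cb = ["new_a", "old_a", "new_b", "old_b"];
    assert_eq!(detect_cb_schema(&wide_cb), CbSchema::Wide);
    assert_eq!(expected_data_cols(&CbSchema::Wide, &["a"]), vec!["new_a", "old_a"]);
    assert_eq!(detect_cb_schema(&["a", "b"]), CbSchema::Di);
    assert_eq!(expected_data_cols(&CbSchema::Di, &["a", "b"]), vec!["a", "b"]);
}
```

During the shadow period the detected (or flagged) version selects which expected set the orphan-drop logic uses, so a wide-schema buffer is never cleaned with D+I assumptions.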
**T-A44-10.** E2E tests covering the full CDC lifecycle under the D+I schema: INSERT, UPDATE, and DELETE correctness; net-effect idempotency (multiple UPDATEs to the same row); keyless table path; ST-source path; WAL decoder path; `changed_cols` symmetric bitmask (D-row and I-row both carry the same VARBIT, NULL for a genuine INSERT/DELETE); multi-ST fan-out (the same CB served by two STs with different column masks); **P5 direct aggregate delta** (SUM, COUNT over a single-source table, UPDATE correctness without the 'V' row); **recursive CTE** delta seeding from a D+I CB; **WAL decoder with REPLICA IDENTITY DEFAULT** (verify the P5 path is disabled or falls back); **PK-changing UPDATE** (D-row carries the OLD pk_hash, I-row carries the NEW pk_hash); **`sync_change_buffer_columns()` migration guard** (verify a wide-schema CB is not corrupted by running D+I-era orphan-drop logic); **`compact_change_buffer()` with a D+I CB** (INSERT+UPDATE+DELETE chains, verify they are not incorrectly discarded as no-ops, verify middle-row elimination leaves the correct D₁/I₂ net pair). Upgrade path tests: an existing stream table with an old wide-schema buffer successfully migrates to D+I on `ALTER EXTENSION pg_trickle UPDATE`.

**T-A44-11.** Run all four benchmark scenarios (write path, CB scan, multi-change, UNION ALL elimination) against both the wide-schema baseline and the D+I schema. Assert:

- Write path: regression ≤15% per UPDATE (documented and accepted trade-off)
- CB scan: improvement ≥20% for a 50-column table at 10K+ buffer rows
- Multi-change: improvement ≥30% when ≥50% of PKs have 2+ changes
- UNION ALL: `shared_hit` in `EXPLAIN BUFFERS` drops ≥40%

Results must be committed as Criterion baselines and reviewed in the PR.

### Conflicts & Risks

- **A44-2** (GROUP_RESCAN improvement) is a core DVM operator change that affects correctness. It must be accompanied by extensive differential tests comparing DIFF vs FULL results.
- **A44-1** (GUC promotion) changes runtime behavior for users who previously relied on hardcoded defaults.
  Ensure the defaults match the current constants.
- **A44-8** (explain) must not become a maintenance burden — derive explanations from the same metadata the DVM uses, not from separate documentation strings.
- **A44-10** (D+I refactor) is the highest-risk item: it rewrites the CB table schema, the CDC trigger bodies, and the `diff_scan_change_buffer()` read path in one migration. Risk mitigation: implement behind a feature flag first; run shadow mode (old and new CB tables simultaneously) for one release cycle; gate on `just test-upgrade-all` passing before merging.

  **Feature-flag CB schema version:** during the shadow period, different sources may have different CB formats. `diff_scan_change_buffer()` and the P5/recursive paths need to know which format a given CB is in. Use a per-source boolean or enum in `pgtrickle.pgt_stream_tables` (e.g. `cb_schema_version TEXT DEFAULT 'WIDE'`, set to `'DI'` after migration) rather than inferring the format at runtime by inspecting column names.

  Additional high-risk surface areas identified:

  - The `aggregate.rs` P5 direct aggregate path reads CB columns directly (bypassing `diff_scan_change_buffer()`); it requires a full rewrite of column references and removal of the 'V' value-only row optimization.
  - `recursive_cte.rs` directly references `c."new_{col}"` and `action IN ('I','U')` in seed generation — it will emit empty results post-D+I if not updated.
  - The `sync_change_buffer_columns()` orphan-drop logic will destroy wide-schema columns if run under D+I assumptions. It must be version-gated.
  - The WAL decoder with REPLICA IDENTITY DEFAULT produces NULL non-PK columns on D-rows; any operator reading value columns from D-rows (e.g. the P5 aggregate) will compute wrong results. Hard-require REPLICA IDENTITY FULL for WAL sources feeding P5-eligible STs.
  - WAL decoder D+I atomicity: two separate `Spi::run()` calls for the D-row and I-row are not atomic — a crash between them leaves a dangling DELETE. Mitigate with a single multi-row `INSERT INTO ... VALUES (...), (...)`.
    This risk does not exist in the wide model (one SPI call per UPDATE).
- **A44-11** (benchmark suite) must be implemented *before* A44-10 merges so that the wide-schema baseline is captured on the same hardware. Running the benchmarks only after the migration removes the ability to compare.

  **Baseline capture discipline:** the `EXPLAIN (ANALYZE, BUFFERS)` baseline and the Criterion baseline must be captured against the *exact same fixture data* as the post-migration run. Capture the baseline as the **first step** of the A44-10 implementation, before any Rust code changes. Any change to the benchmark fixture between the baseline and the post-migration run invalidates the comparison.

### Exit Criteria

- [ ] A44-1: Deep-join thresholds are GUC-tunable and emit diagnostics
- [ ] A44-2: GROUP_RESCAN cases improved with the old/new split, verified by benchmark
- [ ] A44-3: WAL batch/lag GUCs functional with metrics
- [ ] A44-4: Cost-cache capacity configurable, metrics exposed
- [ ] A44-7: Mandatory microbenchmarks gate PRs
- [ ] A44-8: `explain_stream_table` returns accurate plan information
- [ ] A44-9: WAL per-source status view functional
- [ ] A44-10: D+I CB schema live; `action` column contains only 'I'/'D'; `diff_scan_change_buffer()` contains no UNION ALL; upgrade path tested end-to-end
- [ ] A44-11: Benchmark suite implemented; baseline captured against the wide schema; post-migration results committed; write-path regression within documented bounds; CB scan and multi-change improvements confirmed
- [ ] Extension upgrade path tested (`0.43.0 → 0.44.0`)
- [ ] `just lint` passes with zero warnings
- [ ] `just test-all` passes