> **Plain-language companion:** [v0.43.0.md](v0.43.0.md)

## v0.43.0 — Performance Tunability, Diagnostics & Refresh Intelligence

**Status: Planned.** Derived from [plans/PLAN_OVERALL_ASSESSMENT_9.md](../plans/PLAN_OVERALL_ASSESSMENT_9.md) §Dimension 3 (Performance), §Dimension 5 (Feature Gaps), §Dimension 10 (CI).

> **Release Theme**
> Turn invisible performance cliffs into tunable, observable, benchmarked
> operations. Promote thresholds to GUCs, improve GROUP_RESCAN, add WAL CDC
> tunability, implement explain diagnostics, and gate performance in CI.

---

### Features

| ID | Title | Effort | Priority | Assessment ref |
|----|-------|--------|----------|----------------|
| A44-1 | Deep-join thresholds promoted to GUCs with diagnostics | M | P0 | PERF-01 |
| A44-2 | GROUP_RESCAN improvement: old/new UPDATE split | L | P0 | PERF-02 |
| A44-3 | WAL poll batch size and lag GUCs + metrics | M | P1 | PERF-03 |
| A44-4 | Cost-cache configurable capacity + hit/miss metrics | M | P1 | PERF-04 |
| A44-5 | Pool-vs-spawn benchmark comparison and auto-enable | M | P2 | PERF-05 |
| A44-6 | Write-amplification benchmarks and docs | M | P2 | PERF-06 |
| A44-7 | Mandatory microbenchmarks in CI | M | P1 | PERF-07, CI-04 |
| A44-8 | `pgtrickle.explain_stream_table(name)` diagnostics | L | P1 | FEAT-04 |
| A44-9 | WAL CDC per-source status and tunable GUCs | M | P1 | FEAT-05 |
| A44-10 | D+I change buffer schema refactor (breaking) | XL | P0 | #728, discussion #726 |
| A44-11 | D+I benchmark suite (write path, CB scan, multi-change, UNION ALL elimination) | M | P0 | #728 |

**A44-1 — Deep-join thresholds as GUCs.** Promote `PART3_MAX_SCAN_COUNT` and `DEEP_JOIN_L0_SCAN_THRESHOLD` from `src/dvm/operators/join.rs` to documented GUCs with sane defaults. When a query crosses these thresholds, emit a diagnostic event that includes the stream table name, the threshold crossed, the actual value, and a suggested action. Include EXPLAIN snippets in validation output for deep-join queries.
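As a sketch of the diagnostic shape A44-1 calls for: the GUC names, struct fields, and suggestion strings below are illustrative assumptions for this sketch, not the actual pg_trickle API.

```rust
// Hypothetical sketch of the A44-1 threshold diagnostic. The GUC names and
// the ThresholdDiagnostic shape are assumptions, not the pg_trickle API.

#[derive(Debug)]
struct ThresholdDiagnostic {
    stream_table: String,
    threshold: &'static str, // which GUC was crossed
    limit: u32,              // configured GUC value
    actual: u32,             // observed value
    suggestion: &'static str,
}

/// Compare observed join statistics against the (now GUC-backed) thresholds
/// and collect one diagnostic per crossing.
fn check_deep_join_thresholds(
    stream_table: &str,
    scan_count: u32,
    l0_scan_count: u32,
    max_scan_count: u32,    // assumed GUC: pg_trickle.part3_max_scan_count
    l0_scan_threshold: u32, // assumed GUC: pg_trickle.deep_join_l0_scan_threshold
) -> Vec<ThresholdDiagnostic> {
    let mut out = Vec::new();
    if scan_count > max_scan_count {
        out.push(ThresholdDiagnostic {
            stream_table: stream_table.to_string(),
            threshold: "pg_trickle.part3_max_scan_count",
            limit: max_scan_count,
            actual: scan_count,
            suggestion: "raise the GUC or reduce join depth",
        });
    }
    if l0_scan_count > l0_scan_threshold {
        out.push(ThresholdDiagnostic {
            stream_table: stream_table.to_string(),
            threshold: "pg_trickle.deep_join_l0_scan_threshold",
            limit: l0_scan_threshold,
            actual: l0_scan_count,
            suggestion: "raise the GUC or materialize an intermediate join",
        });
    }
    out
}

fn main() {
    let diags = check_deep_join_thresholds("orders_by_region", 12, 2, 8, 4);
    assert_eq!(diags.len(), 1); // only the scan-count threshold was crossed
    println!("{:?}", diags[0]);
}
```

The point of the sketch is that every diagnostic carries enough context (table, threshold, limit, actual, action) to be self-explanatory in logs.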
**A44-2 — GROUP_RESCAN improvement.** Implement the DI-2-style old/new UPDATE split for mutable aggregate arguments in `src/dvm/operators/aggregate.rs`. For aggregates that currently fall back to GROUP_RESCAN on UPDATE (SUM with a mutable CASE, MIN/MAX candidates), split the UPDATE into a retraction of the old value and an insertion of the new value, keeping delta processing O(affected rows) rather than O(group size). Track per-aggregate fallback mode in diagnostics and benchmarks.

**A44-3 — WAL poll batch and lag GUCs.** Expose `pg_trickle.wal_max_changes_per_poll` (default 10,000) and `pg_trickle.wal_max_lag_bytes` (default 65,536) as GUCs with bounds validation. Add metrics for poll duration, rows decoded per poll, lag bytes, and fallback reason to the Prometheus exposition.

**A44-4 — Cost-cache capacity.** Make `COST_CACHE_CAPACITY` configurable at `shared_preload_libraries` time via `pg_trickle.cost_cache_capacity` (default 256). Add hit/miss/collision counters exposed through `pgtrickle.scheduler_diagnostics()` or Prometheus metrics.

**A44-5 — Pool-vs-spawn benchmark.** Add a Criterion benchmark comparing spawn-per-task dynamic workers with persistent pool workers across a range of refresh frequencies. Document the crossover point and consider auto-enabling the pool above a DAG-size or refresh-rate threshold.

**A44-6 — Write-amplification benchmarks.** Add benchmark dimensions for UPDATE-heavy and keyless workloads that measure WAL volume, index writes, and vacuum pressure. Document expected buffer growth and recommended autovacuum settings for high-churn source tables.

**A44-7 — Mandatory microbenchmarks.** Add focused Criterion benchmarks for:

- Join codegen (template construction time by join depth)
- Placeholder resolution (substitution and validation time)
- Scan/aggregate delta SQL generation
- Scheduler DAG rebuild (incremental vs full)

Gate these in CI with the existing Criterion regression detection. Fail PRs that regress beyond the threshold.
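The old/new UPDATE split described under A44-2 can be sketched as follows. This is an illustration of the delta arithmetic only; the real change generates SQL in `src/dvm/operators/aggregate.rs`, and the `Update` struct and `case_arg` function here are invented for the example.

```rust
// Sketch of the A44-2 old/new UPDATE split for SUM over a mutable CASE
// argument. Names are illustrative assumptions; the real implementation
// emits SQL rather than computing deltas in Rust.

/// The mutable CASE argument: an order contributes only while it is 'open'.
fn case_arg(status: &str, amount: i64) -> i64 {
    if status == "open" { amount } else { 0 }
}

/// One UPDATE affecting a row inside the aggregate's group.
struct Update {
    old_status: &'static str,
    old_amount: i64,
    new_status: &'static str,
    new_amount: i64,
}

/// O(affected rows): each UPDATE is split into a retraction of the old value
/// and an insertion of the new value, instead of rescanning the whole group.
fn sum_delta(updates: &[Update]) -> i64 {
    updates
        .iter()
        .map(|u| case_arg(u.new_status, u.new_amount) - case_arg(u.old_status, u.old_amount))
        .sum()
}

fn main() {
    let ups = [
        // Crosses the CASE boundary: 'open' -> 'closed' retracts 100.
        Update { old_status: "open", old_amount: 100, new_status: "closed", new_amount: 100 },
        // Stays 'open': net +15.
        Update { old_status: "open", old_amount: 40, new_status: "open", new_amount: 55 },
    ];
    assert_eq!(sum_delta(&ups), -85);
    println!("group delta = {}", sum_delta(&ups));
}
```

The group's stored aggregate is then adjusted by the delta; no row outside the affected set is touched, which is exactly what GROUP_RESCAN fails to guarantee.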
**A44-8 — `explain_stream_table()` diagnostics.** Add `#[pg_extern(schema = "pgtrickle")] fn explain_stream_table(name: &str)` returning a JSON/table report that includes:

- Chosen DVM plan (operators, join strategy, aggregate mode)
- Fallback reasons (GROUP_RESCAN, deep-join snapshot, keyless explicit DML)
- Thresholds crossed and current values
- Estimated refresh complexity class
- Suggested tuning actions (GUC adjustments, query rewrites)

**A44-9 — WAL CDC per-source status.** Extend `pgtrickle.source_status()` or add a dedicated `pgtrickle.wal_source_status()` view that exposes, per source:

- CDC mode (trigger/WAL/transitioning)
- `blocked_reason` if not WAL-eligible
- Slot name and lag bytes
- Publication state
- Last decoder error (if any)
- Transition attempt history

**A44-10 — D+I change buffer schema refactor.** Replace the wide-model CB table schema with a native D+I decomposition.

Current schema: every source column is mirrored as `new_col` / `old_col`; an UPDATE is stored as one row carrying both halves; `diff_scan_change_buffer()` decomposes to D+I at read time via a 5-CTE UNION ALL pipeline (~600 lines).

Target schema: flat base columns (no prefix), `action ∈ {'I','D'}` only, with the UPDATE decomposed at write time in the CDC trigger. `diff_scan_change_buffer()` becomes a single-query scan with no UNION ALL, no column-prefix aliasing, no `old_pk_hash_expr` recomputation, and no ST-source special-case branch.

Changes required:

- `src/cdc.rs`: CB table DDL, `build_change_buffer_columns()`, `alter_change_buffer_add_columns()`, `sync_change_buffer_columns()`, row-level trigger body, statement-level trigger body, WAL decoder write path.

  **Statement-level UPDATE trigger** (ADR): use a single `INSERT INTO changes_ … SELECT … UNION ALL SELECT …` to emit both the D-row (from `__pgt_old JOIN __pgt_new`) and the I-row (from `__pgt_new JOIN __pgt_old`) in one executor pass, opening the CB heap relation once per UPDATE statement.
  This is strictly better for large batch UPDATEs. *Rejected alternative:* two separate `INSERT … SELECT` statements, which open the CB heap relation twice per UPDATE statement — acceptable for the row-level trigger (which cannot batch) but avoidable on the statement-level path. The row-level trigger itself emits two sequential `VALUES` inserts (one D, one I); no batching is possible there.

  **`change_id` ordering invariant:** the D-row must be emitted before the I-row so that `ORDER BY change_id` in the scan pipeline correctly identifies the D-row as `FIRST_VALUE` and the I-row as `LAST_VALUE` for a lone UPDATE. PL/pgSQL executes sequential INSERT statements within a single trigger invocation in order, and BIGSERIAL with CACHE 1 issues values monotonically within the transaction; the invariant holds by construction but must be preserved if the trigger body is ever restructured. Add an inline comment (`-- D-row must be emitted before I-row`) to the generated trigger body.

- `src/dvm/operators/scan.rs`: `diff_scan_change_buffer()` — collapse to a single-query scan; remove the `is_st_source` branch, `old_col_prefix`, `old_pk_hash_expr`, and the 5-CTE pipeline.

- `src/dvm/operators/aggregate.rs`: the **P5 direct aggregate delta** (`generate_direct_agg_delta()`) bypasses `diff_scan_change_buffer()` and reads directly from the CB with `c."new_{col}"` / `c."old_{col}"` references and `c.action = 'U'`. It must be rewritten to reference flat `c."{col}"` via the D+I rows. The 'V' (value-only) net-correction optimization (A-2) is **dropped**: in D+I, the D-row and I-row naturally self-cancel on counts, and the aggregate SUM correctly computes the net delta from the two separate rows. The code simplification outweighs the ~2× row volume entering the GROUP BY for value-only UPDATEs (which is already O(delta) and small). If benchmarks prove otherwise, re-introduce it as a post-D+I optimization pass.
- `src/dvm/operators/recursive_cte.rs`: `generate_change_buffer_from()` and `generate_old_seed_from_existing()` directly reference CB columns using `c."new_{col}"` / `c.action IN ('I','U')` patterns. Rewrite to flat column names and `c.action = 'I'` / `c.action = 'D'` respectively.

- `src/wal_decoder.rs`: UPDATE → D+I decomposition at decode time. `build_pk_hash_from_values()` must be called with `old_parsed` for the D-row (OLD pk_hash) and `parsed` for the I-row (NEW pk_hash). With REPLICA IDENTITY DEFAULT, `old_parsed` contains only PK columns, so non-PK columns in the D-row will be NULL. This is acceptable (scan operators only need pk_hash from D-rows for identity) **but only when the P5 direct aggregate path is not used** for WAL-sourced tables. **REPLICA IDENTITY FULL is a hard requirement** when a WAL-decoded source feeds a P5-eligible aggregate ST. Document this explicitly and emit a WARNING during `create_stream_table()` if the source uses WAL CDC and has REPLICA IDENTITY DEFAULT with a P5-eligible defining query.

  **Atomicity risk (new in D+I):** `write_decoded_change()` currently emits one `Spi::run()` per UPDATE. In D+I it must emit two (the D-row, then the I-row). These are separate SPI calls and are **not atomic**: if the background worker crashes or is killed between them, the D-row is committed without its I-row, leaving a dangling DELETE in the CB that permanently removes the pre-update row from every subsequent differential refresh without a compensating INSERT. The wide model has no equivalent risk (one call = one complete UPDATE row). **Mitigation:** use a single multi-row VALUES insert, `INSERT INTO changes_ (...) VALUES (...), (...)`, so both rows are written in one SPI call and one heap operation. This is the WAL decoder's equivalent of the statement-level trigger's UNION ALL choice.

- `src/cdc.rs` `sync_change_buffer_columns()`: the orphan-column cleanup builds `expected_data_cols` as `{"new_col1", "old_col1", ...}`.
  Post-D+I it must produce `{"col1", "col2", ...}`. **Migration safety:** the function must detect the current CB schema version before computing the expected set. Running the D+I-era logic against a pre-migration wide-schema buffer would silently drop ALL user data columns. Guard with a schema-version check (presence of `new_*`-prefixed columns → old schema, flat columns → new schema). During the feature-flag shadow period, both formats may coexist per source.

- `changed_cols` WB-1 (ADR): both the D-row and the I-row of an UPDATE pair carry the **same** VARBIT bitmask (`NULL` = genuine INSERT or DELETE, not an UPDATE half). The scan filter remains at read time, unchanged in structure — it tests `changed_cols IS NULL OR (changed_cols & mask) != zero`. This preserves full decoupling from view definitions (a single CB table can serve many STs, each with a different column mask), avoids write-time coupling to the query definition, and requires no correlation between D and I rows. The CDC trigger computes the bitmask once from OLD/NEW and writes it to both rows.

  *Rejected alternatives:* (1) write-time filtering — requires the trigger to know the union of all downstream referenced columns, coupling trigger rebuilds to view DDL changes; (2) bitmask on the I-row only — requires D/I correlation at read time to suppress phantom deletes, reintroducing the complexity D+I eliminates.

- Migration: rebuild all `changes_` tables and regenerate all CDC triggers during upgrade; discard in-flight buffer rows.

This is a **breaking schema change** and must be treated as a major version bump. Upgrade tests must cover the `changes_` rebuild path end-to-end.

**A44-11 — D+I benchmark suite.** The existing `bench_diff_scan` in `benches/diff_operators.rs` measures only Rust SQL-generation time (template construction), not PostgreSQL query execution time. A44-11 adds execution benchmarks to `benches/refresh_bench.rs` that measure the actual impact of the schema change:

1.
**Write path — UPDATE throughput.** Criterion benchmark: `INSERT INTO changes_` rate (rows/sec) under the row-level trigger, at row counts 1K / 10K / 100K and column widths 5 / 20 / 50. Expected result: a 5–10% regression per UPDATE row (one extra heap insert plus index maintenance). This establishes the precise cost of the trade-off.
2. **CB scan time — delta refresh latency.** E2E bench: `refresh_stream_table()` wall time with N rows in the change buffer, on a 50-column wide table, under the old wide schema vs the D+I schema. Row counts: 1K / 10K / 100K. Expected: a 20–40% improvement from the halved table width and single-pass scan.
3. **Multi-change collapse — window function elimination.** E2E bench: a workload where ≥50% of PKs have 2+ changes per refresh window (INSERT+UPDATE, UPDATE+DELETE, multiple UPDATEs). This stresses the `FIRST_VALUE`/`LAST_VALUE` path that D+I replaces. Expected: the largest improvement of all scenarios.
4. **UNION ALL elimination — buffer hit reduction.** Capture `EXPLAIN (ANALYZE, BUFFERS)` output for `diff_scan_change_buffer()` before and after migration; assert that the `shared_hit` count drops by ≥40% (reflecting elimination of the double CB scan).

All four benchmarks must run in CI as part of the existing Criterion regression gate. The write-path regression (benchmark 1) is expected and documented; the CI gate must be adjusted so it does not block PRs on that specific benchmark while still catching unexpected regressions on benchmarks 2–4.
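The A44-10 write-time decomposition and its `change_id` ordering invariant can be illustrated with a small sketch. The types and the in-Rust sequence counter are assumptions for the example; the real write paths are the generated PL/pgSQL trigger body and `src/wal_decoder.rs`.

```rust
// Sketch of UPDATE -> D+I decomposition at write time. CbRow and the u64
// sequence stand in for the real CB table and its BIGSERIAL change_id.

#[derive(Debug, Clone, PartialEq)]
enum Action { I, D }

#[derive(Debug, Clone, PartialEq)]
struct CbRow {
    change_id: u64,
    action: Action,
    pk_hash: u64,
    // None models NULL non-PK columns on a D-row under REPLICA IDENTITY DEFAULT.
    cols: Vec<Option<i64>>,
}

/// Decompose one UPDATE into a D-row (OLD image) followed by an I-row (NEW
/// image). The D-row MUST receive the smaller change_id so that ORDER BY
/// change_id sees it as FIRST_VALUE and the I-row as LAST_VALUE.
fn decompose_update(
    next_change_id: &mut u64,
    old_pk_hash: u64, old_cols: Vec<Option<i64>>,
    new_pk_hash: u64, new_cols: Vec<Option<i64>>,
) -> [CbRow; 2] {
    *next_change_id += 1;
    let d = CbRow { change_id: *next_change_id, action: Action::D, pk_hash: old_pk_hash, cols: old_cols };
    *next_change_id += 1;
    let i = CbRow { change_id: *next_change_id, action: Action::I, pk_hash: new_pk_hash, cols: new_cols };
    [d, i] // D-row first: the ordering invariant
}

fn main() {
    let mut seq = 0u64;
    // PK-changing UPDATE: the D-row carries the OLD pk_hash, the I-row the NEW one.
    let [d, i] = decompose_update(&mut seq, 11, vec![Some(1), Some(2)], 42, vec![Some(1), Some(3)]);
    assert!(d.change_id < i.change_id); // ORDER BY change_id sees D first
    assert_eq!(d.action, Action::D);
    assert_eq!(d.pk_hash, 11);
    assert_eq!(i.action, Action::I);
    assert_eq!(i.pk_hash, 42);
}
```

Both rows should reach the CB in one write (the UNION ALL insert in the trigger, the multi-row VALUES insert in the WAL decoder) so the pair is atomic.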
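The multi-change collapse that benchmark 3 stresses reduces, under D+I, to a simple net-effect rule per `pk_hash`. A hedged sketch follows; the `Action` enum and the chain-of-actions representation are illustrative, and in SQL this corresponds to the `FIRST_VALUE`/`LAST_VALUE` logic over `ORDER BY change_id`.

```rust
// Net-effect collapse per pk_hash under D+I: only the boundary rows of an
// ordered change chain matter; every intermediate row cancels.

#[derive(Debug, Clone, Copy, PartialEq)]
enum Action { I, D }

/// Net effect of one pk_hash's ordered change chain within a refresh window:
/// keep a leading D-row (the row existed before the window) and a trailing
/// I-row (the row exists after it).
fn net_effect(chain: &[Action]) -> Vec<Action> {
    let mut out = Vec::new();
    if chain.first() == Some(&Action::D) {
        out.push(Action::D);
    }
    if chain.last() == Some(&Action::I) {
        out.push(Action::I);
    }
    out
}

fn main() {
    use Action::*;
    // UPDATE then UPDATE: D I D I -> one net retraction plus one net insert.
    assert_eq!(net_effect(&[D, I, D, I]), vec![D, I]);
    // INSERT, UPDATE, DELETE: I D I D -> no net change at all.
    assert!(net_effect(&[I, D, I, D]).is_empty());
    // Lone INSERT survives unchanged.
    assert_eq!(net_effect(&[I]), vec![I]);
}
```

This is also the shape `compact_change_buffer()` must preserve under D+I: middle-row elimination leaves exactly the leading-D/trailing-I net pair.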
### Test Coverage

| ID | Title | Effort | Priority | Assessment ref |
|----|-------|--------|----------|----------------|
| T-A44-1 | Deep-join threshold GUC behavior tests | M | P1 | PERF-01 |
| T-A44-2 | GROUP_RESCAN improvement correctness tests | L | P0 | PERF-02 |
| T-A44-3 | explain_stream_table output accuracy | M | P1 | FEAT-04 |
| T-A44-4 | WAL per-source status accuracy | M | P1 | FEAT-05 |
| T-A44-5 | Microbenchmark CI gate validation | S | P1 | PERF-07 |
| T-A44-10 | D+I CB schema: CDC correctness (I/U/D), delta scan accuracy, upgrade path | XL | P0 | #728 |
| T-A44-11 | D+I benchmark suite: write path, CB scan, multi-change, UNION ALL | M | P0 | #728 |

**T-A44-1.** E2E tests that set the deep-join threshold GUCs to low values, create deep-join stream tables, and verify diagnostics are emitted when the thresholds are crossed.

**T-A44-2.** E2E tests for the old/new UPDATE split: create a stream table with `SUM(CASE ...)`, perform UPDATEs that cross the CASE boundary, and verify the incremental results match a full query result. Benchmark the improvement against the GROUP_RESCAN baseline.

**T-A44-3.** Test that `explain_stream_table` returns accurate plan information for various query patterns: simple scan, multi-join, aggregates with fallback, deep joins, and keyless sources.

**T-A44-4.** Test that WAL per-source status correctly reflects trigger mode, WAL mode, transition in progress, blocked reasons, and slot lag after various CDC lifecycle events.

**T-A44-5.** Verify that the microbenchmark Criterion baseline is established on push to main and that regression detection works on PRs.
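The WB-1 `changed_cols` filter exercised by T-A44-10 can be sketched with integer bitmasks standing in for VARBIT (an assumption made for brevity; column indices and ST names are invented).

```rust
// Sketch of the changed_cols read-time filter (WB-1), with u64 bitmasks
// standing in for VARBIT. Both halves of an UPDATE carry the same mask;
// None models the NULL that marks a genuine INSERT or DELETE.

/// A CB row passes an ST's filter when it is a genuine I/D row, or when the
/// UPDATE touched at least one column that the ST's defining query reads.
fn passes_filter(changed_cols: Option<u64>, st_mask: u64) -> bool {
    match changed_cols {
        None => true,                      // genuine INSERT or DELETE
        Some(bits) => bits & st_mask != 0, // UPDATE half: column overlap test
    }
}

fn main() {
    let update_mask = 0b0010; // the UPDATE changed only column 1
    let st_a = 0b0011;        // ST A reads columns 0 and 1
    let st_b = 0b0100;        // ST B reads only column 2

    // Multi-ST fan-out from one CB table: same row, different verdicts.
    assert!(passes_filter(Some(update_mask), st_a));
    assert!(!passes_filter(Some(update_mask), st_b));
    // DELETEs and INSERTs always pass, for every ST.
    assert!(passes_filter(None, st_b));
}
```

Because the mask is written once and tested per ST at read time, one CB table serves any number of STs without the trigger knowing their view definitions.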
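The `sync_change_buffer_columns()` migration guard that T-A44-10 verifies can be sketched as follows. The enum and inference rule mirror the guard described under A44-10 (presence of `new_*`/`old_*` prefix columns → wide schema); the function names are otherwise illustrative.

```rust
// Sketch of the schema-version guard: infer WIDE vs D+I from column names
// before computing the expected data-column set for orphan cleanup.
// (A44-10 prefers an explicit cb_schema_version flag; this inference is the
// fallback described for the migration guard.)

#[derive(Debug, PartialEq)]
enum CbSchema { Wide, Di }

fn detect_cb_schema(cols: &[&str]) -> CbSchema {
    if cols.iter().any(|c| c.starts_with("new_") || c.starts_with("old_")) {
        CbSchema::Wide
    } else {
        CbSchema::Di
    }
}

/// Expected data columns per schema version. Running the D+I expectation
/// against a wide buffer would mark every new_*/old_* column as an orphan
/// and silently drop all user data columns.
fn expected_data_cols(schema: &CbSchema, source_cols: &[&str]) -> Vec<String> {
    match schema {
        CbSchema::Wide => source_cols
            .iter()
            .flat_map(|c| [format!("new_{c}"), format!("old_{c}")])
            .collect(),
        CbSchema::Di => source_cols.iter().map(|c| c.to_string()).collect(),
    }
}

fn main() {
    let wide_cb = ["new_a", "old_a", "new_b", "old_b"];
    assert_eq!(detect_cb_schema(&wide_cb), CbSchema::Wide);
    assert_eq!(expected_data_cols(&CbSchema::Wide, &["a"]), vec!["new_a", "old_a"]);
    assert_eq!(detect_cb_schema(&["a", "b"]), CbSchema::Di);
    assert_eq!(expected_data_cols(&CbSchema::Di, &["a", "b"]), vec!["a", "b"]);
}
```

During the shadow period the detected (or flagged) version selects which expected set the orphan-drop logic uses, so a wide-schema buffer is never cleaned with D+I assumptions.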
**T-A44-10.** E2E tests covering the full CDC lifecycle under the D+I schema: INSERT, UPDATE, and DELETE correctness; net-effect idempotency (multiple UPDATEs to the same row); keyless table path; ST-source path; WAL decoder path; `changed_cols` symmetric bitmask (D-row and I-row both carry the same VARBIT, NULL for a genuine INSERT/DELETE); multi-ST fan-out (the same CB served by two STs with different column masks); **P5 direct aggregate delta** (SUM, COUNT over a single-source table, UPDATE correctness without the 'V' row); **recursive CTE** delta seeding from a D+I CB; **WAL decoder with REPLICA IDENTITY DEFAULT** (verify the P5 path is disabled or falls back); **PK-changing UPDATE** (D-row carries the OLD pk_hash, I-row carries the NEW pk_hash); **`sync_change_buffer_columns()` migration guard** (verify a wide-schema CB is not corrupted by running D+I-era orphan-drop logic); **`compact_change_buffer()` with a D+I CB** (INSERT+UPDATE+DELETE chains, verify they are not incorrectly discarded as no-ops, verify middle-row elimination leaves the correct D₁/I₂ net pair). Upgrade path tests: an existing stream table with an old wide-schema buffer successfully migrates to D+I on `ALTER EXTENSION pg_trickle UPDATE`.

**T-A44-11.** Run all four benchmark scenarios (write path, CB scan, multi-change, UNION ALL elimination) against both the wide-schema baseline and the D+I schema. Assert:

- Write path: regression ≤15% per UPDATE (documented and accepted trade-off)
- CB scan: improvement ≥20% for a 50-column table at 10K+ buffer rows
- Multi-change: improvement ≥30% when ≥50% of PKs have 2+ changes
- UNION ALL: `shared_hit` in `EXPLAIN BUFFERS` drops ≥40%

Results must be committed as Criterion baselines and reviewed in the PR.

### Conflicts & Risks

- **A44-2** (GROUP_RESCAN improvement) is a core DVM operator change that affects correctness. It must be accompanied by extensive differential tests comparing DIFF vs FULL results.
- **A44-1** (GUC promotion) changes runtime behavior for users who previously relied on hardcoded defaults.
  Ensure the defaults match the current constants.
- **A44-8** (explain) must not become a maintenance burden — derive explanations from the same metadata the DVM uses, not from separate documentation strings.
- **A44-10** (D+I refactor) is the highest-risk item: it rewrites the CB table schema, the CDC trigger bodies, and the `diff_scan_change_buffer()` read path in one migration. Risk mitigation: implement behind a feature flag first; run shadow mode (old and new CB tables simultaneously) for one release cycle; gate on `just test-upgrade-all` passing before merging.

  **Feature-flag CB schema version:** during the shadow period, different sources may have different CB formats. `diff_scan_change_buffer()` and the P5/recursive paths need to know which format a given CB is in. Use a per-source boolean or enum in `pgtrickle.pgt_stream_tables` (e.g. `cb_schema_version TEXT DEFAULT 'WIDE'`, set to `'DI'` after migration) rather than inferring the format at runtime by inspecting column names.

  Additional high-risk surface areas identified:

  - The `aggregate.rs` P5 direct aggregate path reads CB columns directly (bypassing `diff_scan_change_buffer()`); it requires a full rewrite of column references and removal of the 'V' value-only row optimization.
  - `recursive_cte.rs` directly references `c."new_{col}"` and `action IN ('I','U')` in seed generation — it will emit empty results post-D+I if not updated.
  - The `sync_change_buffer_columns()` orphan-drop logic will destroy wide-schema columns if run under D+I assumptions. It must be version-gated.
  - The WAL decoder with REPLICA IDENTITY DEFAULT produces NULL non-PK columns on D-rows; any operator reading value columns from D-rows (e.g. the P5 aggregate) will compute wrong results. Hard-require REPLICA IDENTITY FULL for WAL sources feeding P5-eligible STs.
  - WAL decoder D+I atomicity: two separate `Spi::run()` calls for the D-row and I-row are not atomic — a crash between them leaves a dangling DELETE. Mitigate with a single multi-row `INSERT INTO ... VALUES (...), (...)`.
    This risk does not exist in the wide model (one SPI call per UPDATE).
- **A44-11** (benchmark suite) must be implemented *before* A44-10 merges so that the wide-schema baseline is captured on the same hardware. Running the benchmarks only after the migration removes the ability to compare.

  **Baseline capture discipline:** the `EXPLAIN (ANALYZE, BUFFERS)` baseline and the Criterion baseline must be captured against the *exact same fixture data* as the post-migration run. Capture the baseline as the **first step** of the A44-10 implementation, before any Rust code changes. Any change to the benchmark fixture between the baseline and the post-migration run invalidates the comparison.

### Exit Criteria

- [ ] A44-1: Deep-join thresholds are GUC-tunable and emit diagnostics
- [ ] A44-2: GROUP_RESCAN cases improved with the old/new split, verified by benchmark
- [ ] A44-3: WAL batch/lag GUCs functional with metrics
- [ ] A44-4: Cost-cache capacity configurable, metrics exposed
- [ ] A44-7: Mandatory microbenchmarks gate PRs
- [ ] A44-8: `explain_stream_table` returns accurate plan information
- [ ] A44-9: WAL per-source status view functional
- [ ] A44-10: D+I CB schema live; `action` column contains only 'I'/'D'; `diff_scan_change_buffer()` contains no UNION ALL; upgrade path tested end-to-end
- [ ] A44-11: Benchmark suite implemented; baseline captured against the wide schema; post-migration results committed; write-path regression within documented bounds; CB scan and multi-change improvements confirmed
- [ ] Extension upgrade path tested (`0.43.0 → 0.44.0`)
- [ ] `just lint` passes with zero warnings
- [ ] `just test-all` passes