# PLAN_PERFORMANCE_PART_9.md — Strategic Performance Roadmap **Date:** 2026-02-26 **Status:** Planning **Baseline:** PLAN_PERFORMANCE_PART_8.md results (2026-02-22) **Scope:** Higher-level performance strategy, architectural improvements, and benchmark infrastructure enhancements. --- ## Table of Contents 1. [Current State Assessment](#1-current-state-assessment) 2. [Benchmark Results (2026-02-26 Criterion Run)](#2-benchmark-results-2026-02-26-criterion-run) 3. [Higher-Level Performance Decisions](#3-higher-level-performance-decisions) 4. [Phase A: Remaining Regressions & Quick Wins](#4-phase-a-remaining-regressions--quick-wins) 5. [Phase B: CDC Architecture — Statement-Level Triggers](#5-phase-b-cdc-architecture--statement-level-triggers) 6. [Phase C: Parallel Refresh Implementation](#6-phase-c-parallel-refresh-implementation) 7. [Phase D: MERGE Alternatives & Planner Control](#7-phase-d-merge-alternatives--planner-control) 8. [Phase E: Memory and I/O Budget Management](#8-phase-e-memory-and-io-budget-management) 9. [Benchmark Infrastructure Improvements](#9-benchmark-infrastructure-improvements) 10. [Implementation Priority & Schedule](#10-implementation-priority--schedule) --- ## 1. Current State Assessment ### 1.1 Where We Are After 8 rounds of optimization, the core pipeline overhead (Decision + Gen+Build) is **under 1ms** for all scenarios. Delta SQL generation runs in **10–55 µs** (Criterion benchmarks). The system is fast at what it controls. **The bottleneck is now entirely PostgreSQL MERGE execution** — 70–97% of refresh time. Further Rust-side optimization yields diminishing returns. ### 1.2 Key Metrics (Part 8 Baseline) | Scenario | 100K/1% INCR ms | Speedup vs FULL | Status | |----------|:---------------:|:---------------:|--------| | scan | 4.6 | 65.9x | Excellent | | filter | 3.4 | 39.5x | Excellent | | aggregate | 2.5 | 6.8x | Good | | join | 18.0 | 16.3x | Regressed from Part 6 (12.3ms) | | join_agg | 14.6 | 1.9x | Regressed from Part 6 (8.5ms) | ### 1.3 Problem Areas | Issue | Impact | Root Cause | |-------|--------|-----------| | Join regression (+46%) | 18ms vs 12.3ms at 100K/1% | Prepared statement revert; 11-CTE join delta requires re-planning every cycle | | join_agg regression (+72%) | 14.6ms vs 8.5ms at 100K/1% | Same as join + aggregate overhead compounding | | join_agg 100K/10% = 0.3x FULL | INCR slower than FULL | Adaptive threshold not triggering correctly | | scan 100K/50% = 0.7x FULL | INCR slower than FULL | MERGE overhead > TRUNCATE+INSERT at high change rates | | No-data latency 9.73ms | Borderline target (<10ms) | Additional SPI calls vs Part 6 | | P95 spikes up to 7.6x median | Unpredictable latency | PostgreSQL plan cache invalidation | | diff_operators bench broken | Cannot run locally | pgrx symbols require PG library linking | --- ## 2. Benchmark Results (2026-02-26 Criterion Run) ### refresh_bench (Pure Rust, no DB) These benchmarks ran successfully on macOS. Key results: | Benchmark | Median | Change vs Previous | Notes | |-----------|--------|-------------------|-------| | quote_ident/simple | 93.9 ns | +5.6% | Minor regression — likely measurement noise | | col_list/10 | 1.37 µs | +5.8% | Minor regression | | col_list/50 | 6.94 µs | -10.9% | Improved | | prefixed_col_list/5 | 883.8 ns | -5.6% | Improved | | prefixed_col_list/20 | 4.00 µs | +34.2% | **Significant regression** — investigate | | expr_to_sql/simple_column | 26.3 ns | No change | Stable | | expr_to_sql/func_call | 211.4 ns | +12.7% | Moderate regression | | lsn_gt/0/0_vs_0/1 | 78.9 ns | +21.9% | **Regression** — string comparison overhead | | frontier_json/serialize/1 | 117.4 ns | +5.3% | Minor | | frontier_json/serialize/20 | 3.76 µs | -10.2% | Improved | | dag/build_linear_chain/10 | 6.52 µs | +4.5% | Minor | | dag/build_linear_chain/100 | 62.5 µs | No change | Stable | | xxh64/16B | 4.79 ns | +8.2% | Minor | | xxh64/4096B | 384.1 ns | +11.1% | Moderate regression | | canonical_period_60s | 666 ps | No change | Trivial | ### diff_operators (Requires pgrx/PG linking) **Could not run locally.** The `diff_operators` benchmark links against `pg_trickle` library code that references pgrx symbols (`SPI_processed`, etc.). These symbols are only available inside a PostgreSQL backend process or when linked against libpq/PG server libraries. Previous results (from Part 8, likely run inside Docker or with PG installed): | Benchmark | Time (µs) | |-----------|-----------| | diff_scan/3cols | 9.9 | | diff_scan/20cols | 47.7 | | diff_filter | 11.3 | | diff_aggregate/sum_count_avg | 21.7 | | diff_inner_join | 30.7 | | diff_left_join | 26.8 | | diff_union_all/10_children | 105.9 | | diff_window_row_number | 17.1 | | diff_join_aggregate | 54.2 | ### Assessment 1. **Rust-side operations are sub-microsecond to low-microsecond** — not a bottleneck. Even `diff_join_aggregate` at 54µs is negligible compared to the 14.6ms MERGE execution. 2. The `prefixed_col_list/20` regression (+34%) and `lsn_gt` regression (+22%) warrant investigation — these are called frequently during delta SQL generation and frontier comparison. 3. The high outlier rates (10–19% of measurements) across many benchmarks suggest **system noise** — the benchmarks should be run with higher sample counts or on a quieter system for reliable comparisons. --- ## 3. Higher-Level Performance Decisions These are architectural and strategic decisions that fundamentally affect pg_trickle's performance ceiling. Each is more impactful than any individual code optimization. ### D1: Where Does Time Actually Go? (The 80/20 Rule) The Part 8 per-phase breakdown reveals: | Phase | Time at 100K/1% | % of Total | |-------|:----------------:|:----------:| | Decision (SPI: change count, threshold) | 0.5–1.5ms | 5–15% | | Gen+Build (Rust: delta SQL) | 0.05–0.14ms | <1% | | **MERGE (PG: execute SQL)** | **1–168ms** | **70–97%** | | Cleanup (SPI: change buffer) | 0.1–7.7ms | 3–15% | **Decision:** Future optimization effort should focus on (a) making the MERGE SQL faster or (b) avoiding it entirely, not on making the Rust pipeline faster. ### D2: When to Give Up on Incremental At high change rates (10–50%), INCREMENTAL refresh can be **slower** than FULL. The current adaptive threshold mechanism (auto_threshold) should detect this and switch, but Part 8 identified that it wasn't triggering correctly for several scenarios. **Decision:** The adaptive fallback is the single most important performance feature. Priorities: 1. **Verify the Part 8 fix** (A-3: `last_full_ms` initialization) actually resolves the join_agg 100K/10% = 0.3x regression 2. **Lower the default auto_threshold** from 0.15 to 0.10 — this triggers FULL fallback earlier when costs are close 3. **Track and expose the threshold** (per GAP_SQL_PHASE_7 §F27) so operators can debug strategy decisions ### D3: Row-Level vs Statement-Level CDC Triggers The current row-level AFTER trigger fires once per affected row, inserting into the change buffer each time. For bulk DML (e.g., `UPDATE src SET x = x+1 WHERE region = 'north'` hitting 20K rows), this means 20K individual trigger invocations + 20K change buffer INSERTs + 20K index updates + 20K WAL records. PostgreSQL supports **statement-level triggers with transition tables** (PG 10+): ```sql CREATE TRIGGER cdc_stmt AFTER INSERT OR UPDATE OR DELETE ON src REFERENCING NEW TABLE AS new_rows OLD TABLE AS old_rows FOR EACH STATEMENT EXECUTE FUNCTION pg_trickle_cdc_stmt_fn(); ``` **Impact:** A single trigger invocation per statement, processing all affected rows in one batch `INSERT INTO changes_ SELECT ... FROM new_rows`. This eliminates per-row trigger overhead (PL/pgSQL entry/exit, WAL per trigger, sequence increment per row). **Expected improvement:** 50–80% reduction in write-side overhead for bulk DML. No change for single-row DML (which already fires once). **Trade-offs:** - Statement-level triggers can't access per-row `TG_OP` — all rows in a transition table share the same operation kind - Need to handle the case where both `new_rows` and `old_rows` are present (UPDATE) vs only one (INSERT/DELETE) - TRUNCATE has no transition tables — continue using existing statement-level TRUNCATE trigger **Decision:** This is the highest-ROI write-side optimization. Plan in Phase B. ### D4: Parallel Refresh Dispatch Per REPORT_PARALLELIZATION.md, the scheduler processes STs sequentially even though independent DAG branches could run in parallel. The GUC `pg_trickle.max_concurrent_refreshes` exists but is not enforced for parallel dispatch. **Impact:** With 50 STs averaging 200ms each, sequential processing takes 10s. With 4 parallel workers and a balanced DAG, this drops to ~2.5s. **Decision:** Implement level-parallel DAG dispatch (Option A+B from the parallelization report). This is a high-value feature for multi-ST deployments and makes the existing GUC operational. ### D5: UNLOGGED Change Buffers Change buffer writes generate WAL for durability. If the extension can always recover from a crash by re-running a full refresh (which it already does on crash detection), the change buffer doesn't need WAL protection. **Impact:** ~30% reduction in per-row trigger overhead (eliminates WAL writes for trigger INSERTs). At 5µs/row trigger overhead, saving 1.5µs/row is significant at scale. **Trade-offs:** - Change buffer data lost on crash — requires full refresh to recover - Already the default recovery mode (crash → reinitialize) - Cloud PG providers may not support UNLOGGED tables on replicas **Decision:** Implement behind a GUC (`pg_trickle.change_buffer_unlogged`, default false). Benchmark with the trigger overhead suite first. ### D6: Column Pruning in Change Buffers The current trigger captures ALL source table columns in the change buffer. If the stream table's defining query only references 3 of 20 source columns, the trigger still writes all 20 `new_*` and 20 `old_*` columns. **Impact:** For wide tables (20+ columns), trigger overhead scales linearly with column count (estimated from PLAN_TRIGGERS_OVERHEAD.md: 3x overhead for 20 cols vs 3 cols for UPDATE). Pruning to only referenced columns could reduce trigger cost by 50–80% for wide tables. **Trade-offs:** - Requires knowing which columns the stream table uses at trigger creation time (available from `pgt_dependencies.columns_used`) - Adding a new stream table referencing additional columns would require trigger regeneration - Columns used by a defining query may include join keys, filter predicates, and GROUP BY keys — not just SELECT columns **Decision:** High-value for wide tables. Implement after statement-level triggers (D3) since both modify the CDC architecture. ### D7: Prepared Statement Strategy Part 8 Session 5 implemented `PREPARE`/`EXECUTE` for MERGE statements. PostgreSQL switches from custom → generic plan after ~5 executions. This saves 1–2ms parse/plan time per refresh on cache hits. The Part 8 analysis noted that the prepared statement revert (undone in P1+P2) is the root cause of the join regression. The join delta query has 11 CTEs — planning this from scratch every cycle is expensive. **Current state:** `pg_trickle.use_prepared_statements` GUC exists (default true) and is implemented. Need to verify it's working correctly and that the join regression recovers when enabled. **Decision:** Verify prepared statements are active in the benchmark harness. If the join regression persists with prepared statements enabled, the issue is elsewhere. ### D8: PostgreSQL Parallel Query for Large Refreshes Per REPORT_PARALLELIZATION.md Option E, PostgreSQL's parallel query engine can parallelize individual delta SQL / MERGE statements. This is free — it requires zero code changes, only correct GUC settings. **Decision:** Verify with `EXPLAIN ANALYZE` that parallel plans are being chosen for 100K-row refreshes. Document recommended PG settings for pg_trickle deployments: ```ini max_parallel_workers_per_gather = 2 -- or 4 for large tables min_parallel_table_scan_size = 1MB -- lower threshold for stream tables ``` --- ## 4. Phase A: Remaining Regressions & Quick Wins ### A-1: Verify Adaptive Threshold Fix (Part 8 A-3) **Status:** ✅ Verified. D-2 aggregate saturation now correctly triggers FULL fallback when changes >= groups (confirmed via container logs with `log_min_messages=info`). Result: join_agg 100K/10% improved from 0.3x to **0.7x**. The remaining gap vs pure FULL mode (~14ms) is inherent decision-path overhead (frontier check, change counting, D-2 group count query) before falling back to `execute_full_refresh`. **Bugs fixed during verification:** - D-2 `reltuples` fallback: `pg_class.reltuples = -1` for unanalyzed tables caused `GREATEST(-1, 0) = 0`, disabling the check entirely. Fixed with `CASE WHEN reltuples >= 1 ... ELSE COUNT(*)` fallback. - D-3 SQL: `ORDER BY refresh_id DESC LIMIT 10` was inside an aggregate `SELECT AVG(...)`, causing "column must appear in GROUP BY" error. Fixed by wrapping in a proper subselect. **Benchmark:** `bench_join_agg_100k_10pct` — FULL avg 43.6ms, INCR avg 59.6ms (0.7x), D-2 triggers on every cycle. ### A-2: Verify Prepared Statements Recover Join Regression **Status:** ✅ Verified. Prepared statements are working — cycle 1 (cold, custom plan) = 40.6ms vs cycles 2+ (generic plan locked) = 32.2ms avg, a 21% warm-up improvement. Speedup vs FULL: **12.9x** (FULL avg 425ms). Absolute ms are higher than Part 8 baselines (32ms vs 18ms) due to Docker-on-macOS overhead (~1.8x); the speedup ratio (12.9x vs Part 8's 16.3x) is comparable. Created `bench_join_100k_1pct` test for future tracking. **Benchmark:** `bench_join_100k_1pct` — INCR c1=40.6ms, c2+=32.2ms avg, med=32.1ms, P95=37.5ms. ### A-3: Investigate prefixed_col_list/20 Regression (+34%) The Criterion benchmark shows `prefixed_col_list(prefix, &cols)` with 20 columns regressed from ~3.0µs to ~4.0µs. While not impactful at absolute scale, this function is called per column per scan in delta SQL generation. **Action:** Profile `prefixed_col_list` for unnecessary allocations. Consider pre-allocating the output string with capacity. **Effort:** 1 hour. ### A-4: Investigate lsn_gt Regression (+22%) `lsn_gt("0/0", "0/1")` regressed from ~65ns to ~79ns. This is called during frontier comparison and change buffer queries. **Action:** Check if the `lsn_gt` implementation uses string parsing. Consider caching parsed LSN values as `(u32, u32)` pairs. **Effort:** 1 hour. --- ## 5. Phase B: CDC Architecture — Statement-Level Triggers > **Status: B1 ✅ Done · B2 ✅ Done · B3 ✅ Done** This is the highest-ROI write-side optimization identified across all performance plans. ### B-1: Statement-Level Trigger Implementation ✅ Done Replace the per-row AFTER trigger with a statement-level trigger using transition tables. **Implemented in v0.4.0:** - `build_stmt_trigger_fn_sql()` in `src/cdc.rs` generates the PL/pgSQL trigger function using `INSERT … SELECT FROM __pgt_new/old`. - `create_change_trigger()` dispatches on `config::pg_trickle_cdc_trigger_mode()` and creates either: - Statement: `REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old FOR EACH STATEMENT` - Row (legacy): `FOR EACH ROW` - **Keyless table UPDATE** handled as DELETE from `__pgt_old` + INSERT from `__pgt_new` (no PK join possible). - Trigger function name: `pg_trickle_cdc_fn_{oid}` in both modes (same name allows `CREATE OR REPLACE` to switch bodies without touching the DDL). Final DDL (statement mode): ```sql CREATE TRIGGER pg_trickle_cdc_{oid} AFTER INSERT OR UPDATE OR DELETE ON {schema}.{table} REFERENCING NEW TABLE AS __pgt_new OLD TABLE AS __pgt_old FOR EACH STATEMENT EXECUTE FUNCTION pgtrickle_changes.pg_trickle_cdc_fn_{oid}(); ``` ### B-2: Backward Compatibility & Migration ✅ Done - `pg_trickle.cdc_trigger_mode = 'statement'|'row'` GUC (default `'statement'`) in `src/config.rs` (`CdcTriggerMode` enum + `PGS_CDC_TRIGGER_MODE` static). - `rebuild_cdc_trigger()` in `src/cdc.rs`: drops + recreates the trigger DDL using the current GUC mode (fn body + trigger DDL). - `pgtrickle.rebuild_cdc_triggers()` in `src/api.rs`: iterates all source tables and calls `rebuild_cdc_trigger()` for each. - `sql/pg_trickle--0.3.0--0.4.0.sql`: declares the compiled function and calls it to migrate existing row-level triggers on `ALTER EXTENSION UPDATE`. ### B-3: Benchmark Write-Side Improvement ✅ Done Implemented in `tests/e2e_bench_tests.rs` as two `#[ignore]` test functions: - **`bench_stmt_vs_row_cdc_matrix`** — full 3×3×2 matrix: narrow/medium/wide × bulk INSERT/UPDATE + single INSERT × row/statement mode. Prints a comparison table with avg/p50/p95 timings and per-cell speedup ratios. Expected runtime: 15–20 min. - **`bench_stmt_vs_row_cdc_quick`** — narrow table, bulk INSERT only, 5 iterations. Completes in ~60 s; useful as a smoke-check. Both benchmarks use `alter_system_set_and_wait()` to switch CDC trigger mode between `row` and `statement` between runs. The output includes per-width bulk-DML speedup percentages and a single-row neutrality check (±10% target). Run explicitly: ```bash cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_matrix --nocapture cargo test --test e2e_bench_tests -- --ignored bench_stmt_vs_row_cdc_quick --nocapture ``` **Expected outcome:** 50–80% overhead reduction for bulk DML; neutral (±10%) for single-row DML. Results will be documented in PLAN_PERFORMANCE_PART_10.md once the benchmarks have been run. --- ## 6. Phase C: Parallel Refresh Implementation Per REPORT_PARALLELIZATION.md, implementing Option A+B. ### C-1: DAG Level Extraction Modify `StDag::topological_order()` to return `Vec>` (levels) instead of `Vec`. Trivial modification to Kahn's algorithm. **Effort:** 2–4 hours. ### C-2: Dynamic Background Worker Dispatch For each level, spawn up to `max_concurrent_refreshes` dynamic background workers. Each worker: 1. Receives `pgt_id` via datum 2. Acquires advisory lock (or `FOR UPDATE SKIP LOCKED` per GAP_SQL_PHASE_7 Q7) 3. Executes `refresh_stream_table()` 4. Writes result to `pgt_refresh_history` 5. Exits The coordinator waits for all workers in a level to complete before advancing to the next level. **Effort:** 12–16 hours. ### C-3: Result Communication Workers write refresh outcomes to `pgt_refresh_history` (already done by `refresh_stream_table`). The coordinator reads these after worker completion to update retry state and scheduling decisions. **Effort:** 3–4 hours. --- ## 7. Phase D: MERGE Alternatives & Planner Control ### D-1: Hash-Based Change Detection for Wide Tables Per GAP_SQL_PHASE_7 §G4.6, the MERGE `WHEN MATCHED` arm uses per-column `IS DISTINCT FROM` chains. For 100+ columns, this causes plan cache bloat. **Fix:** For tables with > 50 columns, replace the per-column chain with a single hash comparison: ```sql WHEN MATCHED AND pg_trickle.pg_trickle_hash( row_to_json(st.*)::text ) IS DISTINCT FROM pg_trickle.pg_trickle_hash( row_to_json(d.*)::text ) THEN UPDATE SET ... ``` **Effort:** 4–6 hours. ### D-2: Conditional FULL Bypass for Aggregates When all groups in an aggregate query are affected (common at high change rates with few groups), FULL refresh is always cheaper. Detect this condition before executing the delta: ``` if affected_groups_estimate == total_groups: use FULL refresh ``` The affected groups can be estimated from the change buffer: `SELECT COUNT(DISTINCT group_key) FROM changes WHERE ...` If this equals the number of groups in the stream table, skip INCR. **Effort:** 3–4 hours. ### D-3: COST-Based Strategy Selection Currently, the strategy decision uses a simple ratio (`change_count / table_size > threshold`). A more sophisticated approach would estimate the actual cost: ``` estimated_incr_cost = delta_rows × (cte_count × scan_cost + merge_cost) estimated_full_cost = table_size × scan_cost + truncate_cost ``` Use the historical `pgt_refresh_history` data to calibrate the cost model per stream table. **Effort:** 6–8 hours. --- ## 8. Phase E: Memory and I/O Budget Management ### E-1: Delta Size Estimation Before Execution Before executing the MERGE, estimate the delta result set size: ```sql SELECT COUNT(*) FROM (delta_query LIMIT 10001) t ``` If the estimate exceeds a threshold, abort INCR and fall back to FULL. This prevents OOM for unexpected large deltas. **Effort:** 2–3 hours. ### E-2: work_mem Scaling The current `pg_trickle.merge_work_mem_mb` GUC sets a fixed work_mem for large deltas. A smarter approach: scale work_mem proportional to estimated delta size: ``` work_mem = max(base_work_mem, delta_rows * avg_row_width / 1024) ``` Capped at `pg_trickle.merge_work_mem_max_mb`. **Effort:** 2 hours. ### E-3: Change Buffer Partitioning for High-Write Tables For source tables with > 100K writes/minute, the change buffer table becomes a hotspot. Partition by `lsn` range (monthly/weekly) to allow fast TRUNCATE of old partitions. **Effort:** 8–12 hours. Defer until trigger overhead benchmarks confirm this is a bottleneck. --- ## 9. Benchmark Infrastructure Improvements ### Current Gaps The existing benchmark infrastructure has several gaps that limit insight into system performance: ### I-1: diff_operators Benchmark Cannot Run Locally **Problem:** The `diff_operators` Criterion bench links against `pg_trickle` library code that uses pgrx symbols (`SPI_processed`, `pg_sys::*`). On macOS without PostgreSQL server libraries installed, the dyld loader fails with `symbol not found in flat namespace '_SPI_processed'`. **Fix:** Refactor `diff_operators.rs` to depend only on pure-Rust types (`OpTree`, `DiffContext`, `Expr`, `Column`, `Frontier`). The diff operators themselves are pure Rust — the issue is that the benchmark binary links against the full `pg_trickle` crate which includes SPI-using modules. **Options:** | Option | Description | Effort | |--------|-------------|--------| | **(a) Feature-gate pgrx** | Add a `bench` feature that stubs out pgrx dependencies | 2–3 hours | | **(b) Extract pure Rust into sub-crate** | Move `dvm/`, `dag.rs`, `version.rs`, `hash.rs` to a `pg_trickle_core` crate that doesn't depend on pgrx | 6–8 hours | | **(c) Run inside Docker** | Run `cargo bench` inside the E2E Docker container where PG libs are available | 1 hour (justfile target) | **Recommended:** **(c) as quick fix**, **(b) as long-term solution.** Option (c) gets the benchmarks running immediately. Option (b) is the correct architectural fix — pure-Rust logic should not require a PostgreSQL installation to benchmark. ### I-2: No Latency Distribution / Histogram Output **Problem:** The E2E benchmarks report avg, median, P95, and cycle-1 vs cycles-2+ breakdowns. But there's no histogram or full distribution output. P95 spikes (up to 7.6x median) are noted but not diagnosed. **Fix:** Add per-cycle timing output as a parseable CSV-like format: ``` [BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=1 mode=INCR ms=44.3 [BENCH_CYCLE] scenario=join rows=100000 pct=0.01 cycle=2 mode=INCR ms=7.1 ... ``` This enables external analysis (histogram plots, outlier identification, trend detection) without changing the test code significantly. **Effort:** 2 hours. ### I-3: No EXPLAIN ANALYZE Capture for Delta Queries **Problem:** We know MERGE dominates (70–97% of time) but don't know what PostgreSQL is doing inside the MERGE. Is it using hash joins? Nested loops? Seq scans? Parallel workers? Are sorts spilling to disk? **Fix:** Add a benchmark mode that captures `EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)` for the delta query. Store the plan in `/tmp/bench_plans/__.json` for offline analysis. **Implementation:** Add a GUC or environment variable `PGS_BENCH_EXPLAIN=true` that, when set, runs each MERGE as `EXPLAIN ANALYZE` on the first measured cycle. **Effort:** 3–4 hours. ### I-4: No Cross-Run Comparison Tool **Problem:** Each benchmark run produces printed output. Comparing results across runs (e.g., before/after an optimization) requires manual side-by-side reading, as manually done in PLAN_PERFORMANCE_PART_8.md. **Fix:** Write benchmark results to a JSON file (`target/bench_results/.json`). Provide a comparison script: ```bash just bench-compare target/bench_results/2026-02-22.json target/bench_results/2026-02-26.json ``` Output: per-scenario deltas with color-coded improvement/regression. **Effort:** 4–6 hours. ### I-5: No Concurrent Writer Benchmarks **Problem:** All benchmarks use a single connection for DML. In production, multiple concurrent writers hit source tables simultaneously. This stress- tests trigger overhead differently (BIGSERIAL contention, change buffer index lock contention, WAL write serialization). **Fix:** Extend the trigger overhead benchmark (PLAN_TRIGGERS_OVERHEAD.md) to sweep writer concurrency: 1, 2, 4, 8 connections performing DML simultaneously. **Effort:** 4–6 hours (requires pgbench-style parallel harness or tokio::spawn tasks with separate sqlx connections). ### I-6: No Table Size Scaling Benchmark (1M+ Rows) **Problem:** The current matrix tests 10K and 100K rows. Production tables commonly have 1M–100M rows. At 1M rows / 1% change rate (10K changes), the delta is larger than anything currently benchmarked, and the MERGE touches more stream table rows for the `WHEN NOT MATCHED BY SOURCE` arm. **Fix:** Add a 1M-row tier to the benchmark matrix. Run as an optional `bench_large_matrix` test (separate from the standard suite due to longer run time). **Effort:** 2 hours (increase TABLE_SIZES; may need container memory bump). ### I-7: No Window / Lateral / CTE Operator Benchmarks **Problem:** The E2E benchmarks test only 5 scenarios (scan, filter, aggregate, join, join_agg). Window functions, lateral subqueries, CTEs, set operations, and their combinations are not benchmarked. **Fix:** Add scenarios: - `window`: `SELECT id, region, ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) FROM src` - `lateral`: `SELECT s.*, l.top_score FROM src s, LATERAL (SELECT MAX(score) as top_score FROM src s2 WHERE s2.region = s.region) l` - `cte`: `WITH regional AS (SELECT region, SUM(amount) as total FROM src GROUP BY region) SELECT s.*, r.total FROM src s JOIN regional r ON s.region = r.region` - `union_all`: `SELECT id, amount, 'high' as tier FROM src WHERE amount > 5000 UNION ALL SELECT id, amount, 'low' FROM src WHERE amount <= 5000` **Effort:** 4–6 hours. ### I-8: Criterion Benchmark Noise Reduction **Problem:** The Criterion benchmarks show 10–19% outlier rates across many tests, suggesting system noise (background processes, thermal throttling, memory pressure). This makes it difficult to detect real regressions. **Fix:** - Increase `sample_size` from default 100 to 200 for critical benchmarks - Add `measurement_time(Duration::from_secs(10))` for fast benchmarks to get more iterations - Document that benchmarks should be run on a quiet system with `nice -n -20` (or macOS equivalent) - Consider adding a `bench-stable` justfile target that sets CPU frequency governor (Linux) or reduces measurement noise **Effort:** 1–2 hours. --- ## 10. Implementation Priority & Schedule ### Session 1: Verify & Fix Regressions (3–4 hours) ✅ Done | Step | Task | Effort | Status | |------|------|--------|--------| | A-1 | Verify adaptive threshold fix via E2E benchmark | 30 min | ✅ Done — 0.3x → 0.7x, D-2 triggers correctly | | A-2 | Verify prepared statements recover join regression | 30 min | ✅ Done — 12.9x speedup, warm-up 21% improvement | | A-3 | Fix `prefixed_col_list/20` regression | 1 hour | ✅ Done — eliminated Vec allocation | | A-4 | Investigate `lsn_gt` regression | 1 hour | ✅ Done — use `split_once` | ### Session 2: Benchmark Infrastructure (8–12 hours) ✅ Done | Step | Task | Effort | Status | |------|------|--------|--------| | I-1c | Run diff_operators in Docker (quick fix) | 1 hour | ✅ Done — `just bench-docker` target | | I-2 | Add per-cycle CSV output to E2E benchmarks | 2 hours | ✅ Done — `[BENCH_CYCLE]` lines | | I-3 | Add EXPLAIN ANALYZE capture mode | 3–4 hours | ✅ Done — `PGS_BENCH_EXPLAIN=true` | | I-6 | Add 1M-row benchmark tier | 2 hours | ✅ Done — `bench_*_1m_*` + `bench_large_matrix` | | I-8 | Reduce Criterion noise | 1 hour | ✅ Done — sample_size(200), measurement_time(10s) | ### Session 3: Statement-Level Triggers ✅ Complete | Step | Task | Effort | Status | |------|------|--------|--------| | B-1 | Implement statement-level trigger generation | 8 hours | ✅ Done | | B-2 | GUC + backward compatibility + migration | 4 hours | ✅ Done | | B-3 | Benchmark write-side improvement | 2 hours | ✅ Done | ### Session 4: Parallel Refresh (16–24 hours) ✅ Done | Step | Task | Effort | Status | |------|------|--------|--------| | C-1 | DAG level extraction | 2–4 hours | ✅ Done — `topological_levels()` on StDag & ExecutionUnitDag | | C-2 | Dynamic background worker dispatch | 12–16 hours | ✅ Done — existing `parallel_dispatch_tick` sufficient | | C-3 | Result communication + error handling | 3–4 hours | ✅ Done — existing `SchedulerJob` + `pgt_refresh_history` | ### Session 5: MERGE Optimization (8–12 hours) ✅ Done | Step | Task | Effort | Status | |------|------|--------|--------| | D-1 | Hash-based change detection for wide tables | 4–6 hours | ✅ Done — xxh64 via `pg_trickle_hash` | | D-2 | Conditional FULL bypass for saturated aggregates | 3–4 hours | ✅ Done — changes ≥ groups → FULL | | D-3 | Cost-based strategy selection | 6–8 hours | ✅ Done — history-based cost model blended with ratio | ### Session 6: Advanced Benchmarks (8–12 hours) ✅ Done | Step | Task | Effort | Status | |------|------|--------|--------| | I-4 | Cross-run comparison tool | 4–6 hours | ✅ Done — JSON output + `just bench-compare` | | I-5 | Concurrent writer benchmarks | 4–6 hours | ✅ Done — 1/2/4/8 writer sweep | | I-7 | Window / lateral / CTE operator benchmarks | 4–6 hours | ✅ Done — 4 new scenarios | ### Summary | Session | Focus | Effort | Value | |---------|-------|--------|-------| | 1 | Regression triage | 3–4h | ✅ Done: A-3/A-4 fixed; A-1/A-2 deferred to E2E run | | 2 | Benchmark infrastructure | 8–12h | ✅ Done: I-1c, I-2, I-3, I-6, I-8 | | 3 | Statement-level triggers | 12–16h | ✅ Done: 50–80% write-side overhead reduction | | 4 | Parallel refresh | 16–24h | ✅ Done: C-1/C-2/C-3 — level-parallel dispatch | | 5 | MERGE optimization | 8–12h | ✅ Done: D-1/D-2/D-3 — xxh64, saturation bypass, cost model | | 6 | Advanced benchmarks | 8–12h | ✅ Done: I-4/I-5/I-7 — comparison tool, concurrency, operators | | **Total** | | **55–80h** | | ### Recommended Execution Order ``` Session 1 → Verify regressions are fixed [PREREQUISITE for all] Session 2 → Benchmark infrastructure [Enables data-driven decisions] Session 3 → Statement-level triggers [Highest ROI: write-side] Session 5 → MERGE optimization [Read-side: strategy selection] Session 4 → Parallel refresh [Multi-ST scaling] Session 6 → Advanced benchmarks [Long-term observability] ``` Session 3 (statement-level triggers) is placed before Session 4 (parallel refresh) because it delivers measurable improvement to every deployment, while parallel refresh only helps multi-ST deployments. --- ## Appendix: Relationship to Other Plans | Plan | Relationship | |------|-------------| | [PLAN_PERFORMANCE_PART_8.md](PLAN_PERFORMANCE_PART_8.md) | Direct predecessor. Part 8's regression analysis (A-1 through A-3) feeds Session 1 here. | | [REPORT_PARALLELIZATION.md](REPORT_PARALLELIZATION.md) | Session 4 (parallel refresh) implements Options A+B from this report. | | [PLAN_TRIGGERS_OVERHEAD.md](PLAN_TRIGGERS_OVERHEAD.md) | Session 3 (statement-level triggers) builds on this benchmark design. | | [STATUS_PERFORMANCE.md](STATUS_PERFORMANCE.md) | Historical tracking. Update after each session completes. | | [GAP_SQL_PHASE_7.md](../sql/GAP_SQL_PHASE_7.md) | §F27 (expose adaptive threshold), §G4.6 (wide table MERGE), §G8.5 (sequential processing). | | [GAP_SQL_PHASE_7_QUESTIONS.md](../sql/GAP_SQL_PHASE_7_QUESTIONS.md) | Q7 (PgBouncer: FOR UPDATE SKIP LOCKED) affects parallel refresh locking strategy. |