# pg_trickle — Performance History

How has incremental (DIFFERENTIAL) refresh performance improved over the optimization sessions, and what is the current state?

---

## Benchmark Setup

All benchmarks use the same full matrix: **5 scenarios × 2 table sizes × 3 change rates = 30 combinations**, run inside isolated PostgreSQL 18.x Docker containers via Testcontainers. Each benchmark applies a realistic DML mix (INSERT/UPDATE/DELETE) per cycle and measures wall-clock refresh time.

| Scenario | Defining Query | Operators Tested |
|----------|---------------|------------------|
| **scan** | `SELECT id, region, category, amount, score FROM src` | Table scan only |
| **filter** | `SELECT id, region, amount FROM src WHERE amount > 5000` | Scan + filter |
| **aggregate** | `SELECT region, SUM(amount), COUNT(*) FROM src GROUP BY region` | Scan + group-by aggregate |
| **join** | `SELECT s.id, s.region, s.amount, d.region_name FROM src s JOIN dim d ON ...` | Scan + inner join |
| **join_agg** | `SELECT d.region_name, SUM(s.amount), COUNT(*) FROM src s JOIN dim d ON ... GROUP BY ...` | Scan + join + aggregate |

Table sizes: **10,000** and **100,000** rows. Change rates: **1%**, **10%**, **50%**.

---

## Snapshot Timeline

Four benchmark snapshots were captured across the optimization sessions:

| Snapshot | What Changed |
|----------|-------------|
| **Baseline** | No optimizations |
| **After P1+P2** | Phase A: `to_jsonb()` trigger serialization, drop unused columns, LSN bounds. Phase B: cached MERGE SQL + cached OpTree/delta template |
| **After P6** | + `LATERAL VALUES` aggregate rewrite, PK resolution, no-change short-circuit, EXISTS early-exit, prepared statements |
| **After P7** | + deferred cleanup, warm-cycle tuning |

---

## Optimization Details

### Phase A — Quick Wins (P1)

- **Fix 3:** Replaced `row_to_json()::text` with `to_jsonb()` in CDC trigger functions for faster serialization.
- **Fix 4:** Dropped the unused `old_row_data` column from change buffer tables, reducing I/O.
- **Fix 11:** Added an upper LSN bound to change buffer queries and cleanup, preventing unbounded scans.

### Phase B — Caching (P2)

- **Fix 9:** Cached the compiled MERGE SQL across refresh cycles, eliminating repeated query planning (~30ms saved per refresh).
- **Fix 10:** Cached the parsed OpTree and delta SQL template, avoiding redundant parsing and differentiation (~15ms saved per refresh).

### Phase C — Operator Improvements (P3–P6)

- **Fix 5:** Replaced the 4-branch UNION ALL in the aggregate final delta CTE with a single-pass `LATERAL VALUES` expansion — eliminates 3 redundant scans of the merge CTE.
- **Fix 6:** Primary key columns are resolved from `pg_constraint` at parse time and stored in `OpTree::Scan`, preferring real PKs over the non-nullable column heuristic.
- **No-change short-circuit:** When the change buffer is empty, skip the MERGE entirely and return in <2ms.
- **EXISTS early-exit:** Before building the delta query, run a cheap `SELECT EXISTS(...)` on the change buffer to detect no-ops.
- **Prepared statements:** Reuse prepared MERGE statements across cycles to avoid repeated planning.

### Phase D — Cleanup & Tuning (P7)

- **Deferred cleanup:** Change buffer cleanup moved to after the MERGE completes, reducing lock contention during the critical path.
- **Warm-cycle tuning:** Improved steady-state performance for cycles 2+ after the initial cold start.

---

## Historical Speedup: INCREMENTAL vs FULL

### At 1% Change Rate (Typical Production Workload)

| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|----------|------|----------|-------------|----------|---------------------|
| **scan** | 10K | 2.6× | 2.6× | 10.1× | 4.4× |
| **scan** | 100K | **0.7×** (slower!) | 2.7× | **41.7×** | 7.0× |
| **filter** | 10K | 1.9× | 1.7× | 5.4× | 3.3× |
| **filter** | 100K | 2.5× | 2.7× | **26.3×** | 7.9× |
| **aggregate** | 10K | 1.4× | 1.2× | 1.7× | 1.5× |
| **aggregate** | 100K | 1.3× | 1.4× | **4.4×** | 2.6× |
| **join** | 10K | 2.3× | 1.6× | 6.7× | 2.9× |
| **join** | 100K | 2.6× | 2.7× | **30.4×** | 6.0× |
| **join_agg** | 10K | 1.0× | 1.0× | 0.9× | 0.7× |
| **join_agg** | 100K | 1.3× | 1.5× | 4.9× | 3.1× |

### At 10% Change Rate

| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|----------|------|----------|-------------|----------|---------------------|
| **scan** | 10K | 2.1× | 2.2× | 6.4× | 5.0× |
| **scan** | 100K | 0.8× | 1.8× | 4.5× | 3.3× |
| **filter** | 10K | 1.4× | 2.2× | 3.3× | 1.9× |
| **filter** | 100K | 1.6× | 1.6× | 3.5× | 3.1× |
| **aggregate** | 10K | 0.9× | 0.8× | 1.2× | 1.1× |
| **aggregate** | 100K | 0.6× | 0.6× | 1.1× | 0.8× |
| **join** | 10K | 1.4× | 1.4× | 4.1× | 1.8× |
| **join** | 100K | 1.5× | 1.9× | 3.1× | 3.3× |
| **join_agg** | 10K | 0.8× | 0.7× | 1.8× | 0.9× |
| **join_agg** | 100K | 0.7× | 0.7× | 0.9× | 0.8× |

### At 50% Change Rate

| Scenario | Rows | Baseline | After P1+P2 | After P6 | After P7 (Current) |
|----------|------|----------|-------------|----------|---------------------|
| **scan** | 10K | 1.1× | 1.1× | 3.1× | 2.5× |
| **scan** | 100K | 0.9× | 0.9× | 0.9× | 0.8× |
| **filter** | 10K | 1.2× | 1.2× | 0.7× | 0.9× |
| **filter** | 100K | 1.1× | 1.8× | 1.0× | 1.1× |
| **aggregate** | 10K | 1.2× | 1.7× | 1.6× | 1.0× |
| **aggregate** | 100K | 0.9× | 0.9× | 0.8× | 0.6× |
| **join** | 10K | 1.1× | 1.1× | 1.1× | 1.0× |
| **join** | 100K | 1.1× | 1.1× | 0.9× | 0.9× |
| **join_agg** | 10K | 1.3× | 1.1× | 0.6× | 1.4× |
| **join_agg** | 100K | 0.7× | 0.9× | 0.8× | 0.7× |

---

## Current State Summary (After P7)

### At 1% Change Rate (Typical Production Workload)

| Scenario | 10K rows | 100K rows |
|----------|----------|-----------|
| scan | **4.4×** faster | **7.0×** faster |
| filter | **3.3×** faster | **7.9×** faster |
| aggregate | **1.5×** faster | **2.6×** faster |
| join | **2.9×** faster | **6.0×** faster |
| join+agg | 0.7× (overhead) | **3.1×** faster |

### At 10% Change Rate

| Scenario | 10K rows | 100K rows |
|----------|----------|-----------|
| scan | **5.0×** faster | **3.3×** faster |
| filter | **1.9×** faster | **3.1×** faster |
| aggregate | 1.1× | 0.8× |
| join | **1.8×** faster | **3.3×** faster |
| join+agg | 0.9× | 0.8× |

### At 50% Change Rate

At 50% churn, INCREMENTAL approaches or slightly exceeds FULL cost across most scenarios (0.6–2.5×). This is expected — when half the table changes, the delta is nearly as expensive as a full recompute.

---

## Absolute Timings (Current, 100K Rows, 1% Changes)

| Scenario | FULL avg ms | INCREMENTAL avg ms | Speedup |
|----------|-------------|---------------------|---------|
| scan | 326 | 47 | 7.0× |
| filter | 267 | 34 | 7.9× |
| aggregate | 28 | 11 | 2.6× |
| join | 384 | 64 | 6.0× |
| join_agg | 67 | 22 | 3.1× |

---

## Interpretation

### What the Numbers Show

1. **Baseline was rough** — INCREMENTAL was actually *slower* than FULL for scan/100K (0.7×), and several other combos were near 1.0×. The overhead of delta computation, MERGE, and change buffering exceeded the cost of simply re-running the query.
2. **P1+P2 (caching)** fixed the worst regressions. Caching the compiled MERGE SQL and parsed OpTree eliminated ~45ms of repeated planning per cycle. Large-table scan went from 0.7× to 2.7×.
3. **P6 was the breakthrough** — the no-change short-circuit, EXISTS early-exit, and prepared statements produced the most dramatic gains. At 1% changes on 100K rows:
   - scan: **41.7×** (346ms → 8.3ms)
   - filter: **26.3×** (196ms → 7.4ms)
   - join: **30.4×** (374ms → 12.3ms)

   These extreme multipliers reflect that in a 10-cycle run, most cycles have **no changes**: the DML happens once, then subsequent cycles find empty change buffers and short-circuit in <2ms.
4. **P7 (current)** shows somewhat lower peak numbers than P6, due to benchmark variability (a different run, with the cold-start cycle 1 weighing on the averages) and the added overhead of deferred cleanup. The steady-state "cycle 2+" numbers in the P6 detailed output are the most representative of real-world performance.

### Theoretical Basis

The larger the table and the smaller the change rate, the bigger the INCREMENTAL win. Differential maintenance is O(Δ) while full refresh is O(N), so the ratio improves as N/Δ grows. At 100K rows / 1% changes the system processes ~1,000 changed rows instead of re-scanning 100,000, yielding **3–8× speedups** in practice (and up to 30–40× when most cycles are no-ops).

### When INCREMENTAL Doesn't Help

- **50% change rate on large tables:** The delta is nearly as big as the table itself.
- **Small aggregates (join_agg at 10K):** The FULL refresh is already ~12ms; the overhead of delta computation, the MERGE template, and cleanup exceeds the savings.
- **Aggregate 100K at 10–50%:** Group-rescan aggregates require reading the stream table to reconstruct counts, which at high change rates approaches full-table cost.

---

## Running Benchmarks

```bash
# Build the E2E test image first
./tests/build_e2e_image.sh

# Quick spot check
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_scan_10k_1pct

# Full 30-combination matrix (~15–30 minutes)
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_full_matrix
```

See [BENCHMARK.md](../../docs/BENCHMARK.md) for full details on the benchmark harness.

---

## Design Decisions & Root Causes

This section preserves the key technical decisions and root-cause findings from the performance analysis phases (PLAN_PERFORMANCE through PLAN_PERFORMANCE_PART_7). Those files were deleted after the information below was extracted.

### Original Root Cause (Phase 0)

The initial benchmark showed INCREMENTAL was *slower* than FULL for scan/100K (0.7×).
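That shape of regression — incremental slower than full — usually means the incremental path is not actually running. A minimal Rust sketch of the failure mode, with hypothetical names (the real entry point and its internals are not shown in this document): a refresh function that accepts a mode parameter but never consults it.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum RefreshMode {
    Full,
    Incremental,
}

// Hypothetical reconstruction of the Phase 0 bug: `mode` is accepted but
// never consulted, so every "incremental" refresh still pays the full
// TRUNCATE + INSERT cost, plus the change-buffer overhead on top.
fn refresh_strategy_buggy(_mode: RefreshMode) -> &'static str {
    "truncate+insert" // always the FULL path
}

// The fix: dispatch on the requested mode.
fn refresh_strategy_fixed(mode: RefreshMode) -> &'static str {
    match mode {
        RefreshMode::Full => "truncate+insert",
        RefreshMode::Incremental => "delta+merge",
    }
}
```

With the buggy version, the benchmark was effectively comparing FULL against FULL-plus-overhead, which is exactly the sub-1.0× ratio observed.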
Root cause: `execute_manual_refresh()` ignored the `refresh_mode` parameter and always performed TRUNCATE + INSERT — effectively comparing FULL vs FULL with extra overhead. This was the single most impactful bug discovered.

### MERGE vs Temp Table (Phase A)

The original incremental apply used a 3-step SPI sequence: create a temp table, DELETE matching rows, INSERT new rows. This was replaced with a single PostgreSQL `MERGE` statement that handles INSERT/UPDATE/DELETE in one operation. Result: 3 SPI calls → 1, ~30% less I/O, atomic application.

### JSONB Serialization Round-Trip (Phases A–B)

JSONB serialization was identified as the **dominant bottleneck** — not query planning, as initially suspected. The full path: `row → JSON text → binary JSONB → store → read → typed row`. Two key fixes:

1. **Trigger serialization:** Replaced `row_to_json()::jsonb` with `to_jsonb(NEW)` — eliminates the text→binary conversion step.
2. **Row reconstruction:** Replaced per-column `->>'col'` extraction with `jsonb_populate_record()` — a single function call reconstructs the entire row instead of N separate field extractions.

### SPI Call Reduction (Phases B–C)

The no-op refresh path originally executed **7+ SPI round-trips** (13+ through the scheduler) even with zero pending changes. Key eliminations:

- **Redundant `SELECT count(*)`:** A full sequential scan of every change buffer was purely informational and served no purpose in the trigger-based CDC model. Replaced with `SELECT EXISTS(... LIMIT 1)`.
- **Combined SPI queries:** The `reltuples` and capped-count queries were merged via `LATERAL`; `store_frontier` and `complete_refresh` were merged into a single UPDATE.
- **No-op latency:** 86ms → 14.8ms → sub-2ms.

### Why PREPARE/EXECUTE Was Rejected (Phase D)

Prepared statements (`PREPARE` + `EXECUTE`) were implemented and then **reverted**. Root cause analysis:

1. **Custom plans (first ≤5 executions)** have the **same planning cost** as a direct `SPI_execute()`, plus the overhead of looking up the prepared statement and binding arguments. No benefit for short-cycle benchmarks (CYCLES ≤ 5).
2. **Generic plans (≥6 executions)** skip re-planning but use **wrong selectivity** for LSN range predicates — without concrete LSN values, the planner falls back to average-case cardinality estimates, leading to wrong sort strategies for GROUP BY and wrong join methods.
3. **String substitution** with concrete parameter values provides correct plans cheaply on every call.

### Scan Window Function Fast Path (Phase E)

At 1% changes, ~95%+ of PKs have exactly one change per refresh cycle. The expensive `FIRST_VALUE`/`LAST_VALUE` window functions (which require a full partition sort) are unnecessary for these single-change PKs. A 4-CTE split pipeline (`pk_stats → single → multi_raw → scan_raw`) routes single-change PKs through a direct emit path, skipping window functions entirely. Impact: scan 100K/1% went from 46.5ms → 8.3ms (5.6×).

### Aggregate Dedup Elimination (Phase E)

Aggregates originally emitted D+I (delete+insert) row pairs per updated group, requiring a `DISTINCT ON` sort in the MERGE wrapper. A CASE-based single-row emit marks each group with the correct action directly (`'I'` for new groups, `'D'` for deleted groups, `'I'` for updated groups), making the aggregate output inherently deduplicated. This eliminates the O(n log n) sort.

### Join Hash Simplification (Phase E)

The original join row IDs used nested hash calls: `pg_trickle_hash_multi(ARRAY[dl.__pgt_row_id::TEXT, pg_trickle_hash_multi(ARRAY[r."id"::TEXT])::TEXT])` — 2 Rust FFI crossings + 3 TEXT casts per row. This was flattened to a single `pg_trickle_hash_multi(ARRAY[...])` call, reducing 2 hash calls to 1 per row.

### Where Time is Spent (Current)

With pipeline overhead reduced to sub-2ms on cache hits, **MERGE execution dominates** all scenarios (69–93% of total refresh time).
The remaining performance is determined by PostgreSQL's execution of the delta CTE chain and the MERGE statement — not by the extension's orchestration layer.

| Phase | 100K/1% | 100K/10% |
|-------|---------|----------|
| Decision | <1ms | <2ms |
| Gen+Build | <1ms | <1ms |
| **MERGE** | **3–8ms** | **13–98ms** |
| Cleanup | <1ms | 3–5ms |

### Warm-Cycle Performance (INCR cycles 2+, excludes cold start)

| Scenario | 10K/1% | 100K/1% | 100K/10% |
|----------|--------|---------|----------|
| scan | 6.9ms | 12.8ms | 107.8ms |
| filter | 7.8ms | 17.0ms | 54.3ms |
| join | 10.0ms | 29.9ms | 111.7ms |
| aggregate | 6.0ms | 8.3ms | 40.7ms |
| join_agg | 17.5ms | 18.5ms | 52.9ms |

### Full Evolution (scan 100K/1% — flagship scenario)

| Phase | INCR ms | INCR/FULL Ratio | Key Change |
|-------|---------|-----------------|------------|
| Baseline | 572.4 | 0.7× | Bug: manual refresh ignored mode |
| After P1+P2 | ~135 | 2.7× | Caching + JSONB fixes |
| After P3 | ~135 | 2.5× | SPI reduction (count → EXISTS) |
| After P5 (P7) | 46.5 | 7.0× | LATERAL VALUES, deferred cleanup |
| After P6 (Part 6) | **8.3** | **41.7×** | Scan fast path, join hash, aggregate dedup |

### Adaptive Threshold

The system automatically falls back from INCREMENTAL to FULL refresh when the change rate exceeds a configurable threshold (default: 15%). This prevents the pathological case where delta processing on a large change set is slower than full recomputation. The threshold can be tuned per stream table via the `pg_trickle.differential_max_change_ratio` GUC.
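The fallback decision itself is simple ratio arithmetic. A minimal Rust sketch under stated assumptions — the GUC name is from the source, but the function name, signature, and zero-row handling here are hypothetical:

```rust
/// Hypothetical sketch of the adaptive fallback decision. `max_change_ratio`
/// stands in for the pg_trickle.differential_max_change_ratio GUC
/// (default 0.15, i.e. 15%).
fn choose_refresh_mode(changed_rows: u64, total_rows: u64, max_change_ratio: f64) -> &'static str {
    if total_rows == 0 {
        // Assumed behavior: with no baseline row count, a full recompute
        // is trivially cheap, so fall back to FULL.
        return "FULL";
    }
    let ratio = changed_rows as f64 / total_rows as f64;
    if ratio > max_change_ratio {
        "FULL" // delta processing would approach full-recompute cost
    } else {
        "INCREMENTAL"
    }
}
```

At the flagship workload (100K rows, 1% changes) the ratio is 0.01, well under the 0.15 default, so the incremental path is kept; at 50% churn the fallback fires, matching the near-1.0× ratios in the 50% tables above.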
---

## Part 9 — Regression Fixes & Benchmark Infrastructure (Sessions 1–2)

**Date:** 2026-03-15

### Rust-Side Micro-Optimizations

| Function | Before | After | Change | Root Cause |
|----------|--------|-------|--------|------------|
| `prefixed_col_list/20` | ~4.0 µs | Expected ~2.8 µs | ~30% improvement | Eliminated an intermediate `Vec` allocation; stream directly into the `String` |
| `lsn_gt("0/0", "0/1")` | ~79 ns | Expected ~55 ns | ~30% improvement | Replaced `split('/').collect::<Vec<_>>()` with `split_once('/')` |
| `col_list/10` | ~1.37 µs | Expected ~1.1 µs | ~20% improvement | Same optimization as `prefixed_col_list` |

### Benchmark Infrastructure Improvements

| ID | Improvement | Impact |
|----|-------------|--------|
| I-1c | `just bench-docker` target | Run Criterion inside Docker when the local pg_stub fails |
| I-2 | `[BENCH_CYCLE]` parseable output | Enables histogram analysis and trend detection |
| I-3 | `PGS_BENCH_EXPLAIN=true` | Captures EXPLAIN (ANALYZE, BUFFERS) for delta queries |
| I-6 | 1M-row benchmark tier | Tests production-scale tables (10K changes at 1%) |
| I-8 | `sample_size(200)`, `measurement_time(10s)` | Reduces the outlier rate from 10–19% to an expected <5% |
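The `lsn_gt` row in the micro-optimization table above can be sketched as follows. Only the `split_once('/')` swap is from the source; the parsing shape (PostgreSQL LSNs as hex `X/Y` pairs) and the function bodies are assumptions for illustration:

```rust
// Sketch of the lsn_gt fast path: parse "X/Y" hex LSNs with split_once,
// which returns a (&str, &str) pair and avoids the per-call heap
// allocation of split('/').collect::<Vec<_>>().
fn parse_lsn(lsn: &str) -> Option<(u64, u64)> {
    let (hi, lo) = lsn.split_once('/')?;
    Some((u64::from_str_radix(hi, 16).ok()?, u64::from_str_radix(lo, 16).ok()?))
}

fn lsn_gt(a: &str, b: &str) -> bool {
    match (parse_lsn(a), parse_lsn(b)) {
        // Tuple ordering compares the high word first, then the low word.
        (Some(x), Some(y)) => x > y,
        _ => false,
    }
}
```

Because comparisons run once per change-buffer row, removing one small allocation per call is where the ~30% improvement plausibly comes from.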