# pg_trickle — Benchmark Guide

This document explains how the database-level refresh benchmarks work and how to interpret their output.

---

## Overview

The benchmark suite in `tests/e2e_bench_tests.rs` measures **wall-clock refresh time** for **FULL** vs **INCREMENTAL** mode across a matrix of table sizes, change rates, and query complexities. Each benchmark spawns an isolated PostgreSQL 18.x container via Testcontainers, ensuring reproducible and interference-free measurements.

The core question the benchmarks answer:

> **How much faster is an INCREMENTAL refresh compared to a FULL refresh, given a specific workload?**

---

## Prerequisites

Build the E2E test Docker image before running any benchmarks:

```bash
./tests/build_e2e_image.sh
```

Docker must be running on the host.

---

## Running Benchmarks

All benchmark tests are tagged `#[ignore]` so they are skipped during normal CI. The `--nocapture` flag is required to see the printed output tables.

### Quick Spot Checks (~5–10 seconds each)

```bash
# Simple scan, 10K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_scan_10k_1pct

# Aggregate query, 100K rows, 1% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_aggregate_100k_1pct

# Join + aggregate, 100K rows, 10% change rate
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_join_agg_100k_10pct
```

### Zero-Change Latency (~5 seconds)

```bash
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_no_data_refresh_latency
```

### Full Matrix (~15–30 minutes)

Runs all 30 combinations and prints a consolidated summary:

```bash
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture bench_full_matrix
```

### Run All Benchmarks in Parallel

```bash
cargo test --test e2e_bench_tests --features pg18 -- --ignored --nocapture
```

Note: each test starts its own container, so parallel execution requires sufficient Docker resources.

---

## Benchmark Dimensions

### Table Sizes

| Size | Rows | Purpose |
|------|------|---------|
| Small | 10,000 | Fast iteration; measures per-row overhead |
| Medium | 100,000 | More realistic; reveals scaling characteristics |

### Change Rates

| Rate | Description |
|------|-------------|
| 1% | Low churn — the sweet spot for incremental refresh |
| 10% | Moderate churn — tests delta query scalability |
| 50% | High churn — stress test; approaches full-refresh cost |

### Query Complexities

| Scenario | Defining Query | Operators Tested |
|----------|---------------|------------------|
| **scan** | `SELECT id, region, category, amount, score FROM src` | Table scan only |
| **filter** | `SELECT id, region, amount FROM src WHERE amount > 5000` | Scan + filter (WHERE) |
| **aggregate** | `SELECT region, SUM(amount), COUNT(*) FROM src GROUP BY region` | Scan + group-by aggregate |
| **join** | `SELECT s.id, s.region, s.amount, d.region_name FROM src s JOIN dim d ON ...` | Scan + inner join |
| **join_agg** | `SELECT d.region_name, SUM(s.amount), COUNT(*) FROM src s JOIN dim d ON ... GROUP BY ...` | Scan + join + aggregate |

### DML Mix per Cycle

Each change cycle applies a realistic mix of operations:

| Operation | Fraction | Example at 10K rows, 10% rate |
|-----------|----------|-------------------------------|
| UPDATE | 70% | 700 rows have `amount` incremented |
| DELETE | 15% | 150 rows removed |
| INSERT | 15% | 150 new rows added |

---

## What Each Benchmark Does

```
1. Start a fresh PostgreSQL 18.x container
2. Install the pg_trickle extension
3. Create and populate the source table (10K or 100K rows)
4. Create dimension table if needed (for join scenarios)
5. ANALYZE for stable query plans

── FULL mode ──
6. Create a Stream Table in FULL refresh mode
7. For each of 3 cycles:
   a. Apply random DML (updates + deletes + inserts)
   b. ANALYZE
   c. Time the FULL refresh (TRUNCATE + re-execute entire query)
   d. Record refresh_ms and ST row count
8. Drop the FULL-mode ST

── INCREMENTAL mode ──
9. Reset source table to same starting state
10. Create a Stream Table in INCREMENTAL refresh mode
11. For each of 3 cycles:
    a. Apply random DML (same parameters)
    b. ANALYZE
    c. Time the INCREMENTAL refresh (delta query + MERGE)
    d. Record refresh_ms and ST row count

12. Print results table and summary
```

Both modes start from the same data to ensure a fair comparison. The 3-cycle design captures warm-up effects (cycle 1 may be slower due to plan caching).

---

## Reading the Output

### Detail Table

```
╔══════════════════════════════════════════════════════════════════════════════════════╗
║                    pg_trickle Refresh Benchmark Results                      ║
╠════════════╤══════════╤════════╤═════════════╤═══════╤════════════╤═════════════════╣
║ Scenario   │ Rows     │ Chg %  │ Mode        │ Cycle │ Refresh ms │ ST Rows         ║
╠════════════╪══════════╪════════╪═════════════╪═══════╪════════════╪═════════════════╣
║ aggregate  │    10000 │     1% │ FULL        │     1 │       22.1 │               5 ║
║ aggregate  │    10000 │     1% │ FULL        │     2 │        4.8 │               5 ║
║ aggregate  │    10000 │     1% │ FULL        │     3 │        5.3 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     1 │        8.4 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     2 │        4.4 │               5 ║
║ aggregate  │    10000 │     1% │ INCREMENTAL │     3 │        4.6 │               5 ║
╚════════════╧══════════╧════════╧═════════════╧═══════╧════════════╧═════════════════╝
```

| Column | Meaning |
|--------|---------|
| **Scenario** | Query complexity level (scan, filter, aggregate, join, join_agg) |
| **Rows** | Number of rows in the base table |
| **Chg %** | Percentage of rows changed per cycle |
| **Mode** | FULL (truncate + recompute) or INCREMENTAL (delta + merge) |
| **Cycle** | Which of the 3 measurement rounds (cycle 1 often includes warm-up) |
| **Refresh ms** | Wall-clock time for the refresh operation |
| **ST Rows** | Row count in the Stream Table after refresh (sanity check) |

### Summary Table

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        Summary (avg ms per cycle)                       │
├────────────┬──────────┬────────┬─────────────────┬──────────────────────┤
│ Scenario   │ Rows     │ Chg %  │ FULL avg ms     │ INCREMENTAL avg ms   │
├────────────┼──────────┼────────┼─────────────────┼──────────────────────┤
│ aggregate  │    10000 │     1% │       10.7       │        5.8 (  1.8x) │
└────────────┴──────────┴────────┴─────────────────┴──────────────────────┘
```

The **Speedup** value in parentheses is `FULL avg / INCREMENTAL avg` — how many times faster the incremental refresh is compared to a full refresh.

---

## Interpreting the Speedup

### What to Expect

| Change Rate | Table Size | Expected Speedup | Explanation |
|-------------|------------|-------------------|-------------|
| 1% | 10K | 1.5–5x | Small table; overhead is similar, delta is tiny |
| 1% | 100K | 5–50x | Larger table amplifies full-refresh cost |
| 10% | 100K | 2–10x | Moderate delta; still significantly faster |
| 50% | any | 1–2x | Delta is nearly as large as full table |

### Rules of Thumb

| Speedup | Interpretation |
|---------|---------------|
| **> 10x** | Strong win for INCREMENTAL — typical at low change rates on larger tables |
| **5–10x** | Clear advantage for INCREMENTAL |
| **2–5x** | Moderate advantage — INCREMENTAL is the right choice |
| **1–2x** | Marginal gain — either mode is acceptable |
| **~1x** | Break-even — change rate is too high for incremental to help |
| **< 1x** | INCREMENTAL is slower — would indicate overhead exceeds savings (investigate) |

### Key Patterns to Look For

1. **Scaling with table size**: For the same change rate, speedup should increase with table size. FULL must re-process all rows; INCREMENTAL processes only the delta.

2. **Degradation with change rate**: As change rate rises from 1% → 50%, speedup should decrease. At 50%, INCREMENTAL processes half the table which approaches FULL cost.

3. **Query complexity amplifies speedup**: Aggregate and join queries benefit more from INCREMENTAL because they avoid expensive re-computation. A join_agg at 1% changes should show higher speedup than a simple scan at the same parameters.

4. **Cycle 1 warm-up**: The first cycle in each mode may be slower due to PostgreSQL plan cache population. Use cycles 2–3 for the steadiest numbers.

5. **ST Rows consistency**: The ST row count should be similar between FULL and INCREMENTAL for the same scenario (accounting for random DML). Large discrepancies indicate a correctness issue.

---

## Zero-Change Latency

The `bench_no_data_refresh_latency` test measures the overhead of a refresh when **no data has changed** — the NO_DATA code path.

```
┌──────────────────────────────────────────────┐
│ NO_DATA Refresh Latency (10 iterations)      │
├──────────────────────────────────────────────┤
│ Avg:     3.21 ms                             │
│ Max:     5.10 ms                             │
│ Target: < 10 ms                              │
│ Status: ✅ PASS                              │
└──────────────────────────────────────────────┘
```

| Metric | Meaning |
|--------|---------|
| **Avg** | Average wall-clock time across 10 no-op refreshes |
| **Max** | Worst-case single iteration |
| **Target** | The [PLAN.md](../plans/PLAN.md) goal: < 10 ms per no-op refresh |
| **Status** | PASS if avg < 10 ms, SLOW otherwise |

A passing result confirms the scheduler's per-cycle overhead is negligible. Values > 10 ms in containerized environments may be acceptable due to Docker overhead; bare-metal PostgreSQL should comfortably meet the target.

---

## Available Tests

### Individual Tests (10K rows)

| Test Name | Scenario | Change Rate |
|-----------|----------|-------------|
| `bench_scan_10k_1pct` | scan | 1% |
| `bench_scan_10k_10pct` | scan | 10% |
| `bench_scan_10k_50pct` | scan | 50% |
| `bench_filter_10k_1pct` | filter | 1% |
| `bench_aggregate_10k_1pct` | aggregate | 1% |
| `bench_join_10k_1pct` | join | 1% |
| `bench_join_agg_10k_1pct` | join_agg | 1% |

### Individual Tests (100K rows)

| Test Name | Scenario | Change Rate |
|-----------|----------|-------------|
| `bench_scan_100k_1pct` | scan | 1% |
| `bench_scan_100k_10pct` | scan | 10% |
| `bench_scan_100k_50pct` | scan | 50% |
| `bench_aggregate_100k_1pct` | aggregate | 1% |
| `bench_aggregate_100k_10pct` | aggregate | 10% |
| `bench_join_agg_100k_1pct` | join_agg | 1% |
| `bench_join_agg_100k_10pct` | join_agg | 10% |

### Special Tests

| Test Name | Description |
|-----------|-------------|
| `bench_full_matrix` | All 30 combinations (5 queries × 2 sizes × 3 rates) |
| `bench_no_data_refresh_latency` | Zero-change overhead (10 iterations) |

---

## In-Process Micro-Benchmarks (Criterion.rs)

In addition to the E2E database benchmarks, the project includes two **Criterion.rs** benchmark suites that measure pure Rust computation time without database overhead. These are useful for tracking performance regressions in the internal query-building and IVM differentiation logic.

### Benchmark Suites

#### `refresh_bench` — Utility Functions

`benches/refresh_bench.rs` benchmarks the low-level helper functions used during refresh operations:

| Benchmark Group | What It Measures |
|----------------|------------------|
| **quote_ident** | PostgreSQL identifier quoting speed |
| **col_list** | Column list SQL generation |
| **prefixed_col_list** | Prefixed column list generation (e.g., `NEW.col`) |
| **expr_to_sql** | AST expression → SQL string conversion |
| **output_columns** | Output column extraction from parsed queries |
| **source_oids** | Source table OID resolution |
| **lsn_gt** | LSN comparison expression generation |
| **frontier_json** | Frontier state JSON serialization |
| **canonical_period** | Interval parsing and canonicalization |
| **dag_operations** | DAG topological sort and cycle detection |
| **xxh64** | xxHash-64 hashing throughput |

#### `diff_operators` — IVM Operator Differentiation

`benches/diff_operators.rs` benchmarks the delta SQL generation for every IVM operator. Each benchmark creates a realistic operator tree and measures `differentiate()` throughput:

| Benchmark Group | What It Measures |
|----------------|------------------|
| **diff_scan** | Table scan differentiation (3, 10, 20 columns) |
| **diff_filter** | Filter (WHERE) differentiation |
| **diff_project** | Projection (SELECT subset) differentiation |
| **diff_aggregate** | GROUP BY aggregate differentiation (simple + complex) |
| **diff_inner_join** | Inner join differentiation |
| **diff_left_join** | Left outer join differentiation |
| **diff_distinct** | DISTINCT differentiation |
| **diff_union_all** | UNION ALL differentiation (2, 5, 10 children) |
| **diff_window** | Window function differentiation |
| **diff_join_aggregate** | Composite join + aggregate pipeline |
| **differentiate_full** | Full `differentiate()` call for scan-only and filter+scan trees |

### Running Micro-Benchmarks

```bash
# Run all Criterion benchmarks
just bench

# Run only refresh utility benchmarks
cargo bench --bench refresh_bench --features pg18

# Run only IVM diff operator benchmarks
just bench-diff
# or equivalently:
cargo bench --bench diff_operators --features pg18

# Output in Bencher-compatible format (for CI integration)
just bench-bencher
```

### Output and Reports

Criterion produces statistical analysis for each benchmark including:

- **Mean** and **standard deviation** of execution time
- **Throughput** (iterations/sec)
- **Comparison with previous run** — reports improvements/regressions with confidence intervals

HTML reports are generated in `target/criterion/` with interactive charts showing distributions and regression history. Open `target/criterion/report/index.html` to browse all results.

Sample output:

```
diff_scan/3_columns   time:   [11.834 µs 12.074 µs 12.329 µs]
diff_scan/10_columns  time:   [16.203 µs 16.525 µs 16.869 µs]
diff_aggregate/simple time:   [21.447 µs 21.862 µs 22.301 µs]
diff_inner_join       time:   [25.919 µs 26.421 µs 26.952 µs]
```

---

## Continuous Benchmarking with Bencher

[Bencher](https://bencher.dev) provides continuous benchmark tracking in CI, detecting performance regressions on pull requests before they merge.

### How It Works

The `.github/workflows/benchmarks.yml` workflow:

1. **On `main` pushes** — runs both Criterion suites and uploads results to Bencher as the baseline. This establishes the expected performance for each benchmark.

2. **On pull requests** — runs the same benchmarks and compares against the `main` baseline using a **Student's t-test** with a 99% upper confidence boundary. If any benchmark regresses beyond the threshold, the PR check fails.

### Setup

To enable Bencher for your fork or deployment:

1. **Create a Bencher account** at [bencher.dev](https://bencher.dev) and create a project.

2. **Add the API token** as a GitHub Actions secret:
   - Go to **Settings → Secrets and variables → Actions**
   - Add `BENCHER_API_TOKEN` with your Bencher API token

3. **Update the project slug** in `.github/workflows/benchmarks.yml` if your Bencher project name differs from `pg-trickle`.

The workflow gracefully degrades — if `BENCHER_API_TOKEN` is not set, benchmarks still run and upload artifacts but skip Bencher tracking.

### Local Bencher-Format Output

To see what Bencher would receive from CI:

```bash
just bench-bencher
```

This runs both suites with `--output-format bencher`, producing JSON output compatible with `bencher run`.

### Dashboard

Once configured, the Bencher dashboard shows:

- **Historical trends** for every benchmark across commits
- **Statistical thresholds** with configurable alerting
- **PR annotations** highlighting which benchmarks regressed and by how much

---

## Troubleshooting

| Issue | Resolution |
|-------|-----------|
| `docker: command not found` | Install Docker Desktop and ensure it is running |
| Container startup timeout | Increase Docker memory allocation (≥ 4 GB recommended) |
| `image not found` | Run `./tests/build_e2e_image.sh` to build the test image |
| Highly variable timings | Close other workloads; use `--test-threads=1` to avoid container contention |
| SLOW status on latency test | Expected in Docker; bare-metal should pass < 10 ms |