# PLAN: External Test Suites for pg_trickle

**Status:** Proposed
**Date:** 2026-02-25
**Branch:** `main`
**Scope:** Adopt public SQL test suites and benchmarks to validate correctness and performance of stream tables beyond the project's existing test infrastructure.

---

## Table of Contents

1. [Motivation](#motivation)
2. [Suite 1: TPC-H-Derived — Analytical Workload](#suite-1-tpc-h-derived--analytical-workload)
3. [Suite 2: sqllogictest — Broad Correctness](#suite-2-sqllogictest--broad-correctness)
4. [Suite 3: Join Order Benchmark (JOB) — Real-World Data](#suite-3-join-order-benchmark-job--real-world-data)
5. [Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison](#suite-4-feldera--dbsp-benchmarks--theory-peer-comparison)
6. [Implementation Roadmap](#implementation-roadmap)
7. [Infrastructure Requirements](#infrastructure-requirements)

---

## Motivation

pg_trickle's existing test suite (878 unit tests, 22 E2E test suites) validates internal correctness. What it does **not** cover:

- **Industry-standard workloads** — are stream tables effective on the queries real analysts write?
- **Scale** — does differential refresh maintain its advantage at 1 GB, 10 GB, 100 GB?
- **Breadth** — are there SQL patterns we haven't thought to test that produce incorrect deltas?
- **Peer comparison** — how does pg_trickle's approach compare to standalone IVM engines such as Feldera on the same workload?

Public test suites address all four gaps.

---

## Suite 1: TPC-H-Derived — Analytical Workload

> **TPC-H Fair Use:** Our test suite is *derived from* the TPC-H Benchmark
> specification but does **not** constitute a TPC-H Benchmark result. We use
> a custom SQL data generator (not `dbgen`), modified queries, and
> non-standard refresh functions. "TPC-H" is a trademark of the Transaction
> Processing Performance Council ([tpc.org](https://www.tpc.org/)).

### Why a TPC-H-Derived Workload

The TPC-H specification defines an industry-standard decision-support schema and query set. Its 22 queries cover joins (up to 8 tables), aggregates, subqueries (scalar, EXISTS, IN), HAVING, CASE WHEN, and DISTINCT — all operators pg_trickle supports in DIFFERENTIAL mode. Its two **refresh functions** (RF1: bulk INSERT, RF2: bulk DELETE) directly exercise the CDC → delta refresh pipeline.

**Specification:** https://www.tpc.org/tpch/
**Our data generator:** Custom pure-SQL (`generate_series`), not `dbgen`

### Query Compatibility Analysis

All 22 TPC-H queries are compatible with pg_trickle:

| Category | Queries | Modification |
|----------|---------|--------------|
| Works as-is | Q6, Q14, Q17, Q19 | None |
| Remove ORDER BY only | Q1, Q4, Q5, Q7, Q8, Q9, Q11, Q12, Q13, Q15, Q16, Q20, Q22 | Cosmetic — ORDER BY is silently ignored anyway |
| TopK (ORDER BY + LIMIT) | Q2, Q3, Q10, Q18, Q21 | ✅ Supported — ORDER BY + LIMIT is now accepted as a TopK pattern |

- **0 queries are blocked** by unsupported SQL features.
- **Q15** additionally requires inlining its `CREATE VIEW` as a CTE or subquery, since pg_trickle requires a single SELECT statement — see the sketch below.
- No TPC-H query uses GROUPING SETS, NATURAL JOIN, or recursive CTEs.
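For concreteness, here is a sketch of the Q15 rewrite: the `revenue0` view from the spec is inlined as a CTE so the defining query becomes a single SELECT. Parameter values follow the spec's qualification defaults; the trailing ORDER BY is dropped as cosmetic, per the table above.

```sql
-- TPC-H Q15 (Top Supplier) with the revenue0 view inlined as a CTE.
WITH revenue0 (supplier_no, total_revenue) AS (
    SELECT l_suppkey,
           SUM(l_extendedprice * (1 - l_discount))
    FROM lineitem
    WHERE l_shipdate >= DATE '1996-01-01'
      AND l_shipdate <  DATE '1996-01-01' + INTERVAL '3' MONTH
    GROUP BY l_suppkey
)
SELECT s_suppkey, s_name, s_address, s_phone, total_revenue
FROM supplier, revenue0
WHERE s_suppkey = supplier_no
  AND total_revenue = (SELECT MAX(total_revenue) FROM revenue0);
```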
### Detailed Query Matrix

| # | Name | Key SQL Features | Modification |
|---|------|------------------|--------------|
| Q1 | Pricing Summary | GROUP BY, SUM/AVG/COUNT, WHERE (date) | None |
| Q2 | Minimum Cost Supplier | Correlated scalar subquery (MIN), 8-table join | TopK |
| Q3 | Shipping Priority | 3-table join, GROUP BY, SUM | TopK |
| Q4 | Order Priority Checking | EXISTS subquery, GROUP BY, COUNT | None |
| Q5 | Local Supplier Volume | 6-table join, GROUP BY, SUM | None |
| Q6 | Forecasting Revenue | Single-table SUM, WHERE filters | None |
| Q7 | Volume Shipping | 6-table join, CASE WHEN, SUM | None |
| Q8 | National Market Share | 8-table join, CASE WHEN, subquery in FROM | None |
| Q9 | Product Type Profit | 6-table join, expressions, LIKE | None |
| Q10 | Returned Item Reporting | 4-table join, GROUP BY, SUM | TopK |
| Q11 | Important Stock ID | HAVING with scalar subquery, 3-table join | None |
| Q12 | Shipping Modes | SUM(CASE WHEN), IN, BETWEEN | None |
| Q13 | Customer Distribution | LEFT OUTER JOIN, nested GROUP BY, subquery in FROM | None |
| Q14 | Promotion Effect | Conditional SUM ratio | None |
| Q15 | Top Supplier | View → inline as CTE; MAX subquery | Inline view |
| Q16 | Parts/Supplier | COUNT(DISTINCT), NOT IN subquery, NOT LIKE | None |
| Q17 | Small-Quantity Revenue | Scalar subquery (AVG), 2-table join | None |
| Q18 | Large Volume Customer | IN subquery with HAVING, 3-table join | TopK |
| Q19 | Discounted Revenue | Complex OR/AND WHERE, SUM | None |
| Q20 | Potential Promotion | Nested IN subqueries (2 levels) | None |
| Q21 | Suppliers Waiting | EXISTS + NOT EXISTS, multi-join | TopK |
| Q22 | Global Sales Opportunity | NOT EXISTS, scalar subquery, SUBSTRING | None |

### SQL Feature Coverage by TPC-H

| SQL Feature | TPC-H Queries Using It | pg_trickle Support |
|-------------|------------------------|--------------------|
| INNER JOIN | Q2,Q3,Q5,Q7–Q12,Q15,Q17–Q21 | ✅ Full |
| LEFT OUTER JOIN | Q13 | ✅ Full |
| Multi-table join (3+) | Q2,Q3,Q5,Q7–Q11,Q18,Q20,Q21 | ✅ Full |
| GROUP BY | Q1,Q3–Q5,Q7–Q13,Q16,Q18,Q22 | ✅ Full |
| SUM | Q1,Q3,Q5–Q10,Q12,Q14,Q17–Q19 | ✅ Algebraic |
| AVG | Q1,Q17,Q22 | ✅ Algebraic |
| COUNT / COUNT(DISTINCT) | Q1,Q4,Q13,Q16,Q21 | ✅ Algebraic / ref-counted |
| MIN / MAX | Q2, Q15 | ✅ Semi-algebraic |
| HAVING | Q11,Q16,Q18 | ✅ Full |
| CASE WHEN | Q7,Q8,Q12,Q14,Q22 | ✅ Full |
| EXISTS / NOT EXISTS | Q4,Q21,Q22 | ✅ Semi-join / anti-join |
| IN / NOT IN subquery | Q12,Q16,Q18,Q20 | ✅ Semi-join / anti-join |
| Scalar subquery | Q2,Q11,Q15,Q17,Q22 | ✅ Full |
| Subquery in FROM | Q8,Q13,Q15,Q22 | ✅ Full |
| BETWEEN | Q1,Q3–Q6,Q12,Q15,Q20 | ✅ Full |
| LIKE / NOT LIKE | Q9,Q13,Q16 | ✅ Full |
| LIMIT (TopK) | Q2,Q3,Q10,Q18,Q21 | ✅ Supported (ORDER BY + LIMIT) |

### Refresh Functions

| Function | Description | CDC Path |
|----------|-------------|----------|
| RF1 | Bulk INSERT into `orders` + `lineitem` | Triggers capture → differential refresh |
| RF2 | Bulk DELETE from `orders` + `lineitem` by key | Triggers capture → differential refresh |

Both RF1 and RF2 exercise pg_trickle's core pipeline. No special handling is needed — INSERTs and DELETEs are the native change types that CDC captures.

### Proposed Test Plan

**Phase T1-A: Correctness (Scale Factor 1, ~1 GB)**

1. Load TPC-H SF-1 data using the pure-SQL generator.
2. Create 22 stream tables (one per query, adapted per the compatibility matrix above).
3. Perform an initial FULL refresh for each.
4. Run RF1 (inserts) + RF2 (deletes) — one cycle.
5. Refresh all stream tables in DIFFERENTIAL mode.
6. Compare every stream table's contents against a fresh FULL refresh (see the sketch below).
7. **Pass criterion:** zero row differences across all 22 queries.
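A minimal sketch of the comparison in step 6, assuming hypothetical table names: `st_q1` is the differentially refreshed stream table for Q1, and `st_q1_full` is a scratch table produced by a fresh FULL refresh of the same defining query.

```sql
-- Symmetric difference of the two snapshots. EXCEPT ALL preserves duplicate
-- rows, so multiplicity mismatches are caught as well as missing/extra rows.
-- Pass criterion: this returns zero rows for each of the 22 queries.
(TABLE st_q1      EXCEPT ALL TABLE st_q1_full)
UNION ALL
(TABLE st_q1_full EXCEPT ALL TABLE st_q1);
```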
**Phase T1-B: Performance (Scale Factors 1, 10, 100)**

1. For each scale factor:
   a. Load data, create stream tables, run the initial FULL refresh.
   b. Run 10 RF1+RF2 cycles with DIFFERENTIAL refresh.
   c. Run 10 RF1+RF2 cycles with FULL refresh.
   d. Record wall-clock time per refresh per query.
2. **Output:** Speedup ratio table (FULL / DIFFERENTIAL) × query × scale factor.
3. Integrate with the existing `e2e_bench_tests.rs` framework.

**Phase T1-C: Sustained Churn**

1. At SF-10, run 100 RF1+RF2 cycles with DIFFERENTIAL refresh.
2. After every 10th cycle, verify correctness against a FULL refresh.
3. Record cumulative drift, memory usage, and change-buffer table sizes.
4. **Pass criterion:** zero cumulative drift; change buffers stay bounded.

### Files

| File | Purpose |
|------|---------|
| `tests/tpch/schema.sql` | TPC-H DDL (8 tables with PKs) |
| `tests/tpch/queries/` | 22 `.sql` files (adapted per the compatibility matrix) |
| `tests/tpch/load.sh` | Pure-SQL (`generate_series`) loader for a given SF |
| `tests/e2e_tpch_tests.rs` | Correctness + benchmark test harness |

---

## Suite 2: sqllogictest — Broad Correctness

### Why sqllogictest

Originally developed for SQLite, sqllogictest contains **millions** of SQL statements with expected results. CockroachDB, DuckDB, and DataFusion all maintain forks. Its value is **breadth** — it covers SQL patterns no developer would think to write manually. For pg_trickle, the key question is: does a stream table in DIFFERENTIAL mode produce the same result as FULL mode for every query?

**URL:** https://www.sqlite.org/sqllogictest/
**Rust runner:** https://github.com/risinglightdb/sqllogictest-rs

### Approach

We do not need to run the full sqllogictest corpus verbatim. Instead, we use it as a **query generator** (one iteration is sketched after this list):

1. **Filter** the corpus for queries that use pg_trickle-supported features (joins, aggregates, subqueries, etc.) and exclude unsupported ones (LIMIT, GROUPING SETS, etc.).
2. For each qualifying query Q over tables T₁…Tₙ:
   a. Create the base tables and populate them with seed data.
   b. Create stream table ST with defining query Q in DIFFERENTIAL mode.
   c. Initial FULL refresh → record contents as `expected_full`.
   d. Apply a small random change set to T₁…Tₙ.
   e. DIFFERENTIAL refresh → record contents as `actual_diff`.
   f. FULL refresh (from scratch) → record contents as `expected_after`.
   g. **Assert:** `actual_diff == expected_after`.
3. **Pass criterion:** zero mismatches across the filtered corpus.
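A sketch of one iteration for a hypothetical single-table corpus query. The base table, seed data, and change-set shape are illustrative, and the stream-table steps (b–c, e–g) are left as comments because their exact DDL and refresh calls are pg_trickle-specific.

```sql
-- (a) Illustrative base table with seed data.
CREATE TABLE t1 (a int PRIMARY KEY, b int);
INSERT INTO t1 SELECT g, g % 7 FROM generate_series(1, 1000) AS g;

-- (b)-(c) Create stream table ST over Q in DIFFERENTIAL mode, then run the
--         initial FULL refresh (pg_trickle-specific DDL elided here).

-- (d) A small random change set: deletes, updates, and inserts.
DELETE FROM t1 WHERE a IN (SELECT a FROM t1 ORDER BY random() LIMIT 10);
UPDATE t1 SET b = b + 1 WHERE a % 97 = 0;
INSERT INTO t1 SELECT g, g % 5 FROM generate_series(1001, 1010) AS g;

-- (e)-(g) DIFFERENTIAL refresh, then a from-scratch FULL refresh, then the
--         same EXCEPT ALL symmetric-difference check shown for Suite 1.
```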
### Scope

Focus on the `test/index/random/` and `test/select/` directories, which contain the highest density of relevant analytical queries. Skip:

- Tests that are SQLite-specific (type affinity, etc.)
- Tests requiring LIMIT/OFFSET
- Tests with non-deterministic functions

### Estimated Query Count

After filtering: ~5,000–15,000 qualifying queries. Each adds ~100 ms of test time at small scale. Budget: 10–30 minutes for a full correctness sweep.

### Files

| File | Purpose |
|------|---------|
| `tests/sqllogictest/filter.py` | Filter corpus → qualifying queries |
| `tests/sqllogictest/runner.rs` | Custom harness: create ST → mutate → compare |
| `tests/e2e_slt_tests.rs` | Top-level test entry point |

---

## Suite 3: Join Order Benchmark (JOB) — Real-World Data

### Why JOB

The Join Order Benchmark uses the **real IMDB dataset** (~3.6 GB) with 113 analytically complex queries, some joining 10+ tables. Its value is testing delta correctness under **realistic data skew** — foreign-key distributions that are far from uniform, unlike synthetic TPC-H data.

**Paper:** Leis et al., "How Good Are Query Optimizers, Really?" (PVLDB 2015)
**URL:** https://github.com/gregrahn/join-order-benchmark

### Query Compatibility

JOB queries use INNER JOINs, WHERE clauses with equality/range/LIKE predicates, and minimal aggregation. Most require only ORDER BY / LIMIT removal. None use GROUPING SETS, recursive CTEs, or window functions. Estimated compatibility: **110+/113 queries** with trivial modifications.

### Proposed Test Plan

1. Load the IMDB data (CSV import).
2. Create stream tables for the 113 queries.
3. Apply targeted mutations (INSERT/UPDATE/DELETE on high-skew tables such as `cast_info` and `movie_info`).
4. DIFFERENTIAL refresh → compare against FULL refresh.
5. **Focus metric:** correctness under skewed join cardinalities.

### Files

| File | Purpose |
|------|---------|
| `tests/job/schema.sql` | IMDB DDL (21 tables with PKs) |
| `tests/job/queries/` | 113 `.sql` files (adapted) |
| `tests/e2e_job_tests.rs` | Correctness test harness |

---

## Suite 4: Feldera / DBSP Benchmarks — Theory Peer Comparison

### Why Feldera

Since pg_trickle is grounded in DBSP theory, comparing it against Feldera (the reference DBSP implementation) on the same workloads provides a meaningful performance baseline. Feldera maintains benchmarks for Nexmark (streaming auction events) and TPC-H variants.

**URL:** https://github.com/feldera/feldera/tree/main/benchmark

### Approach

This is **not** an apples-to-apples comparison (Feldera is a standalone streaming engine; pg_trickle runs inside PostgreSQL). The goal is to understand the **overhead of the PostgreSQL execution model** relative to a dedicated dataflow runtime, and to identify queries where pg_trickle's delta SQL generation is suboptimal.

### Nexmark Benchmark

Nexmark models an online auction system with 3 event types (Person, Auction, Bid) and 22 queries. Several Nexmark queries use window functions and time-based grouping. The subset relevant to pg_trickle:

| Query | Features | Compatible? |
|-------|----------|-------------|
| Q0 | Passthrough (no-op) | ✅ |
| Q1 | Projection + arithmetic | ✅ |
| Q2 | Filter | ✅ |
| Q3 | Join + filter | ✅ |
| Q4 | Join + GROUP BY + AVG + MAX | ✅ |
| Q5 | Window: COUNT per time window | ✅ (window function) |
| Q6 | Join + AVG + window | ✅ |
| Q7 | MAX in time window | ✅ |
| Q8 | Join (person–auction within window) | ✅ |
| Q9–Q22 | Various combinations | Most compatible |

### Proposed Test Plan

1. Adapt the Nexmark data generator to populate PostgreSQL tables (see the sketch below).
2. Create stream tables for the compatible Nexmark queries.
3. Measure:
   a. Throughput (events/sec processed per refresh cycle).
   b. Latency (time from event commit until the change is visible in the stream table).
   c. Comparison against published Feldera numbers on the same hardware class.
4. Publish the results in `docs/BENCHMARK.md` as a separate section.
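As a starting point, `tests/nexmark/generator.sql` could populate the Bid stream with `generate_series`, in the same pure-SQL style as the TPC-H generator. The column layout below follows the common Nexmark convention (`auction`, `bidder`, `price`, `date_time`), but the exact schema, distributions, and row count are assumptions, not the reference generator's.

```sql
-- Hypothetical Bid table; schema is an assumption, not the reference DDL.
CREATE TABLE bid (
    auction   bigint      NOT NULL,
    bidder    bigint      NOT NULL,
    price     numeric     NOT NULL,
    date_time timestamptz NOT NULL
);

-- 100k synthetic bid events: uniformly random ids and prices, event times
-- spread backward from now() one second apart.
INSERT INTO bid (auction, bidder, price, date_time)
SELECT (random() * 1000)::bigint,
       (random() * 10000)::bigint,
       round((random() * 100)::numeric, 2),
       now() - make_interval(secs => g)
FROM generate_series(1, 100000) AS g;
```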
### Files

| File | Purpose |
|------|---------|
| `tests/nexmark/schema.sql` | Nexmark DDL (3 tables) |
| `tests/nexmark/generator.sql` | Data generation via `generate_series` |
| `tests/nexmark/queries/` | Adapted Nexmark queries |
| `tests/e2e_nexmark_tests.rs` | Benchmark harness |

---

## Implementation Roadmap

| Phase | Suite | Effort | Priority | Value |
|-------|-------|--------|----------|-------|
| **T1** | TPC-H correctness (SF-1) | 2–3 days | **P0** | Industry-standard validation; catches delta bugs in joins + aggregates |
| **T2** | TPC-H performance (SF-1/10/100) | 2–3 days | **P1** | Quantified speedup numbers for README and marketing |
| **T3** | sqllogictest (filtered corpus) | 3–5 days | **P1** | Maximum breadth; supports a confident correctness claim |
| **T4** | JOB correctness | 2 days | **P2** | Real-world data skew validation |
| **T5** | Feldera/Nexmark comparison | 3–5 days | **P2** | Competitive positioning; identifies optimization gaps |

**Recommended start:** T1 → T2 → T3. These three phases provide the highest signal-to-effort ratio and produce publishable benchmark numbers.

---

## Infrastructure Requirements

### Docker

All test suites run inside the existing E2E Testcontainers infrastructure. TPC-H at SF-100 requires ~100 GB of disk and ~16 GB of RAM; run it on CI with a `large` runner or locally.

### Data Generation

| Suite | Tool | Data Size |
|-------|------|-----------|
| TPC-H SF-1 | Pure-SQL generator (`generate_series`) | ~1 GB |
| TPC-H SF-10 | Pure-SQL generator | ~10 GB |
| TPC-H SF-100 | Pure-SQL generator | ~100 GB |
| sqllogictest | Corpus download (SQLite repo) | ~50 MB |
| JOB / IMDB | CSV download from ftp.fu-berlin.de | ~3.6 GB |
| Nexmark | SQL `generate_series` | Configurable |

### CI Integration

- **T1 (correctness):** Add to CI as a weekly scheduled job (SF-1, ~10 min).
- **T2 (performance):** Nightly job on dedicated hardware; results tracked in a benchmark history file.
- **T3 (sqllogictest):** Weekly; flag regressions via exit code.
- **T4/T5:** Manual or monthly; results published to `docs/BENCHMARK.md`.

### Test Tagging

All external-suite tests should be marked `#[ignore]` (like the existing benchmarks) and gated behind a `just` target:

```bash
just test-tpch      # TPC-H correctness (SF-1)
just bench-tpch     # TPC-H performance (SF-1/10/100)
just test-slt       # sqllogictest filtered corpus
just test-job       # Join Order Benchmark
just bench-nexmark  # Nexmark comparison benchmarks
```