# Full E2E Test Suite — Deep Evaluation Report

> **Date:** 2025-03-16
> **Scope:** 18 full-E2E-only test files (222 tests, ~11,000 lines) requiring
>            the custom Docker image with the compiled extension
> **Goal:** Assess coverage confidence and identify mitigations to harden the suite

---

## Implementation Status

> **Updated:** 2026-03-17
> **Branch:** `test-evals-full-e2e`

### Completed Mitigations

| Priority | Item | Status | Files Changed |
|----------|------|--------|---------------|
| P0-1 | WAL CDC data capture multiset assertions | ✅ Done | `e2e_wal_cdc_tests.rs` |
| P0-2 | Partition tests multiset assertions | ✅ Done | `e2e_partition_tests.rs` |
| P0-3 | DDL event post-reinit data assertions | ✅ Done | `e2e_ddl_event_tests.rs` |
| P0-4 | Circular ST convergence data assertions | ✅ Done | `e2e_circular_tests.rs` |
| P1-1 | Fix RLS superuser bypass in test | ✅ Done | `e2e_rls_tests.rs` |
| P1-2 | Add multiset to append-only fallback tests | ✅ Done | `e2e_append_only_tests.rs` |
| P1-3 | Add multiset to cascade regression tests 3 and 6 | ✅ Done | `e2e_cascade_regression_tests.rs` |
| P1-4 | Add multiset to bootstrap gating refresh tests 12 and 17 | ✅ Done | `e2e_bootstrap_gating_tests.rs` |
| P2-1 | Benchmark smoke assertions | ✅ Done | `e2e_bench_tests.rs` |
| P2-2 | Add multiset after ALTER QUERY | ✅ Done | `e2e_alter_query_tests.rs` |
| P2-3 | Upgrade survival multiset | ✅ Done | `e2e_upgrade_tests.rs` |
| P2-4 | Non-convergence guaranteed divergence | ✅ Done | `e2e_circular_tests.rs` |
| P3-1 | Cascade ad-hoc to multiset | ✅ Done | `e2e_cascade_regression_tests.rs` |
| P3-2 | DELETE/UPDATE in bootstrap gating | ✅ Done | `e2e_bootstrap_gating_tests.rs` |
| P3-3 | Standardize bgworker multiset | ✅ Done | `e2e_bgworker_tests.rs` |

#### P0-1 Details (WAL CDC)
Added `assert_st_matches_query` to four tests:
- `test_wal_cdc_captures_insert` — verifies all inserted rows decoded correctly
- `test_wal_cdc_captures_update` — verifies update reflected via WAL pipeline
- `test_wal_cdc_captures_delete` — verifies only kept rows remain
- `test_wal_fallback_on_missing_slot` — verifies no data loss after fallback

#### P0-2 Details (Partitions)
Added `assert_st_matches_query` to six tests:
- `test_partition_range_full_refresh` — row-level correctness for RANGE + FULL
- `test_partition_range_differential_refresh` — correctness after I/U/D across partitions
- `test_partition_list_source` — aggregated result correctness for LIST partition
- `test_partition_hash_source` — no row loss/corruption for HASH partition
- `test_partition_with_aggregation` — full GROUP BY result over both partitions
- `test_partition_differential_with_aggregation` — GROUP BY result after cross-partition INSERT

#### P0-3 Details (DDL Events)
Added post-reinit data assertions to five tests:
- `test_function_change_marks_st_for_reinit` — refreshes after replacement, verifies new function body applies
- `test_add_column_on_source_st_still_functional` — multiset after ADD COLUMN refresh
- `test_add_column_unused_st_survives_refresh` — multiset verifies unused column excluded
- `test_drop_unused_column_st_survives` — multiset after DROP COLUMN refresh
- `test_alter_column_type_triggers_reinit` — refreshes after type change, verifies correct data

#### P0-4 Details (Circular)
Added to `test_circular_monotone_cycle_converges`:
- Row count assertion: ≥6 pairs for transitive closure of 3-node chain
- Existence assertion: pair `(1,4)` must exist — requires 2+ fixpoint iterations

#### P1-1 Details (RLS)
Fixed `test_rls_on_stream_table_filters_reads`:
- Uses `db.pool.begin()` + `SET LOCAL ROLE rls_reader` in a transaction
- Asserts `count = 2` (only tenant_id=10 rows visible) as restricted role
- Existing superuser assertion `count = 4` retained

#### P1-2 Details (Append-Only)
Added `assert_st_matches_query` to three tests:
- `test_append_only_fallback_on_delete` — verifies row absent after DELETE + MERGE fallback
- `test_append_only_fallback_on_update` — verifies no stale old-value rows remain
- `test_alter_enable_append_only` — verifies correct data after INSERT via append-only path

#### P1-3 Details (Cascade Regression)
Added `assert_st_matches_query` to two tests:
- `test_st_on_st_cascade_propagates_delete` — compares `order_report` against its defining query post-DELETE
- `test_three_layer_cascade_insert_propagates` — compares `big_categories` against `category_flags WHERE is_big = true` post-INSERT

#### P1-4 Details (Bootstrap Gating)
Added `assert_st_matches_query` to two tests:
- `test_manual_refresh_works_through_full_lifecycle` — verifies all 3 rows correct after full gate/ungate/re-gate cycle
- `test_manual_refresh_not_blocked_by_gate` — verifies both rows correct after gated manual refresh

### Remaining Work

| Priority | Item | Status |
|----------|------|--------|
| P2-1 | Add smoke correctness check to benchmarks (32 tests) | Not started |
| P2-2 | Add ALTER QUERY + DML cycle tests | Not started |
| P2-3 | Add upgrade chain data validation | Not started |
| P2-4 | Add non-convergence test with guaranteed divergence | Not started |
| P3-1 | Consolidate cascade value checks to multiset | Not started |
| P3-2 | Add DELETE/UPDATE to bootstrap gating tests | Not started |
| P3-3 | Standardise bgworker test assertions | Not started |

---

## Table of Contents

1. [Implementation Status](#implementation-status)
2. [Executive Summary](#executive-summary)
3. [Test Infrastructure](#test-infrastructure)
4. [Per-File Analysis](#per-file-analysis)
5. [Cross-Cutting Findings](#cross-cutting-findings)
6. [Priority Mitigations](#priority-mitigations)
7. [Appendix: Coverage Matrix](#appendix-coverage-matrix)
5. [Priority Mitigations](#priority-mitigations)
6. [Appendix: Coverage Matrix](#appendix-coverage-matrix)

---

## Executive Summary

The full E2E test suite consists of **222 test functions** across 18 files
(~11,000 lines). These tests require the custom Docker image built from
`tests/Dockerfile.e2e` with the compiled extension, background worker,
`shared_preload_libraries`, and GUC support. They run via `just test-e2e`
(CI: push to main + daily schedule + manual dispatch; **skipped on PRs**).

**Confidence level: MODERATE (≈65%)**

### Strength Distribution

| Verdict | Files | Tests | % of Total |
|---------|-------|-------|-----------|
| STRONG | 4 | 40 | 18% |
| ADEQUATE | 9 | 122 | 55% |
| WEAK | 5 | 60 | 27% |

### Files Using `assert_st_matches_query` (Multiset Comparison)

| File | Calls | Tests w/ Multiset |
|------|-------|-------------------|
| `e2e_differential_gaps_tests` | 39 | 13/13 (100%) |
| `e2e_multi_cycle_tests` | 21 | 6/9 (67%) |
| `e2e_guc_variation_tests` | 10 | 8/13 (62%) |
| `e2e_dag_autorefresh_tests` | 8 | 4/5 (80%) |
| `e2e_bgworker_tests` | 2 | 2/9 (22%) |
| `e2e_user_trigger_tests` | 2 | 2/11 (18%) |
| `e2e_alter_query_tests` | 1 | 1/15 (7%) |
| `e2e_upgrade_tests` | 1 | 1/14 (7%) |
| **8 files with ZERO** | 0 | 0/138 (0%) |
| **TOTAL** | **84** | **37/222 (17%)** |

**83% of full-E2E tests do NOT use multiset comparison for data correctness.**

### Strengths

| Area | Assessment |
|------|-----------|
| UDA + nested OR differential gaps | **Exceptional** — 13/13 tests with multiset, full DML cycles |
| Multi-cycle cumulative correctness | **Strong** — 5+ DML cycles with multiset at each checkpoint |
| DAG autorefresh cascades | **Strong** — 3-4 layer topologies with multiset at all layers |
| GUC variation correctness | **Strong** — 8 GUC configurations validated with multiset |
| DDL event detection | **Good** — 14 tests covering ADD/DROP/ALTER column, function changes, RENAME |
| Bootstrap gating lifecycle | **Good** — 18 tests covering full gate → ungate → re-gate cycle |

### Weaknesses

| Severity | Finding | Impact |
|----------|---------|--------|
| **CRITICAL** | 10 files (138 tests) have ZERO multiset comparison | Data corruption undetectable in partition, RLS, WAL CDC, circular, DDL event, append-only, bootstrap gating, cascade regression, bench, and ergonomics tests |
| **HIGH** | Partition tests rely on `db.count()` only | All 5 partition types (RANGE/LIST/HASH + aggregation) unverified for row correctness |
| **HIGH** | WAL CDC data capture tests use count only | WAL INSERT/UPDATE/DELETE correctness never verified at row level |
| **HIGH** | Circular ST data correctness never verified | Cycle convergence could produce wrong data; only metadata (scc_id, status) checked |
| **MEDIUM** | Cascade regression tests miss multiset on 3-layer chains | Test 6 (3-layer) only counts; tests 2, 7 use partial data checks |
| **MEDIUM** | Benchmark tests (32) have zero correctness assertions | Performance measured on potentially incorrect results |
| **MEDIUM** | RLS tests don't verify row-level filtering | Test 3 runs as superuser (bypasses RLS); no restricted-user query |
| **LOW** | Ergonomics tests are metadata-only | By design — API contract tests, not data tests |

---

## Test Infrastructure

### Full E2E Docker Image

**Docker image:** Built from `tests/Dockerfile.e2e`, includes:
- PostgreSQL 18.x with the compiled `pg_trickle` extension
- `shared_preload_libraries = 'pg_trickle'` configured
- Background worker active
- All GUCs available

**Test harness:** `tests/e2e/mod.rs` provides `TestDb` with:
- `create_st()` / `refresh_st()` / `drop_st()` — extension function wrappers
- `assert_st_matches_query(st_name, query)` — EXCEPT-based multiset comparison
  that auto-discovers columns, handles json→text casts, and filters internal
  `__pgt_*` columns. Supports EXCEPT/INTERSECT set-operation visibility filters.
- `wait_for_scheduler()` — polls until background worker completes a refresh
- Full `sqlx::PgPool` access for arbitrary SQL

### Why These Tests Need the Full Image

These 18 files test capabilities that require the compiled extension binary:
- Background worker / scheduler (bgworker, dag_autorefresh)
- GUC variables (guc_variation, bootstrap_gating)
- DDL event triggers (ddl_event)
- WAL-based CDC with logical replication (wal_cdc)
- Extension upgrade paths (upgrade)
- Row-level security interaction (rls)
- Partition ATTACH/DETACH triggers (partition)
- Circular dependency / SCC detection (circular)
- Append-only optimization (append_only)
- User-defined trigger interaction (user_trigger)
- CDC benchmarks (bench)

---

## Per-File Analysis

### 1. `e2e_alter_query_tests.rs` — 578 lines, 15 tests

**Purpose:** Validates ALTER QUERY operations (changing a stream table's
defining query in-place).

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_alter_query_same_schema` | Same-schema query change with WHERE clause | ✅ **STRONG** — `assert_st_matches_query` |
| `test_alter_query_same_schema_differential` | ALTER on DIFFERENTIAL mode ST | ⚠️ Count only |
| `test_alter_query_add_column` | Adding a column to the query | ⚠️ Spot-checks one value |
| `test_alter_query_remove_column` | Removing a column | ⚠️ Column existence only |
| `test_alter_query_type_change_compatible` | INT → BIGINT type change | ⚠️ Status + count |
| `test_alter_query_type_change_incompatible` | INT → TEXT triggers rebuild | ⚠️ OID changed, count only |
| `test_alter_query_change_sources` | Change to different source tables | ⚠️ Dependency count only |
| `test_alter_query_remove_source` | Remove a source dependency | ⚠️ Dependency check |
| `test_alter_query_pgt_count_transition` | Flat → aggregate query transition | ⚠️ Count only |
| `test_alter_query_with_mode_change` | Simultaneous query + mode change | ⚠️ Status + count |
| `test_alter_query_invalid_query` | Invalid query rejected | ✅ Error path |
| `test_alter_query_cycle_detection` | Cyclic deps rejected | ✅ Error path |
| `test_alter_query_view_inlining` | Views inlined in catalog | ⚠️ Catalog check |
| `test_alter_query_oid_stable_same_schema` | OID preserved for same-schema ALTER | ✅ OID comparison |
| `test_alter_query_catalog_updated` | Catalog query updated | ✅ Query text comparison |

**Verdict: ADEQUATE**

**Gaps:**
- Only 1/15 tests uses multiset comparison
- After ALTER to aggregate/join queries, data correctness not verified
- No ALTER + DML cycle (INSERT → ALTER → refresh → verify)

---

### 2. `e2e_append_only_tests.rs` — 342 lines, 10 tests

**Purpose:** Validates the append-only optimization (INSERT-only fast path)
and fallback to MERGE on UPDATE/DELETE.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_append_only_basic_insert_path` | Flag set, row count correct | ⚠️ Count only |
| `test_append_only_data_correctness` | Multi-cycle correctness | ⚠️ SUM aggregate only |
| `test_append_only_fallback_on_delete` | DELETE triggers fallback to MERGE | ⚠️ Flag check + count |
| `test_append_only_fallback_on_update` | UPDATE triggers fallback | ⚠️ Spot-checks one value |
| `test_alter_enable_append_only` | ALTER to enable append_only | ⚠️ Flag + count |
| `test_append_only_rejected_for_full_mode` | FULL mode rejects append_only | ✅ Error validation |
| `test_append_only_rejected_for_immediate_mode` | IMMEDIATE mode rejects | ✅ Error validation |
| `test_append_only_rejected_for_keyless_source` | Keyless table rejects | ✅ Error validation |
| `test_alter_append_only_rejected_for_full_mode` | ALTER rejects on FULL | ✅ Error validation |
| `test_append_only_no_data_cycle` | No-data cycle is idempotent | ⚠️ Count only |

**Verdict: ADEQUATE**

**Key gap:** Zero multiset comparisons. After fallback from append-only to
MERGE, data correctness should be verified with `assert_st_matches_query`.
Test 2 uses SUM for basic verification but can't detect wrong individual rows.

---

### 3. `e2e_bench_tests.rs` — 2,156 lines, 32 tests (all `#[ignore]`)

**Purpose:** Performance benchmarks measuring refresh latency across query
types (scan, filter, aggregate, join, window, lateral, CTE, UNION), sizes
(10K–100K rows), and change rates (1%–50%).

All 32 tests are `#[ignore]`-gated and timer-based. They measure TPS, p50/p99
latency, and overhead percentages.

| Test Category | Count | Assertion Type |
|--------------|-------|---------------|
| Scan benchmarks | 9 | ⚠️ Timing only |
| Filter/aggregate/join/window benchmarks | 12 | ⚠️ Timing only |
| No-data refresh latency | 1 | ⚠️ avg < 10ms target |
| Index overhead | 1 | ⚠️ Overhead % |
| CDC trigger overhead | 2 | ⚠️ Timing comparison |
| Statement vs row CDC | 2 | ⚠️ Timing comparison |
| Concurrent writers | 1 | ⚠️ Throughput |
| Full matrix sweeps | 4 | ⚠️ Timing aggregation |

**Verdict: WEAK (by design — benchmarks, not correctness tests)**

**Gap:** No data correctness assertions anywhere. Row counts are logged but
never asserted. If a DVM bug causes incorrect results, benchmarks will still
report normal timing.

**Recommendation:** Add a smoke-test assertion at the end of each benchmark
variant: after the final cycle, call `assert_st_matches_query` once. This
adds negligible overhead to the benchmark but catches correctness regressions.

---

### 4. `e2e_bgworker_tests.rs` — 570 lines, 9 tests

**Purpose:** Validates the background worker / scheduler: extension loading,
GUC registration, auto-refresh, differential mode, history records, catalog
metadata updates.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_extension_loads_with_shared_preload` | Extension present in pg_extension | ✅ Setup validation |
| `test_gucs_registered` | 8 GUC defaults correct | ✅ 8 SHOW comparisons |
| `test_gucs_can_be_altered` | GUCs changeable via ALTER SYSTEM | ✅ 5 ALTER + SHOW |
| `test_auto_refresh_within_schedule` | Scheduler fires within threshold | ⚠️ Count only |
| `test_auto_refresh_differential_mode` | Differential auto-refresh correct | ✅ **STRONG** — `assert_st_matches_query` |
| `test_scheduler_writes_refresh_history` | History records created | ⚠️ History count |
| `test_auto_refresh_differential_with_cdc` | CDC + differential auto-refresh | ✅ **STRONG** — `assert_st_matches_query` |
| `test_scheduler_refreshes_multiple_healthy_sts` | Multiple STs refreshed in one tick | ⚠️ Count checks |
| `test_auto_refresh_updates_catalog_metadata` | Timestamps and error counts updated | ⚠️ Metadata checks |

**Verdict: ADEQUATE**

**Strengths:** Tests 5 and 7 use multiset comparison for real correctness.
GUC validation thorough.

**Gaps:** Tests 4 and 8 (auto-refresh count, multiple STs) should use multiset.

---

### 5. `e2e_bootstrap_gating_tests.rs` — 637 lines, 18 tests

**Purpose:** Validates the bootstrap gating feature (source gates that block
scheduler refreshes during initial data loads).

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_gate_source_inserts_gate_record` | Gate record created | ⚠️ Metadata |
| `test_source_gates_returns_gated_source` | Function returns gated source | ⚠️ Metadata |
| `test_ungate_source_clears_gate` | Ungate sets gated=false | ⚠️ Metadata |
| `test_gate_source_is_idempotent` | Double-gate produces one record | ⚠️ Count |
| `test_regate_after_ungate` | Re-gate after ungate works | ⚠️ Metadata |
| `test_gate_source_nonexistent_table_errors` | Nonexistent table → error | ✅ Error path |
| `test_source_gates_empty_by_default` | No gates initially | ⚠️ Count |
| `test_multiple_sources_gated` | Multiple sources can be gated | ⚠️ Count |
| `test_idempotent_gate_refreshes_timestamp` | Double-gate refreshes gated_at | ⚠️ Timestamp |
| `test_idempotent_gate_preserves_state` | Double-gate preserves state | ⚠️ Metadata |
| `test_regate_lifecycle_clears_ungated_at` | Re-gate clears ungated_at | ⚠️ Metadata |
| `test_manual_refresh_works_through_full_lifecycle` | Manual refresh through gate cycle | ⚠️ Count (1→2→3) |
| `test_bootstrap_gate_status_returns_expected_columns` | Status function columns | ⚠️ Column check |
| `test_bootstrap_gate_status_ungated_duration` | Duration for ungated sources | ⚠️ Metadata |
| `test_bootstrap_gate_status_affected_stream_tables` | Affected STs listed | ⚠️ String contains |
| `test_bootstrap_gate_status_empty_by_default` | No gate status initially | ⚠️ Count |
| `test_manual_refresh_not_blocked_by_gate` | Manual refresh bypasses gates | ⚠️ Count |
| `test_scheduler_logs_skip_when_source_gated` | Scheduler SKIPs gated sources | ✅ History action/status |

**Verdict: ADEQUATE**

**Gaps:** Zero multiset comparisons. Tests 12 and 17 (manual refresh) should
verify data content, not just count increments.

---

### 6. `e2e_cascade_regression_tests.rs` — 796 lines, 8 tests

**Purpose:** Regression tests for ST-on-ST cascade behavior: propagation of
INSERT/UPDATE/DELETE through chained stream tables, zero-row refresh timestamp
stability, and correct dependency type tracking.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_cdc_triggers_not_counted_as_user_triggers` | CDC trigger exclusion in detection query | ✅ Before/after logic |
| `test_st_on_st_cascade_propagates_insert` | INSERT cascades through ST chain | ✅ Value comparison (300→450) |
| `test_st_on_st_cascade_propagates_delete` | DELETE cascades through ST chain | ⚠️ EXISTS check only |
| `test_zero_row_differential_preserves_data_timestamp` | 0-row refresh doesn't bump timestamp | ✅ **STRONG** — timestamp equality regression |
| `test_no_spurious_cascade_after_noop_upstream_refresh` | No-op upstream doesn't cascade | ✅ **STRONG** — timestamp stability |
| `test_three_layer_cascade_insert_propagates` | 3-layer INSERT cascade | ⚠️ Count only |
| `test_three_layer_cascade_update_propagates` | 3-layer UPDATE cascade | ✅ Category value comparison |
| `test_st_on_st_dependency_is_stream_table_type` | Dependency recorded as STREAM_TABLE | ✅ Type string comparison |

**Verdict: ADEQUATE to STRONG**

**Strengths:** Tests 2, 4, 5, 7 have genuine data validation (value comparisons,
timestamp equality). Regression-focused.

**Gaps:**
- Zero use of `assert_st_matches_query` — tests do ad-hoc data checks
- Test 3 (DELETE cascade) only checks EXISTS, not full data
- Test 6 (3-layer INSERT) only checks count

---

### 7. `e2e_circular_tests.rs` — 562 lines, 6 tests

**Purpose:** Validates circular/cyclic stream table dependencies using SCC
(strongly connected component) detection, monotonicity checks, convergence,
and drop cleanup.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_circular_monotone_cycle_converges` | Monotone cycle creation + SCC ID | ⚠️ Metadata only |
| `test_circular_nonmonotone_cycle_rejected` | Non-monotone cycle rejected | ✅ Error message |
| `test_circular_convergence_records_iterations` | Iteration count recorded | ⚠️ iterations ≥ 1 (loose) |
| `test_circular_nonconvergence_error_status` | Max iterations → ERROR | ⚠️ Status check (timing-sensitive) |
| `test_circular_drop_member_clears_scc_id` | Drop member clears SCC IDs | ⚠️ Metadata |
| `test_circular_default_rejects_cycles` | allow_circular=false rejects | ✅ Error message |

**Verdict: WEAK**

**Critical gap:** Zero multiset comparisons. All 6 tests validate only metadata
(scc_id, status, iteration count) — none verify that the cyclic stream tables
actually contain correct data after convergence. A cycle that converges to the
wrong fixed point would pass all tests.

---

### 8. `e2e_dag_autorefresh_tests.rs` — 449 lines, 5 tests

**Purpose:** Validates automatic scheduler-driven refresh through multi-layer
DAG topologies.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_autorefresh_3_layer_cascade` | 3-layer cascade auto-refresh | ✅ **STRONG** — `assert_st_matches_query` at all 3 layers |
| `test_autorefresh_diamond_cascade` | Diamond topology auto-refresh | ✅ **STRONG** — multiset on L2 |
| `test_autorefresh_calculated_schedule` | CALCULATED schedule triggers | ✅ **STRONG** — multiset after L1 refresh |
| `test_autorefresh_no_spurious_3_layer` | No spurious cascades on no-op | ✅ Timestamp stability |
| `test_autorefresh_staggered_schedules` | Staggered schedules converge | ✅ **STRONG** — multiset at all 3 layers |

**Verdict: STRONG**

**Exemplary file.** 4/5 tests use `assert_st_matches_query` for full multiset
comparison at every layer of the DAG. Test 4 (no-spurious) appropriately uses
timestamp stability rather than data comparison.

---

### 9. `e2e_ddl_event_tests.rs` — 608 lines, 14 tests

**Purpose:** Validates DDL event trigger reactions: what happens to stream
tables when source tables are altered (ADD/DROP/ALTER column, RENAME, DROP
table, function changes, index creation).

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_drop_source_fires_event_trigger` | DROP source → ST error/cleanup | ⚠️ Status/count |
| `test_alter_source_fires_event_trigger` | ALTER source → ST remains | ⚠️ Count only |
| `test_drop_st_storage_by_sql` | DROP storage → catalog cleanup | ⚠️ Count only |
| `test_rename_source_table` | RENAME source → refresh fails | ✅ Error path |
| `test_function_change_marks_st_for_reinit` | Function change → needs_reinit | ⚠️ Flag check |
| `test_drop_function_marks_st_for_reinit` | DROP function → needs_reinit | ⚠️ Flag check |
| `test_add_column_on_source_st_still_functional` | ADD column (unused) → ST OK | ⚠️ Count only |
| `test_add_column_unused_st_survives_refresh` | ADD + UPDATE → ST refreshes | ⚠️ Count + spot value |
| `test_drop_unused_column_st_survives` | DROP column (unused) → ST OK | ⚠️ Status + count |
| `test_alter_column_type_triggers_reinit` | ALTER TYPE → needs_reinit | ⚠️ Flag check |
| `test_create_index_on_source_is_benign` | CREATE INDEX → no reinit | ⚠️ Flag + count |
| `test_drop_source_with_multiple_downstream_sts` | DROP with 2+ downstream STs | ⚠️ Status checks |
| `test_block_source_ddl_guc_prevents_alter` | block_source_ddl=on blocks ALTER | ✅ Error + DML works |
| `test_add_column_on_joined_source_st_survives` | ADD column on joined source | ⚠️ Status + count |

**Verdict: WEAK**

**Critical gap:** Zero multiset comparisons across all 14 tests. After DDL
changes (ADD/DROP/ALTER column, function replacement), stream table data is
never verified. Tests confirm metadata flags (needs_reinit, status) but not
whether the data is correct after the DDL-triggered reinit/refresh.

---

### 10. `e2e_differential_gaps_tests.rs` — 526 lines, 13 tests

**Purpose:** Validates DVM differential refresh for features that previously
had gaps: user-defined aggregates (UDAs) and nested OR with EXISTS sublinks.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_uda_simple_differential` | UDA INSERT/DELETE/UPDATE cycles | ✅ **STRONG** — multiset after each DML |
| `test_uda_combined_with_builtin` | UDA + COUNT/SUM together | ✅ **STRONG** — multiset |
| `test_uda_auto_mode_resolves_to_differential` | AUTO mode resolves correctly | ✅ **STRONG** — mode + multiset |
| `test_uda_multiple_in_same_query` | Multiple UDAs in one query | ✅ **STRONG** — multiset |
| `test_nested_or_two_exists` | OR with 2 EXISTS sublinks | ✅ **STRONG** — multiset after each DML |
| `test_nested_or_mixed_and_or_under_or` | OR(a OR (b AND EXISTS)) | ✅ **STRONG** — multiset |
| `test_nested_or_cdc_cycle` | Complex OR+EXISTS + full CDC cycle | ✅ **STRONG** — multiset after I/U/D |
| `test_nested_or_demorgan_not_and` | De Morgan NOT(AND+sublink) | ✅ **STRONG** — multiset after I/U/D |
| `test_nested_or_demorgan_and_prefix` | AND prefix + NOT(AND+sublink) | ✅ **STRONG** — multiset |
| `test_uda_with_filter_clause` | UDA with FILTER(WHERE ...) | ✅ **STRONG** — multiset |
| `test_uda_with_order_by_in_agg` | UDA with ORDER BY in aggregate | ✅ **STRONG** — multiset |
| `test_uda_schema_qualified` | Schema-qualified UDA | ✅ **STRONG** — multiset |
| `test_uda_insert_delete_update_full_cycle` | Full lifecycle: I→U→D→revival | ✅ **STRONG** — multiset after each of 6 ops |

**Verdict: STRONG — EXEMPLARY**

**All 13 tests** use `assert_st_matches_query` for full multiset comparison.
Full DML cycles (INSERT, UPDATE, DELETE) with verification at each step. This
is the gold standard for the test suite.

---

### 11. `e2e_guc_variation_tests.rs` — 430 lines, 13 tests

**Purpose:** Validates that non-default GUC configurations produce correct
results.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_guc_prepared_statements_off` | prepared_statements=OFF | ✅ **STRONG** — multiset |
| `test_guc_merge_planner_hints_off` | merge_planner_hints=OFF | ✅ **STRONG** — multiset |
| `test_guc_cleanup_use_truncate_off` | cleanup_use_truncate=OFF | ✅ **STRONG** — multiset |
| `test_guc_merge_work_mem_mb_custom` | merge_work_mem_mb=16 | ✅ **STRONG** — multiset |
| `test_guc_block_source_ddl_on` | block_source_ddl=ON prevents DDL | ✅ **STRONG** — error + multiset |
| `test_guc_differential_max_change_ratio_zero` | max_change_ratio=0.0 | ✅ **STRONG** — mode + multiset |
| `test_guc_combined_non_default` | Multiple GUCs at once | ✅ **STRONG** — multiset |
| `test_guc_max_grouping_set_branches_rejects_over_limit` | CUBE limit exceeded | ✅ Error validation |
| `test_guc_max_grouping_set_branches_allows_within_limit` | CUBE within limit | ⚠️ Creation only |
| `test_guc_max_grouping_set_branches_raised_allows_large_cube` | Raised CUBE limit | ⚠️ Creation only |
| `test_guc_foreign_table_polling_off_rejects_differential` | Foreign table polling rejected | ✅ Error validation |
| `test_guc_foreign_table_polling_full_mode_no_guc_needed` | Foreign table FULL mode | ⚠️ Creation only |
| `test_guc_foreign_table_polling_on_allows_differential` | Foreign table polling enabled | ✅ **STRONG** — multiset after I/D |

**Verdict: STRONG**

**8/13 tests** use multiset comparison. The 5 without it are boundary/error
tests where creation success/failure is the primary assertion. Minor gap:
CUBE limit tests only verify creation, not query result correctness.

---

### 12. `e2e_multi_cycle_tests.rs` — 534 lines, 9 tests

**Purpose:** Validates cumulative correctness across multiple refresh cycles
with different DML operations and cache behaviors.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_multi_cycle_aggregate_differential` | 5 cycles: I→U→D→mixed→no-op | ✅ **STRONG** — multiset after each |
| `test_multi_cycle_join_differential` | 4 JOIN cycles with left/right DML | ✅ **STRONG** — multiset after each |
| `test_multi_cycle_window_differential` | 5 INSERT + 2 DELETE cycles | ✅ **STRONG** — multiset after each |
| `test_multi_cycle_prepared_statement_cache` | 7 cycles, cache survives | ✅ **STRONG** — multiset after each |
| `test_prepared_statements_cleared_after_cache_invalidation` | Cache invalidated on ALTER | ⚠️ Scalar total + cache count |
| `test_multi_cycle_group_elimination_revival` | Group elimination + revival | ✅ **STRONG** — multiset after each |
| `test_ec16_function_body_change_marks_reinit` | Function change → reinit + correct data | ✅ Explicit sum validation (60→70→108) |
| `test_ec16_function_change_full_refresh_recovery` | Function change recovery | ✅ Explicit sum validation (215→836) |
| `test_ec16_no_functions_unaffected` | Unchanged STs unaffected | ⚠️ Flag + count |

**Verdict: STRONG**

**6/9 tests** use multiset comparison with multi-step DML cycles. The EC-16
tests use explicit sum validation which is adequate for verifying new function
logic is applied.

---

### 13. `e2e_partition_tests.rs` — 554 lines, 9 tests

**Purpose:** Validates stream tables built on partitioned source tables
(RANGE, LIST, HASH) and on foreign tables via postgres_fdw.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_partition_range_full_refresh` | RANGE partition + FULL | ⚠️ Count only |
| `test_partition_range_differential_refresh` | RANGE + INSERT/UPDATE/DELETE cycle | ⚠️ Count checks |
| `test_partition_list_source` | LIST partition | ⚠️ Count only |
| `test_partition_hash_source` | HASH partition | ⚠️ Count only |
| `test_partition_attach_triggers_reinit` | ATTACH → needs_reinit | ⚠️ Flag + count |
| `test_partition_detach_triggers_reinit` | DETACH → needs_reinit | ⚠️ Flag + count |
| `test_foreign_table_full_refresh_works` | Foreign table via postgres_fdw | ⚠️ Count only |
| `test_partition_with_aggregation` | Partitioned + GROUP BY | ⚠️ Scalar sum |
| `test_partition_differential_with_aggregation` | Partitioned + GROUP BY + INSERT | ⚠️ Scalar sum |

**Verdict: WEAK**

**Zero multiset comparisons.** All 9 tests rely on `db.count()` or scalar
aggregate checks. Test 2 has a full INSERT/UPDATE/DELETE cycle but never
verifies the actual row content.

---

### 14. `e2e_phase4_ergonomics_tests.rs` — 577 lines, 20 tests

**Purpose:** Validates API ergonomics: manual refresh history, quick_health
view, `create_if_not_exists()`, schedule defaults, removed GUCs, ALTER warnings.

| Test Group | Count | What It Validates | Assertion Quality |
|-----------|-------|-------------------|-------------------|
| ERG-D (refresh history) | 3 | `initiated_by='MANUAL'`, status/end_time | ⚠️ Metadata |
| ERG-E (quick_health) | 3 | View returns correct status | ⚠️ Metadata |
| COR-2 (create_if_not_exists) | 3 | Idempotent creation | ⚠️ Count/status |
| ERG-T1 (schedule defaults) | 5 | 'calculated' default, NULL rejection | ✅ Error + metadata |
| ERG-T2 (removed GUCs) | 2 | Old GUCs properly missing | ✅ Error validation |
| ERG-T3 (ALTER warnings) | 4 | Warnings emitted on mode/query changes | ⚠️ Notice text |

**Verdict: ADEQUATE (by design — API contract tests, not data tests)**

These tests are appropriately metadata-focused. They test the API surface,
not data correctness. No multiset comparison needed.

---

### 15. `e2e_rls_tests.rs` — 453 lines, 9 tests

**Purpose:** Validates Row-Level Security interaction with stream tables:
RLS on source, RLS on ST, change buffer security, trigger SECURITY DEFINER,
and DDL event detection for RLS changes.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_rls_on_source_does_not_filter_stream_table` | RLS on source → ST sees all rows | ⚠️ Count only |
| `test_rls_on_source_differential_mode` | RLS + DIFFERENTIAL + INSERT cycle | ⚠️ Count only |
| `test_rls_on_stream_table_filters_reads` | RLS policy on ST (superuser) | ⚠️ Count only |
| `test_rls_on_stream_table_immediate_mode` | IMMEDIATE + RLS on ST | ⚠️ Count only |
| `test_change_buffer_rls_disabled` | relrowsecurity=false on buffer | ⚠️ Boolean check |
| `test_ivm_trigger_functions_security_definer` | Triggers are SECURITY DEFINER | ⚠️ Boolean + search_path |
| `test_enable_rls_on_source_triggers_reinit` | ENABLE RLS → needs_reinit | ⚠️ Flag check |
| `test_disable_rls_on_source_triggers_reinit` | DISABLE RLS → needs_reinit | ⚠️ Flag check |
| `test_force_rls_on_source_triggers_reinit` | FORCE RLS → needs_reinit | ⚠️ Flag check |

**Verdict: WEAK**

**Zero multiset comparisons.** All tests use count or flag assertions.

**Significant gap:** Test 3 (`test_rls_on_stream_table_filters_reads`) claims
to test RLS filtering but runs as superuser, who bypasses RLS by default.
The test should query as a restricted role to verify that RLS actually filters
rows.

---

### 16. `e2e_upgrade_tests.rs` — 871 lines, 14 tests (7 active, 7 `#[ignore]`)

**Purpose:** Validates extension upgrade paths: schema stability, round-trip
(DROP + CREATE), version consistency, and upgrade chain survival.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_upgrade_catalog_schema_stability` | 31 expected columns present | ✅ **STRONG** — column list |
| `test_upgrade_catalog_indexes_present` | Expected indexes exist | ⚠️ EXISTS checks |
| `test_upgrade_drop_recreate_roundtrip` | DROP CASCADE + CREATE round-trip | ✅ **STRONG** — `assert_st_matches_query` |
| `test_upgrade_extension_version_consistency` | Version matches | ✅ String comparison |
| `test_upgrade_dependencies_schema_stability` | Dependencies schema stable | ⚠️ Column list |
| `test_upgrade_event_triggers_installed` | Event triggers exist | ⚠️ EXISTS |
| `test_upgrade_monitoring_views_present` | Views queryable | ⚠️ Queryability |
| `test_upgrade_chain_new_functions_exist` | (#[ignore]) Functions callable | ⚠️ Existence |
| `test_upgrade_chain_stream_tables_survive` | (#[ignore]) STs survive upgrade | ⚠️ Count only |
| `test_upgrade_chain_views_queryable` | (#[ignore]) Views work post-upgrade | ⚠️ Queryability |
| `test_upgrade_chain_event_triggers_present` | (#[ignore]) Triggers exist | ⚠️ EXISTS |
| `test_upgrade_chain_version_consistency` | (#[ignore]) Version correct | ⚠️ String |
| `test_upgrade_chain_function_parity_with_fresh_install` | (#[ignore]) Function count matches | ⚠️ Count |
| `test_upgrade_schema_additions_from_sql` | All SQL scripts parsed + verified | ✅ **STRONG** — regex-based |

**Verdict: ADEQUATE**

**Strength:** Test 3 (round-trip) uses `assert_st_matches_query`. Test 14
(SQL script verification) is comprehensive.

**Gap:** The 7 `#[ignore]` upgrade chain tests only use count/existence — none
verify data correctness post-upgrade.

---

### 17. `e2e_user_trigger_tests.rs` — 649 lines, 11 tests

**Purpose:** Validates user-defined trigger interaction with stream table
refresh: audit triggers, GUC control, BEFORE trigger modification, and
MERGE vs explicit DML path selection.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_explicit_dml_insert` | Audit on INSERT: NEW captured | ⚠️ Audit field-level |
| `test_explicit_dml_update` | Audit on UPDATE: OLD/NEW captured | ⚠️ Audit field-level |
| `test_explicit_dml_delete` | Audit on DELETE: OLD captured | ⚠️ Audit field-level |
| `test_explicit_dml_no_op_skip` | IS DISTINCT FROM prevents no-op trigger | ⚠️ Count check |
| `test_no_trigger_uses_merge` | No triggers → MERGE path + correct data | ✅ **STRONG** — `assert_st_matches_query` |
| `test_trigger_audit_trail` | Mixed I/U/D + audit + data correctness | ✅ **STRONG** — multiset + audit counts |
| `test_guc_off_suppresses_triggers` | GUC 'off' → audit empty | ⚠️ Audit emptiness |
| `test_guc_auto_detects_triggers` | GUC 'auto' → triggers fire | ⚠️ Audit count |
| `test_guc_on_alias_detects_triggers` | Deprecated 'on' alias works | ⚠️ Audit count |
| `test_full_refresh_suppresses_triggers` | FULL refresh → no row triggers | ⚠️ Audit emptiness |
| `test_before_trigger_modifies_new` | BEFORE trigger modifies NEW value | ⚠️ Scalar value |

**Verdict: ADEQUATE to STRONG**

**Tests 5 and 6** use multiset comparison — test 6 is especially good, combining
audit trail validation with data correctness.

---

### 18. `e2e_wal_cdc_tests.rs` — 729 lines, 17 tests

**Purpose:** Validates WAL-based CDC (logical replication): mode transitions,
INSERT/UPDATE/DELETE capture, fallback to triggers, cleanup on DROP, keyless
table handling, and health checks.

| Test | What It Validates | Assertion Quality |
|------|-------------------|-------------------|
| `test_wal_auto_is_default_cdc_mode` | Default GUC = 'auto' | ⚠️ String |
| `test_wal_level_is_logical` | Container has wal_level=logical | ⚠️ String |
| `test_explicit_wal_override_transitions_even_with_global_trigger` | Force WAL despite trigger GUC | ⚠️ Mode check |
| `test_explicit_trigger_override_blocks_wal_transition` | Force TRIGGER prevents WAL | ⚠️ Mode check |
| `test_wal_transition_lifecycle` | TRIGGER→TRANSITIONING→WAL + slot/pub | ⚠️ Mode + infrastructure |
| `test_wal_cdc_captures_insert` | INSERT captured via WAL | ⚠️ Count only |
| `test_wal_cdc_captures_update` | UPDATE captured via WAL | ⚠️ Count + scalar |
| `test_wal_cdc_captures_delete` | DELETE captured via WAL | ⚠️ Count only |
| `test_trigger_mode_no_wal_transition` | cdc_mode='trigger' stays trigger | ⚠️ Mode check |
| `test_wal_fallback_on_missing_slot` | Slot dropped → fallback + data survives | ⚠️ Mode + count |
| `test_wal_cleanup_on_drop` | DROP ST → slot + pub cleaned | ⚠️ Infrastructure |
| `test_wal_keyless_table_stays_on_triggers` | Keyless → stays trigger | ⚠️ Mode check |
| `test_ec18_check_cdc_health_shows_trigger_for_stuck_auto` | EC-18: keyless auto → TRIGGER | ⚠️ Health check |
| `test_ec18_health_check_ok_with_trigger_auto_sources` | EC-18: no errors for trigger auto | ⚠️ Count |
| `test_ec34_check_cdc_health_detects_missing_slot` | EC-34: missing slot alert + fallback | ⚠️ Alert + mode + count |
| `test_ec19_wal_keyless_without_replica_identity_full_rejected` | Keyless + no RIF rejected | ✅ Error validation |
| `test_ec19_wal_keyless_with_replica_identity_full_accepted` | Keyless + RIF accepted | ⚠️ Mode check |

**Verdict: ADEQUATE for CDC mode transitions, WEAK for WAL data correctness**

**Critical gap:** Zero multiset comparisons. Tests 6–8 (INSERT/UPDATE/DELETE
via WAL CDC) only verify count or scalar values — they never verify the actual
captured data matches the source. A WAL decoding bug that produces wrong
column values would pass all tests.

---

## Cross-Cutting Findings

### Finding 1: Multiset Comparison Usage is Bimodal

The suite splits sharply into two camps:

**Files with strong multiset coverage (≥60%):**
- `e2e_differential_gaps_tests` — 13/13 (100%)
- `e2e_dag_autorefresh_tests` — 4/5 (80%)
- `e2e_multi_cycle_tests` — 6/9 (67%)
- `e2e_guc_variation_tests` — 8/13 (62%)

**Files with weak/no multiset coverage (≤22%):**
- `e2e_ddl_event_tests` — 0/14 (0%)
- `e2e_circular_tests` — 0/6 (0%)
- `e2e_partition_tests` — 0/9 (0%)
- `e2e_rls_tests` — 0/9 (0%)
- `e2e_wal_cdc_tests` — 0/17 (0%)
- `e2e_append_only_tests` — 0/10 (0%)
- `e2e_bootstrap_gating_tests` — 0/18 (0%)
- `e2e_bench_tests` — 0/32 (0%)
- `e2e_cascade_regression_tests` — 0/8 (0%) (though uses ad-hoc value checks)
- `e2e_bgworker_tests` — 2/9 (22%)

This suggests the multiset pattern was adopted partway through development.
Files written earlier or focused on infrastructure tend to lack it.

### Finding 2: Count-Only Tests Create False Confidence

62 tests use `db.count()` as their primary data assertion. This catches:
- ✅ Missing rows (count too low)
- ✅ Duplicate rows (count too high)

But cannot catch:
- ❌ Wrong column values
- ❌ Wrong row composition (right count, wrong data)
- ❌ NULL corruption
- ❌ Type coercion bugs

For example, a partition test that verifies `count = 3` would pass even if all
three rows have incorrect values derived from the wrong partition.

### Finding 3: WAL CDC Data Path is Unvalidated

The 17 WAL CDC tests thoroughly validate mode transitions (TRIGGER → WAL),
infrastructure (slots, publications), and fallback behavior. But the actual
data path — whether WAL-decoded INSERTs/UPDATEs/DELETEs produce correct
stream table content — is verified with counts only.

This is a significant blind spot because WAL decoding involves complex binary
parsing of the replication stream, and a subtle bug could produce wrong values
that pass all count assertions.

### Finding 4: DDL Event Tests Missing Post-Reinit Validation

When a DDL change (ALTER COLUMN TYPE, function replacement, RLS change) marks
a stream table as `needs_reinit`, the tests verify:
- ✅ The `needs_reinit` flag is set
- ⚠️ The reinit can execute (sometimes)
- ❌ The data after reinit is correct (never)

This means the DDL detection works, but whether the recovery path produces
correct data is untested at the full E2E level.

### Finding 5: RLS Test Has a Superuser Bypass Flaw

`test_rls_on_stream_table_filters_reads` intends to verify that RLS filters
rows when querying a stream table. However, it appears to run queries as the
superuser, who bypasses RLS by default. The test should:
1. Create a restricted role
2. Enable RLS on the stream table
3. Query as the restricted role
4. Verify filtered results

### Finding 6: Benchmark Tests as Silent Correctness Regression Vector

The 32 benchmark tests (`#[ignore]`) exercise all major query types (scan,
filter, aggregate, join, window, lateral, CTE, UNION) with real DML cycles
and multi-cycle refreshes. Yet none assert data correctness. These tests are
actually exercising the most complex code paths in the DVM engine — adding a
single `assert_st_matches_query` call at the end of each benchmark would be
extremely high-value with negligible performance impact.

---

## Priority Mitigations

### P0 — Critical (Data Integrity Gaps)

#### P0-1: Add Multiset Comparison to WAL CDC Data Tests

Tests 6–8 (`captures_insert`, `captures_update`, `captures_delete`) should
verify data correctness after WAL-captured changes:

```rust
// Current (WEAK):
let count: i64 = db.count("wal_st").await;
assert_eq!(count, 3);

// Proposed (STRONG):
db.assert_st_matches_query("wal_st", "SELECT id, val FROM wal_source").await;
```

Also add multiset to test 10 (fallback) and test 15 (EC-34 missing slot).

**Impact:** 5 tests converted from weak to strong. Validates the entire WAL
decoding → change buffer → differential refresh pipeline.

#### P0-2: Add Multiset to Partition Tests

All non-foreign-table tests should use `assert_st_matches_query`:

```rust
// For each partition type (RANGE, LIST, HASH):
db.assert_st_matches_query("part_st", "SELECT id, val FROM part_source").await;

// For aggregation tests:
db.assert_st_matches_query("part_agg_st",
    "SELECT region, SUM(amount) FROM part_sales GROUP BY region"
).await;
```

**Impact:** 7 tests converted. Validates partition pruning doesn't corrupt
results.

#### P0-3: Add Multiset to DDL Event Post-Reinit Tests

After setting `needs_reinit` and triggering reinit, verify data:

```rust
// After function change + reinit:
db.refresh_st("fn_st").await; // triggers reinit
db.assert_st_matches_query("fn_st", "SELECT id, my_func(val) FROM source").await;

// After ALTER COLUMN TYPE + reinit:
db.refresh_st("col_st").await;
db.assert_st_matches_query("col_st", "SELECT id, val::new_type FROM source").await;
```

**Impact:** 4–6 tests improved. Validates that DDL recovery produces correct data.

#### P0-4: Add Data Verification to Circular ST Tests

After cycle convergence, verify actual data content:

```rust
db.assert_st_matches_query("cyc_a",
    "SELECT DISTINCT src, dst FROM expected_transitive_closure"
).await;
```

**Impact:** 2 tests improved. Validates convergence correctness, not just
convergence detection.

### P1 — High (Coverage Hardening)

#### P1-1: Fix RLS Superuser Bypass in Test

Add a restricted role and query as that role:

```rust
db.execute("CREATE ROLE rls_reader").await;
db.execute("GRANT SELECT ON rls_st TO rls_reader").await;
db.execute("SET ROLE rls_reader").await;
let count: i64 = db.count("rls_st").await;
assert_eq!(count, expected_filtered_count);
db.execute("RESET ROLE").await;
```

**Impact:** Validates actual RLS filtering, not just that RLS is enabled.

#### P1-2: Add Multiset to Append-Only Fallback Tests

After fallback from append-only to MERGE:

```rust
db.assert_st_matches_query("ao_st", "SELECT id, val FROM ao_source").await;
```

**Impact:** 3 tests improved. Validates fallback produces correct data.

#### P1-3: Add Multiset to Cascade Regression Tests

Tests 3 and 6 (DELETE cascade, 3-layer INSERT) should use multiset:

```rust
// 3-layer cascade:
db.assert_st_matches_query("l3_st",
    "SELECT id, val * 2 + 10 FROM base_source"
).await;
```

**Impact:** 2 tests improved.

#### P1-4: Add Multiset to Bootstrap Gating Refresh Tests

Tests 12 and 17 (manual refresh through gate lifecycle):

```rust
db.assert_st_matches_query("gated_st", "SELECT id, val FROM gated_source").await;
```

**Impact:** 2 tests improved.

### P2 — Medium (Completeness)

#### P2-1: Add Smoke Correctness Check to Benchmarks

At the end of each benchmark variant, add one `assert_st_matches_query`:

```rust
// After final benchmark cycle:
db.assert_st_matches_query(&st_name, &defining_query).await;
```

This adds ~50ms per benchmark but catches DVM correctness regressions during
performance testing.

**Impact:** 32 tests gain correctness assertion. Extremely high value.

#### P2-2: Add ALTER QUERY + DML Cycle Tests

`e2e_alter_query_tests` needs tests that:
1. Create ST, populate with data
2. ALTER QUERY to join/aggregate
3. Refresh
4. Verify with `assert_st_matches_query`

Currently, ALTER tests verify schema changes succeed but not data correctness
for complex query transformations.

#### P2-3: Add Upgrade Chain Data Validation

The 7 `#[ignore]` upgrade chain tests should add `assert_st_matches_query`
after verifying STs survive the upgrade:

```rust
// After upgrade:
db.assert_st_matches_query("pre_upgrade_st",
    "SELECT id, val FROM pre_upgrade_source"
).await;
```

#### P2-4: Add Non-Convergence Test with Guaranteed Divergence

`test_circular_nonconvergence_error_status` should use DML that guarantees
divergence (e.g., monotonically increasing counts) rather than relying on
timing.

### P3 — Low (Polish)

#### P3-1: Consolidate Cascade Value Checks to Multiset

`e2e_cascade_regression_tests` uses ad-hoc value comparisons (amount "450",
categories ["X", "Y"]). Replace with `assert_st_matches_query` for consistency
with the rest of the suite.

#### P3-2: Add DELETE/UPDATE to Bootstrap Gating Tests

Current gating tests only INSERT. Add UPDATE and DELETE during the gate →
ungate → re-gate lifecycle.

#### P3-3: Standardize bgworker Test Assertions

Tests 4 and 8 (auto-refresh within schedule, multiple STs) use count only.
Add multiset comparison for consistency.

---

## Appendix: Coverage Matrix

### Full E2E Files: Summary Table

| File | Lines | Tests | Multiset Calls | Multiset % | DML Cycle? | Verdict |
|------|-------|-------|---------------|------------|-----------|---------|
| `e2e_differential_gaps_tests` | 526 | 13 | 39 | 100% | ✅ Full I/U/D | **STRONG** |
| `e2e_dag_autorefresh_tests` | 449 | 5 | 8 | 80% | ✅ Insert cycle | **STRONG** |
| `e2e_multi_cycle_tests` | 534 | 9 | 21 | 67% | ✅ Full I/U/D | **STRONG** |
| `e2e_guc_variation_tests` | 430 | 13 | 10 | 62% | ✅ Insert/delete | **STRONG** |
| `e2e_cascade_regression_tests` | 796 | 8 | 0 | 0%* | ✅ I/U/D | **ADEQUATE** |
| `e2e_bgworker_tests` | 570 | 9 | 2 | 22% | ✅ Insert | **ADEQUATE** |
| `e2e_user_trigger_tests` | 649 | 11 | 2 | 18% | ✅ Full I/U/D | **ADEQUATE** |
| `e2e_alter_query_tests` | 578 | 15 | 1 | 7% | ⚠️ Limited | **ADEQUATE** |
| `e2e_upgrade_tests` | 871 | 14 | 1 | 7% | ⚠️ Round-trip | **ADEQUATE** |
| `e2e_bootstrap_gating_tests` | 637 | 18 | 0 | 0% | ⚠️ Insert only | **ADEQUATE** |
| `e2e_phase4_ergonomics_tests` | 577 | 20 | 0 | N/A | ❌ Metadata | **ADEQUATE** |
| `e2e_append_only_tests` | 342 | 10 | 0 | 0% | ⚠️ Insert + fallback | **ADEQUATE** |
| `e2e_ddl_event_tests` | 608 | 14 | 0 | 0% | ⚠️ DDL only | **WEAK** |
| `e2e_wal_cdc_tests` | 729 | 17 | 0 | 0% | ⚠️ Single DML | **WEAK** |
| `e2e_partition_tests` | 554 | 9 | 0 | 0% | ⚠️ Limited I/U/D | **WEAK** |
| `e2e_circular_tests` | 562 | 6 | 0 | 0% | ❌ No DML verify | **WEAK** |
| `e2e_rls_tests` | 453 | 9 | 0 | 0% | ⚠️ Insert only | **WEAK** |
| `e2e_bench_tests` | 2,156 | 32 | 0 | 0% | ✅ Multi-cycle | **WEAK** |
| **TOTAL** | **~11,021** | **222** | **84** | **17%** | — | — |

\* `e2e_cascade_regression_tests` uses ad-hoc value checks instead of `assert_st_matches_query`.

### Assertion Type Distribution

| Assertion Type | Test Count | % |
|---------------|-----------|---|
| `assert_st_matches_query` (multiset) | 37 | 17% |
| Explicit value comparison | 12 | 5% |
| Error path validation | 22 | 10% |
| Metadata / flag / status | 68 | 31% |
| Count only (`db.count()`) | 62 | 28% |
| Timing / benchmark | 32 | 14% |
| **Total** | **222** | — |

### Feature Coverage by Test File

| Feature | Test File(s) | Coverage Level |
|---------|-------------|---------------|
| Differential refresh (core) | differential_gaps, multi_cycle | ✅ Strong |
| DAG cascade + autorefresh | dag_autorefresh | ✅ Strong |
| GUC configurability | guc_variation | ✅ Strong |
| ALTER QUERY operations | alter_query | ⚠️ Adequate |
| Background worker / scheduler | bgworker | ⚠️ Adequate |
| Bootstrap gating | bootstrap_gating | ⚠️ Adequate |
| User-defined triggers | user_trigger | ⚠️ Adequate |
| Extension upgrade paths | upgrade | ⚠️ Adequate |
| ST-on-ST cascades | cascade_regression | ⚠️ Adequate |
| Append-only optimization | append_only | ⚠️ Adequate |
| API ergonomics | phase4_ergonomics | ⚠️ Adequate (metadata) |
| WAL-based CDC | wal_cdc | ❌ Weak (data path) |
| Partitioned tables | partition | ❌ Weak |
| DDL event reactions | ddl_event | ❌ Weak (post-reinit) |
| Circular dependencies | circular | ❌ Weak |
| Row-Level Security | rls | ❌ Weak |
| Performance benchmarks | bench | ❌ Weak (no correctness) |