# v0.72.0 Full Details — Frontier Durability & Catalog Correctness

> **Summary:** [v0.72.0.md](v0.72.0.md)
> **Assessment source:** [plans/PLAN_OVERALL_ASSESSMENT_14.md](../plans/PLAN_OVERALL_ASSESSMENT_14.md)

---

## Motivation

Assessment 14 identified a cluster of correctness findings where the code
*appeared* to implement a safety guarantee but the guarantee was either wired
incorrectly, had no call site, or relied on invalid SQL. None of these were
latent design flaws requiring a redesign — they were wiring gaps. v0.72.0
closes all four.

The theme of this release is: **if a path exists in the code it must actually
work, and if it must work it must be tested**.

---

## Detailed Implementation

### COR-001 / REL-001 / ARCH-001: DUR-1 Tentative-Frontier Removal

**Problem:**

The catalog layer contained a "two-phase tentative-frontier" design (DUR-1):

1. `prepare_frontier(pgt_id, frontier)` — write proposed frontier to
   `tentative_frontier` column before MERGE.
2. `finalize_frontier_and_complete_refresh(pgt_id, rows)` (DUR-1) — promote
   `tentative_frontier → frontier` after MERGE commits.
3. `reconcile_tentative_frontiers(change_schema)` — startup recovery scan for
   stale tentative frontiers.

All three functions had **zero production call sites**. The recovery query in
`reconcile_tentative_frontiers` contained a string-concatenation fragment
(`FROM {schema}.changes_ || s.pgt_relid::text`) that would always fail at
runtime — it is not valid relation-name syntax.
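
The fragment fails because a relation name in `FROM` is parsed as an identifier, not evaluated as an expression, so `||` cannot build a table name per row. A query that targets a per-source change table has to assemble the name before the SQL text reaches the parser. The following is a minimal sketch of that idea, not code from the extension: `quote_ident`, `change_table_name`, and `count_changes_sql` are hypothetical helpers, and only the `changes_<relid>` naming comes from the fragment above.

```rust
/// Quotes an identifier by wrapping it in double quotes and doubling any
/// embedded double quote. Unlike PostgreSQL's `quote_ident`, this sketch
/// always quotes, which is valid but more verbose.
pub fn quote_ident(ident: &str) -> String {
    format!("\"{}\"", ident.replace('"', "\"\""))
}

/// Builds the schema-qualified name of a per-source change table.
/// Hypothetical helper; the `changes_<relid>` pattern is an assumption
/// based on the removed recovery query.
pub fn change_table_name(schema: &str, relid: u32) -> String {
    let table = format!("changes_{}", relid);
    format!("{}.{}", quote_ident(schema), quote_ident(&table))
}

/// Example statement: the relation name is resolved in Rust, so the SQL
/// parser only ever sees a plain identifier after FROM.
pub fn count_changes_sql(schema: &str, relid: u32) -> String {
    format!("SELECT count(*) FROM {}", change_table_name(schema, relid))
}
```

A recovery scan written this way would issue one statement per source table rather than a single set-based query.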

The scheduler hot path used the single-phase `store_frontier` but logged and
discarded any error:

```rust
// ❌ BEFORE (6 locations in scheduler/mod.rs)
if let Err(e) = StreamTableMeta::store_frontier(id, &frontier) {
    log!("Could not store frontier: {}", e);
}
```

Consequently, even if DUR-1 had been wired in, the actual call site provided no
durability guarantee: a frontier-store failure would log a message and commit
the data change without advancing the frontier, so the same changes would be
processed again on the next tick.

**Fix:**

- Removed `prepare_frontier`, `finalize_frontier_and_complete_refresh` (DUR-1
  variant), and `reconcile_tentative_frontiers` from `src/catalog.rs`.
- All six scheduler refresh paths in `src/scheduler/mod.rs` converted to:
  ```rust
  // ✅ AFTER
  StreamTableMeta::store_frontier(pgt_id, &frontier)?;
  ```
  A frontier-store failure now aborts the outer transaction, rolling back the
  data change. The stream table is retried on the next scheduler tick.
- The `tentative_frontier` column is retained in the schema (removing a catalog
  column requires a new migration) but is never written.
- See [ADR-004](../plans/adrs/ADR-004.md) for the architectural rationale.
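
The difference between the two patterns can be illustrated without a database. The sketch below is an in-memory model under stated assumptions: `State`, `in_transaction`, and the `fail` flag are hypothetical stand-ins for the real transaction and catalog write, and only the log-only versus `?` contrast mirrors the change described above.

```rust
/// Hypothetical stand-in for the extension's error type.
#[derive(Debug, PartialEq)]
pub struct StoreError(pub String);

/// In-memory model of what one refresh transaction touches.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct State {
    pub rows_applied: u64,
    pub frontier: u64,
}

/// Runs `body` on a copy of `state` and commits the copy only on `Ok`.
/// An `Err` leaves `state` untouched, mimicking a transaction rollback.
pub fn in_transaction<F>(state: &mut State, body: F) -> Result<(), StoreError>
where
    F: FnOnce(&mut State) -> Result<(), StoreError>,
{
    let mut working = state.clone();
    body(&mut working)?;
    *state = working;
    Ok(())
}

/// Mock frontier store; `fail` simulates a catalog write error.
pub fn store_frontier(state: &mut State, frontier: u64, fail: bool) -> Result<(), StoreError> {
    if fail {
        return Err(StoreError("simulated catalog write error".to_string()));
    }
    state.frontier = frontier;
    Ok(())
}

/// Old pattern: the error is logged, and the data change commits anyway.
pub fn refresh_log_only(state: &mut State, frontier: u64, fail: bool) -> Result<(), StoreError> {
    in_transaction(state, |s| {
        s.rows_applied += 10;
        if let Err(e) = store_frontier(s, frontier, fail) {
            eprintln!("Could not store frontier: {:?}", e);
        }
        Ok(())
    })
}

/// New pattern: `?` propagates the error, so the data change rolls back too.
pub fn refresh_propagating(state: &mut State, frontier: u64, fail: bool) -> Result<(), StoreError> {
    in_transaction(state, |s| {
        s.rows_applied += 10;
        store_frontier(s, frontier, fail)?;
        Ok(())
    })
}
```

With `fail = true`, `refresh_log_only` leaves ten rows applied and the frontier at its old value, which is the double-processing state. `refresh_propagating` leaves the state unchanged and returns the error.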

**Files changed:**
- `src/catalog.rs` — removed 3 DUR-1 functions; added CODE-001 comment block
- `src/scheduler/mod.rs` — 6 log-only patterns → `?` propagation

**Tests:** TEST-003 unit tests added to `src/catalog.rs`: a compile-time guard
for the absence of the DUR-1 functions, a frontier JSON roundtrip, and
`is_empty()` checks.

---

### COR-002 / API-001: Outbox `stream_table_oid` OID Correctness

**Problem:**

`pgtrickle.pgt_outbox_config.stream_table_oid` was populated with
`pgt_id as u32` cast to `pg_sys::Oid`. `pgt_id` is a sequential integer
primary key of `pgt_stream_tables` — not an OID present in `pg_class`.
Any consumer that resolved the OID via `pg_class` would either find the wrong
table or find nothing at all.

The read paths (`is_outbox_enabled`, `get_outbox_table_name`,
`get_embedding_vector_column`) used `WHERE stream_table_oid = $1` with
`pgt_id` as the parameter — the queries worked by accident only because the
write also stored `pgt_id` in that column.

**Fix:**

All six write/read paths in `src/api/outbox.rs`:

| Function | Before | After |
|----------|--------|-------|
| `attach_outbox_impl` INSERT | `pgt_id as u32` | `meta.pgt_relid.into()` |
| `detach_outbox_impl` DELETE | `pgt_id as u32` | `meta.pgt_relid.into()` |
| `attach_embedding_outbox_impl` UPDATE | `WHERE pgt_schema=$2 AND pgt_name=$3` | sub-SELECT on `pgt_relid` |
| `is_outbox_enabled` | `WHERE stream_table_oid = pgt_id` | JOIN through `pgt_stream_tables` |
| `get_outbox_table_name` | `WHERE stream_table_oid = pgt_id` | JOIN through `pgt_stream_tables` |
| `get_embedding_vector_column` | `WHERE stream_table_oid = pgt_id` | JOIN through `pgt_stream_tables` |

The migration SQL corrects existing rows in production databases.
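
As a rough illustration of the corrected write and read paths, the sketch below models the two catalog tables in memory. It is not the shipped code: `Catalog`, its methods, and the SQL constant are assumptions made for the example, and only the table and column names (`pgt_outbox_config.stream_table_oid`, `pgt_stream_tables.pgt_id`, `pgt_relid`) come from the text above.

```rust
/// Assumed shape of a corrected read-path query (not the shipped SQL):
/// callers still pass `pgt_id`, and the JOIN maps it to the relation OID.
pub const IS_OUTBOX_ENABLED_SQL: &str = "\
SELECT EXISTS (
    SELECT 1
      FROM pgtrickle.pgt_outbox_config oc
      JOIN pgtrickle.pgt_stream_tables st ON st.pgt_relid = oc.stream_table_oid
     WHERE st.pgt_id = $1
)";

/// Simplified `pgt_stream_tables` row: sequential id plus `pg_class` OID.
#[derive(Debug, Clone, Copy)]
pub struct StreamTable {
    pub pgt_id: i64,
    pub pgt_relid: u32,
}

/// In-memory stand-in for the two catalog tables.
#[derive(Debug, Default)]
pub struct Catalog {
    pub stream_tables: Vec<StreamTable>,
    /// Values of `pgt_outbox_config.stream_table_oid`.
    pub outbox_oids: Vec<u32>,
}

impl Catalog {
    /// Resolves the sequential id to the relation OID, if the row exists.
    fn relid_of(&self, pgt_id: i64) -> Option<u32> {
        self.stream_tables
            .iter()
            .find(|st| st.pgt_id == pgt_id)
            .map(|st| st.pgt_relid)
    }

    /// Write path after the fix: store `pgt_relid`, never `pgt_id as u32`.
    pub fn attach_outbox(&mut self, pgt_id: i64) -> Result<(), String> {
        let relid = self
            .relid_of(pgt_id)
            .ok_or_else(|| format!("unknown pgt_id {}", pgt_id))?;
        if !self.outbox_oids.contains(&relid) {
            self.outbox_oids.push(relid);
        }
        Ok(())
    }

    /// Read path after the fix: the in-memory analogue of the JOIN above.
    pub fn is_outbox_enabled(&self, pgt_id: i64) -> bool {
        self.relid_of(pgt_id)
            .map_or(false, |relid| self.outbox_oids.contains(&relid))
    }
}
```

Callers keep passing `pgt_id`, so the public function signatures need not change. Only the stored value and the lookup differ.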

**Files changed:**
- `src/api/outbox.rs` — all 6 locations

**Tests:** TEST-001 E2E tests added to `tests/e2e_outbox_tests.rs`:
- `test_outbox_stream_table_oid_equals_pgt_relid` — JOIN invariant
- `test_outbox_stream_table_oid_exists_in_pg_class` — `pg_class.oid` existence

---

### COR-003: WAL Transition Handoff Gate

**Problem:**

`complete_wal_transition` in `src/wal_decoder.rs` executed:
1. `cdc::drop_change_trigger(source_oid)` — removes the row-level trigger
2. `StDependency::update_cdc_mode_for_source(source_oid, CdcMode::Wal)`

In the window between step 1 and step 2, the CDC trigger was gone but the
catalog still indicated TRIGGER mode. Any write to the source table in this
window would be captured neither by the trigger nor by the WAL reader (which
checks the catalog mode before processing). Those writes could be permanently
lost.

**Fix:**

The two steps now run in the opposite order while holding `pg_advisory_lock(-(oid_u32 as i64))`:

1. Acquire advisory lock on the stream table OID (negative key avoids collision
   with positive-key advisory locks used elsewhere).
2. Update catalog mode to `WAL`.
3. Drop the CDC trigger.
4. Release the advisory lock.

If step 2 or 3 fails, the lock is released and the error is propagated. The
atomicity gap is eliminated: when the trigger is gone, the catalog has already
been updated to WAL mode.
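
The ordering argument can be checked with a small in-memory model. This is a sketch under stated assumptions: `Source`, `covered`, `old_order`, and `complete_transition` are hypothetical names, the lock is a plain flag rather than a real advisory lock, and transaction rollback of the catalog update is not modelled. Only the key derivation and the step order come from the description above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CdcMode {
    Trigger,
    Wal,
}

/// Advisory-lock key as described above: widen the OID to i64, then negate,
/// so the key is never positive.
pub fn advisory_key(oid_u32: u32) -> i64 {
    -(oid_u32 as i64)
}

/// In-memory stand-in for one source table's catalog mode and trigger.
#[derive(Debug)]
pub struct Source {
    pub mode: CdcMode,
    pub trigger_present: bool,
    pub lock_held: bool,
    /// Every intermediate (mode, trigger_present) state, for inspection.
    pub history: Vec<(CdcMode, bool)>,
}

impl Source {
    pub fn new() -> Self {
        Source {
            mode: CdcMode::Trigger,
            trigger_present: true,
            lock_held: false,
            history: Vec::new(),
        }
    }

    fn record(&mut self) {
        self.history.push((self.mode, self.trigger_present));
    }
}

/// A write is captured if the trigger exists, or if the catalog already
/// tells the WAL reader to process this source.
pub fn covered(mode: CdcMode, trigger_present: bool) -> bool {
    trigger_present || mode == CdcMode::Wal
}

/// Old order: drop the trigger first, then update the catalog.
pub fn old_order(src: &mut Source) {
    src.trigger_present = false;
    src.record(); // (Trigger, false): nothing captures writes here
    src.mode = CdcMode::Wal;
    src.record();
}

/// New order, steps 1 to 4. `fail_drop` simulates a failure in step 3.
pub fn complete_transition(src: &mut Source, fail_drop: bool) -> Result<(), String> {
    src.lock_held = true; // 1. acquire the lock
    let result = transition_locked(src, fail_drop);
    src.lock_held = false; // 4. release on success and on error
    result
}

fn transition_locked(src: &mut Source, fail_drop: bool) -> Result<(), String> {
    src.mode = CdcMode::Wal; // 2. update the catalog mode
    src.record();
    if fail_drop {
        return Err("simulated trigger-drop failure".to_string());
    }
    src.trigger_present = false; // 3. drop the trigger
    src.record();
    Ok(())
}
```

In the new order every recorded state satisfies `covered`, while `old_order` passes through `(Trigger, false)`, the state in which writes are lost.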

**Files changed:**
- `src/wal_decoder.rs` — `complete_wal_transition` function

---

### COR-004: Pristine-Transaction Guard for Replication Slot Creation

**Problem:**

PostgreSQL forbids creating a logical replication slot in a transaction that has
already been assigned an XID (e.g. after any catalog write).
`create_replication_slot_pristine` had no guard for this, so in transactions
that had already performed DDL it failed with a raw PostgreSQL error, typically
SQLSTATE `25001` (active_sql_transaction).

**Fix:**

Added a check using `pg_sys::GetCurrentTransactionIdIfAny()` before calling the
slot-creation primitive:

```rust
// SAFETY: GetCurrentTransactionIdIfAny only reads backend-local transaction
// state; it never assigns an XID and has no side effects.
unsafe {
    if pg_sys::GetCurrentTransactionIdIfAny() != pg_sys::InvalidTransactionId {
        return Err(PgTrickleError::ReplicationSlotError(
            "replication slot must be created in a transaction with no prior \
             writes; call create_replication_slot_pristine in a fresh \
             transaction".to_string(),
        ));
    }
}
```

This returns an actionable error message to the caller instead of propagating
a PostgreSQL internal error with no context.

**Files changed:**
- `src/wal_decoder.rs` — `create_replication_slot_pristine` function

---

## Migration Notes

`sql/pg_trickle--0.71.0--0.72.0.sql` corrects any existing
`pgt_outbox_config.stream_table_oid` values that were stored as the internal
`pgt_id` integer rather than the actual `pg_class` OID:

```sql
UPDATE pgtrickle.pgt_outbox_config oc
   SET stream_table_oid = st.pgt_relid
  FROM pgtrickle.pgt_stream_tables st
 WHERE oc.stream_table_oid = st.pgt_id::oid
   AND oc.stream_table_oid != st.pgt_relid;
```

No other schema changes are made. The `tentative_frontier` column is retained
but will always be NULL after this version.

---

## Test Coverage

| ID | Type | File | Description |
|----|------|------|-------------|
| TEST-001a | E2E | `tests/e2e_outbox_tests.rs` | `stream_table_oid = pgt_relid` JOIN invariant |
| TEST-001b | E2E | `tests/e2e_outbox_tests.rs` | `stream_table_oid` exists in `pg_class.oid` |
| TEST-003a | Unit | `src/catalog.rs` | Frontier JSON serialization roundtrip |
| TEST-003b | Unit | `src/catalog.rs` | Default frontier is empty |
| TEST-003c | Unit | `src/catalog.rs` | Frontier with entry is not empty |
| TEST-003d | Unit | `src/catalog.rs` | Compile-time guard: DUR-1 functions absent |

---

## ADR References

- [ADR-004: Frontier Durability Model](../plans/adrs/ADR-004.md)