# v0.35.0 — Correctness Sprint, Reactive Subscriptions & Zero-Downtime Operations

> **Full technical details:** [v0.35.0.md-full.md](v0.35.0.md-full.md)

**Status: Released** | **Scope: Large**

> A correctness-and-quality sprint that closes the three-assessment-old EC-01
> phantom-row bug and hardens Citus chaos tolerance, combined with two new
> user-facing capabilities: live push notifications and zero-downtime query
> changes.

---

## What is this?

v0.35.0 combines a mandatory quality sprint with two long-requested operational capabilities:

1. **EC-01 correctness closeout** — the phantom-row residue in multi-table joins that has been tracked since v0.21.0 is fixed unconditionally. Every join delta is now routed through PH-D1 cleanup, and a 50,000-iteration property test proves convergence.
2. **Citus chaos hardening** — a new multi-container Docker Compose Citus test rig covers worker kill, coordinator failover, lease expiry, and rebalance scenarios that have had zero test coverage since Citus landed.
3. **Push notifications** — applications can subscribe to changes in a stream table and receive instant notifications, enabling real-time dashboards, live UIs, and event-driven microservices.
4. **Zero-downtime query changes** — modifying the defining query of a large stream table no longer requires a multi-minute lock on the table.

---

## EC-01 correctness closeout

The phantom-row residue bug (`is_deduplicated: false` at `src/dvm/operators/join.rs:657-668`) has been flagged in every overall assessment since v0.21.0. The v0.24.0 fix addressed the hash function but not the downstream Z-set pipeline, leaving PH-D1 cross-cycle cleanup opt-in and only invoked when a residual is detected — which itself depends on a flag that is not set on every code path.

v0.35.0 closes this definitively in two steps:

1. **Immediate fix:** route every refresh cycle unconditionally through PH-D1 with a batch size of 1,024 rows. The anti-join cost against the freshly applied delta is negligible and removes the residual-detection coupling.
2. **Proper fix:** re-engineer Part 2 row-id derivation so that Part 1a and Part 1b emit convergent ids; flip `is_deduplicated` to `true` for INNER joins on stable PKs; gate behind a 50,000-iteration proptest corpus.

Stream tables built on multi-table joins can then use DIFFERENTIAL refresh with full correctness confidence.

---

## Citus chaos hardening

The `pgt_st_locks` distributed mutex and `ensure_worker_slot` / rebalance recovery logic added in v0.32–v0.34 have never been tested under adversarial conditions. v0.35.0 adds `tests/e2e_citus_chaos_tests.rs` backed by a Docker Compose rig (coordinator + 3 workers) that drives:

- Worker kill-and-restart during an active poll cycle
- Coordinator restart mid-lease acquisition
- `pg_dist_node` removal and re-add of a worker
- Sustained 1k-stream-table refresh under continuous node churn

A new `citus-tests.yml` GitHub Actions workflow runs this suite on every push to `main`.

Two additional Citus scalability gaps close alongside the chaos rig:

- **`dblink` vs streaming libpq benchmark** (`CITUS-BENCH`) — the per-worker slot polling path has never been benchmarked. A new `benches/bench_remote_slot_poll.rs` compares `dblink`-wrapped `pg_logical_slot_get_changes()` against native libpq streaming at 1, 4, and 9 workers. If streaming delivers ≥ 30% lower p99 latency or ≥ 20% higher throughput, the migration happens in the same PR; otherwise the `dblink` path is formally closed as the right choice.
- **Cross-shard join advisory** (`CITUS-XSHARD`) — when a distributed stream table is keyed on `__pgt_row_id` (a surrogate) rather than the source table's distribution column, any query joining the ST back to its source incurs a cross-shard re-partition join. pg_trickle now detects this at `create_stream_table()` time and emits a `NOTICE` suggesting the `output_distribution_column` parameter. The co-location status is recorded in `pgt_stream_tables.citus_colocated_with` and surfaced in the `citus_status` view.

---

## Reactive subscriptions

`pgtrickle.subscribe('my_stream_table', 'my_notification_channel')` registers a listener. After every successful refresh that produces at least one change, pg_trickle sends a PostgreSQL `NOTIFY` message to the named channel with a payload like:

```json
{"name": "my_stream_table", "inserted_count": 12, "deleted_count": 3}
```

Any application holding a standard PostgreSQL connection and listening on that channel receives this signal immediately, without polling. This powers real-time dashboards, event-driven microservices, and reactive frontends — using nothing but a standard PostgreSQL driver, with no Kafka, no Debezium, no Hasura required.

A configurable coalescence window prevents notification storms when a stream table refreshes at high frequency.
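A minimal end-to-end sketch, using the illustrative names from above (psql stands in for the listening client; any standard driver that surfaces asynchronous notifications works the same way):

```sql
-- Register the subscription: changes to my_stream_table are announced
-- on channel my_notification_channel (API as documented above).
SELECT pgtrickle.subscribe('my_stream_table', 'my_notification_channel');

-- In any client session (psql shown here), start listening:
LISTEN my_notification_channel;

-- After the next refresh that produces at least one change, the session
-- receives an asynchronous notification carrying the JSON payload, e.g.:
--   {"name": "my_stream_table", "inserted_count": 12, "deleted_count": 3}
```

In application code the same signal arrives through the driver's notification hook (for example `PQnotifies()` in libpq), so no extension-specific client library is required.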
---

## Shadow-ST: zero-downtime query evolution

By default, calling `alter_query()` on a large stream table triggers a full re-computation of the entire result set. For a stream table with millions of rows, this can lock the table for minutes — an unacceptable operation in production.

The new `shadow_build := true` parameter to `alter_query()` changes how this works:

1. A parallel "shadow" stream table is created from the new query, invisible to users.
2. The shadow table is refreshed to convergence in the background, with no lock on the live table. The live table continues to serve reads and accept writes normally throughout.
3. When the shadow table has caught up, the storage is swapped atomically.
4. The new query goes live at the next refresh cycle. The shadow table is dropped.

The live table is readable and writable from start to finish.
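As a sketch, a shadow-built query change might look like the following. The stream-table name and the new defining query are hypothetical, and the positional signature of `alter_query()` is assumed here (schema-qualified to match the `subscribe()` call above); only `shadow_build := true` is the documented new surface:

```sql
-- Hypothetical stream table and replacement query. shadow_build := true
-- opts into the background shadow build instead of an in-place,
-- lock-holding full re-computation.
SELECT pgtrickle.alter_query(
    'orders_by_region',                    -- existing stream table
    $$ SELECT region, count(*) AS order_count
       FROM orders
       GROUP BY region $$,                 -- new defining query
    shadow_build := true
);
```

Reads and writes against `orders_by_region` continue uninterrupted while the shadow converges and the swap completes.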
---

## Also in v0.35.0

Beyond the headlining items, this release closes a wide range of quality and operational gaps identified in the v7 overall assessment:

- `EXPLAIN STREAM TABLE` — see which DVM operators your query compiled to
- `pg_trickle.force_full_refresh` GUC for incident-response override
- `pg_trickle.enabled = false` now also gates CDC trigger writes
- History prune moved to a dedicated background worker with `LIMIT` batching
- SQLSTATE error classifier wired end-to-end (replaces English-text matching)
- Relay secret interpolation via `${ENV:VAR}` in connection strings
- Relay backpressure and reconnection backoff
- Lightweight SQLancer run added to every PR gate
- Grafana p50/p99 refresh latency panels and alert rules
- Citus tutorial, outbox→relay→Kafka tutorial, `pg_trickle_dump` runbook
- `NOTICE` emitted on every FULL fallback so operators can detect DVM limits
- Multi-architecture Docker images (arm64 + amd64)

---

## Scope

v0.35.0 is a large release. It is the single highest-priority release before v1.0 because it closes the EC-01 correctness gap that affects every multi-table join workload. All other 1.0-track features are downstream of this.

---

*Previous: [v0.34.0 — Citus: Automated Distributed CDC & Shard Recovery](v0.34.0.md)*
*Next: [v0.36.0 — Structural Hardening, Performance & Temporal IVM](v0.36.0.md)*