# v0.27.0 — Operability, Observability, and Disaster Recovery

> **Full technical details:** [v0.27.0.md-full.md](v0.27.0.md-full.md)

**Status: Planned** | **Scope: Medium** (~3–4 weeks)

> Snapshot and point-in-time restore for stream tables, predictive schedule
> recommendations, cluster-wide worker visibility, OpenMetrics conformance,
> and an upgrade to pgrx 0.18.

---

## What problem does this solve?

As pg_trickle approached its 1.0 milestone, a final set of operational gaps
became the focus:

- Bootstrapping a fresh replica with a stream table's content without
  re-running the defining query from scratch (too slow for large tables)
- Turning the accumulated cost model history into actionable schedule
  recommendations
- Making worker allocation visible across all databases in a cluster
- Ensuring the Prometheus metrics endpoint is formally conformant

---

## Stream Table Snapshot and Point-in-Time Restore

**`pgtrickle.snapshot_stream_table(name, target)`** exports the complete state
of a stream table — its current rows, its frontier (the position in the change
history up to which it has been refreshed), and its metadata — into an archival
companion table. This snapshot can be taken at any time and transferred to
another database instance.

**`pgtrickle.restore_from_snapshot(name, source)`** rehydrates a stream table
from a snapshot on a fresh instance. The stream table is populated from the
snapshot, and the first refresh cycle after restore runs differentially
(catching up only from the snapshot's frontier), rather than recomputing
everything from scratch.

*In plain terms:* if you add a new database replica, you no longer need to
wait for the stream tables to rebuild from scratch (which could take minutes
or hours for large tables). Copy the snapshot, restore it, and the replica's
stream tables only need to catch up on the changes made since the snapshot
was taken.

`pgtrickle.list_snapshots(name)` lists the snapshots of a stream table and
`pgtrickle.drop_snapshot(table)` removes one, which completes the snapshot
lifecycle.
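
The restore-then-catch-up behaviour described above can be modelled in a few lines of Rust. This is only a toy sketch, not pg_trickle's implementation: the `Snapshot` and `Change` types, the integer frontier, and the `restore_and_catch_up` function are all hypothetical names chosen for illustration. It shows why the first refresh after a restore is cheap: only changes positioned after the snapshot's frontier are applied.

```rust
use std::collections::BTreeMap;

/// Toy model of a snapshot (hypothetical shape): the rows plus the frontier
/// up to which those rows already reflect the change history.
#[derive(Clone, Debug, PartialEq)]
struct Snapshot {
    rows: BTreeMap<u64, i64>, // key -> value, standing in for the table's rows
    frontier: u64,            // last change position already included in `rows`
}

/// One entry in the change history: `Some(v)` upserts the key, `None` deletes it.
struct Change {
    position: u64,
    key: u64,
    value: Option<i64>,
}

/// Restore from a snapshot, then apply only the changes past its frontier.
/// Returns the caught-up state and the number of changes that were applied.
fn restore_and_catch_up(snapshot: &Snapshot, history: &[Change]) -> (Snapshot, usize) {
    let mut state = snapshot.clone();
    let mut applied = 0;
    for change in history.iter().filter(|c| c.position > snapshot.frontier) {
        match change.value {
            Some(v) => {
                state.rows.insert(change.key, v);
            }
            None => {
                state.rows.remove(&change.key);
            }
        }
        state.frontier = state.frontier.max(change.position);
        applied += 1;
    }
    (state, applied)
}
```

With a snapshot taken at frontier 5, a change at position 4 is skipped (it is already reflected in the snapshot rows), while changes at positions 6 and 7 are replayed.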

---

## Predictive Maintenance Window Planner

With months of refresh history accumulated from the cost model (v0.22.0
onwards), pg_trickle can now turn that history into recommendations:

**`pgtrickle.recommend_schedule(name)`** analyses the stream table's refresh
performance history and returns:

- A recommended refresh interval (shorter if the current one is too long for
  the observed latency, longer if it is unnecessarily tight)
- A suggested cron expression for off-peak scheduling
- A confidence score between 0 and 1, based on how much history is available

**`pgtrickle.schedule_recommendations()`** returns one row per stream table,
sorted by how far the current schedule deviates from the recommendation —
making it easy to find the most mis-configured stream tables at a glance.
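
As an illustration of the ordering described above, the sketch below sorts rows by how far the current interval is from the recommended one. The row shape (`ScheduleRow`), its field names, and the use of relative deviation as the sort key are assumptions made for this example; the actual columns and deviation metric of `schedule_recommendations()` may differ.

```rust
/// Hypothetical row shape; the real result columns may differ.
#[derive(Debug, Clone, PartialEq)]
struct ScheduleRow {
    name: String,
    current_interval_secs: f64,
    recommended_interval_secs: f64, // assumed to be strictly positive
    confidence: f64,                // 0.0..=1.0
}

impl ScheduleRow {
    /// Relative deviation of the current interval from the recommendation.
    /// This metric is an assumption chosen for illustration.
    fn deviation(&self) -> f64 {
        (self.current_interval_secs - self.recommended_interval_secs).abs()
            / self.recommended_interval_secs
    }
}

/// Sort so that the most mis-configured stream tables come first.
fn sort_by_deviation(rows: &mut [ScheduleRow]) {
    rows.sort_by(|a, b| b.deviation().total_cmp(&a.deviation()));
}
```

A table refreshing every 300 s with a 60 s recommendation (deviation 4.0) would sort ahead of one refreshing every 60 s with a 50 s recommendation (deviation 0.2).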

**Spike-forecast alerts** — when the cost model predicts the next refresh
will breach the stream table's SLA by more than 20%, a
`pg_trickle_alert predicted_sla_breach` notification is sent, with a
debounce to avoid alert storms.
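
The alert rule above can be sketched as a threshold check plus a debounce window. The sketch reads "breach the SLA by more than 20%" as "predicted duration exceeds 1.2 times the SLA", which is an interpretation; the `BreachAlerter` type, the debounce length, and the use of plain second timestamps are hypothetical.

```rust
/// Debounced predicted-SLA-breach check (illustrative, not pg_trickle's code).
struct BreachAlerter {
    debounce_secs: u64,         // hypothetical debounce window
    last_alert_at: Option<u64>, // timestamp (seconds) of the last alert sent
}

impl BreachAlerter {
    fn new(debounce_secs: u64) -> Self {
        BreachAlerter { debounce_secs, last_alert_at: None }
    }

    /// Returns true when an alert should be sent at `now_secs`.
    fn should_alert(&mut self, predicted_ms: f64, sla_ms: f64, now_secs: u64) -> bool {
        // Assumed reading of the rule: predicted duration > SLA + 20%.
        if predicted_ms <= sla_ms * 1.2 {
            return false;
        }
        // Suppress the alert if the previous one is still inside the window.
        let debounced = match self.last_alert_at {
            Some(t) => now_secs.saturating_sub(t) < self.debounce_secs,
            None => false,
        };
        if debounced {
            return false;
        }
        self.last_alert_at = Some(now_secs);
        true
    }
}
```

With a 300-second window, a second predicted breach 90 seconds after the first is suppressed, while one arriving after the window has elapsed fires again.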

---

## Cluster-Wide Worker Observability

`pgtrickle.cluster_worker_summary()` reads from shared memory and returns
one row per database in the cluster — worker count, queue depth, quota, and
utilisation percentage — accessible from any database connection without
cross-database SPI.

All Prometheus metrics now carry `db_oid` and `db_name` labels, enabling
per-database panels in Grafana dashboards across a multi-database cluster.
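
To make the labelling concrete, here is a minimal sketch of how a sample line carrying `db_oid` and `db_name` labels could be rendered in the Prometheus text format. The metric name `pg_trickle_refreshes_total` and both helper functions are hypothetical; only the label names come from this release. Label values are escaped for backslash, double quote, and newline, as the text format requires.

```rust
/// Escape a label value for the Prometheus text exposition format.
fn escape_label_value(v: &str) -> String {
    let mut out = String::with_capacity(v.len());
    for ch in v.chars() {
        match ch {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            c => out.push(c),
        }
    }
    out
}

/// Render one sample line with per-database labels.
/// The metric name passed in is up to the caller (hypothetical in this sketch).
fn render_sample(metric: &str, db_oid: u32, db_name: &str, value: u64) -> String {
    format!(
        "{}{{db_oid=\"{}\",db_name=\"{}\"}} {}",
        metric,
        db_oid,
        escape_label_value(db_name),
        value
    )
}
```

A Grafana panel can then filter or group on `db_name` to show one series per database.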

A new `docs/integrations/multi-tenant.md` guide covers recommended worker
quota allocation and Grafana configuration for multi-database deployments.

---

## OpenMetrics Conformance

The Prometheus metrics endpoint introduced in v0.21.0 had not been formally
validated against the OpenMetrics specification. A conformance test now parses
the `/metrics` output and fails if any format violations are found.
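
A conformance test of this kind could start from checks like the ones below. This is a deliberately small sketch covering only two OpenMetrics rules (the exposition must end with `# EOF`, and `# TYPE` lines must name a known metric type); it is not the project's actual test, and `check_exposition` is a hypothetical name. A real conformance test would validate far more of the specification.

```rust
/// Metric types defined by the OpenMetrics specification.
const KNOWN_TYPES: [&str; 8] = [
    "unknown", "gauge", "counter", "stateset",
    "info", "histogram", "gaugehistogram", "summary",
];

/// Check a small subset of OpenMetrics rules on a `/metrics` response body.
fn check_exposition(body: &str) -> Result<(), String> {
    // The exposition must end with "# EOF", optionally followed by a newline.
    let trimmed = body.strip_suffix('\n').unwrap_or(body);
    let mut lines: Vec<&str> = trimmed.split('\n').collect();
    if lines.pop() != Some("# EOF") {
        return Err("exposition must end with `# EOF`".to_string());
    }
    for (idx, line) in lines.iter().enumerate() {
        let line_no = idx + 1;
        // Nothing may follow an earlier EOF marker.
        if *line == "# EOF" {
            return Err(format!("line {}: content after `# EOF`", line_no));
        }
        // "# TYPE <name> <type>": the type must be one the spec defines.
        if let Some(rest) = line.strip_prefix("# TYPE ") {
            let kind = rest.split(' ').nth(1).unwrap_or("");
            if !KNOWN_TYPES.contains(&kind) {
                return Err(format!("line {}: unknown metric type `{}`", line_no, kind));
            }
        }
    }
    Ok(())
}
```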

Port-conflict and timeout errors from the metrics server are now typed
(`MetricsServerError::PortInUse`, `MetricsServerError::Timeout`) rather than
bare panics. Malformed HTTP requests to the metrics endpoint return a
`400 Bad Request` response instead of crashing.
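
The shape of such a typed error might look like the sketch below. Only the enum name and the `PortInUse` and `Timeout` variant names come from this release; the `port` payload, the `Other` catch-all variant, and the `classify_io_error` helper are assumptions added for illustration.

```rust
use std::fmt;
use std::io;

/// Typed metrics-server errors. Variant payloads are assumptions.
#[derive(Debug, PartialEq)]
enum MetricsServerError {
    PortInUse { port: u16 },
    Timeout,
    Other(io::ErrorKind), // hypothetical catch-all, not named in the release notes
}

impl fmt::Display for MetricsServerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::PortInUse { port } => write!(f, "metrics port {} is already in use", port),
            Self::Timeout => write!(f, "metrics server timed out"),
            Self::Other(kind) => write!(f, "metrics server I/O error: {:?}", kind),
        }
    }
}

impl std::error::Error for MetricsServerError {}

/// Map a low-level I/O error to a typed error instead of panicking.
fn classify_io_error(err: &io::Error, port: u16) -> MetricsServerError {
    match err.kind() {
        io::ErrorKind::AddrInUse => MetricsServerError::PortInUse { port },
        io::ErrorKind::TimedOut => MetricsServerError::Timeout,
        kind => MetricsServerError::Other(kind),
    }
}
```

Callers can then match on the variant, for example to retry on `Timeout` but surface `PortInUse` as a configuration problem.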

`pgtrickle.metrics_summary()` provides a cross-database aggregate view of
key counters, suitable for a cluster-overview Grafana dashboard.

---

## pgrx 0.18 Upgrade

The pgrx library (the framework that pg_trickle uses to interact with
PostgreSQL internals) was upgraded from 0.17 to 0.18. This brings updated
SPI interfaces, improved proc-macro support, and compatibility with the
latest PostgreSQL 18 API changes.

---

## Scope

v0.27.0 is the final pre-1.0 operability release. The snapshot/restore API
solves a real operational pain point for replica bootstrapping. The schedule
planner turns accumulated data into actionable recommendations. Cluster-wide
observability and OpenMetrics conformance round out the production-readiness
story ahead of the stable v1.0 release.