> **Plain-language companion:** [v0.27.0.md](v0.27.0.md)

## v0.27.0 — Operability, Observability & DR

**Status: Planned.** Sourced from [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §4, §7, §9 — the remaining actionable items not addressed in v0.24.0–v0.26.0.

> **Release Theme**
> This release closes the final pre-1.0 operability gaps identified in the
> v0.23.0 deep-analysis report. Four complementary themes: (1) a **snapshot
> and PITR API** so fresh replicas can bootstrap from a point-in-time
> export rather than re-running the full defining query; (2) a
> **predictive maintenance window planner** that turns the v0.22 cost model
> into actionable schedule recommendations; (3) a **cluster-wide
> observability layer** exposing per-database worker allocation from the
> postmaster and adding per-DB Prometheus metric labels; and (4)
> **OpenMetrics conformance hardening** for the metrics endpoint, including
> cluster-wide aggregation and a conformance test. Together these items
> leave pg_trickle well-positioned for the v1.0 stable release.

### Stream-Table Snapshot & Point-in-Time Restore

> **In plain terms:** `snapshot_stream_table()` exports the current content
> of a stream table — its frontier, content hash, and all rows — into an
> archival companion table. `restore_from_snapshot()` rehydrates that state
> on a fresh replica in seconds, skipping the full defining-query
> re-execution. This aligns stream-table state with a logical point in time
> for PITR workflows.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| SNAP-1 | **`snapshot_stream_table(name, target)` SQL function.** Exports `(pgt_id, frontier, content_hash, rows)` to an archival table `pgtrickle.snapshot__`. Creates the table if it does not exist; overwrites with `CREATE TABLE … AS SELECT`. The snapshot includes the frontier LSN and the current `pgt_stream_tables` metadata row. Returns a `SnapshotAlreadyExists` error if `target` is given and already occupied. | 2d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| SNAP-2 | **`restore_from_snapshot(name, source)` SQL function.** Rehydrates a stream table from a snapshot table created by SNAP-1. Replays the archived frontier into `pgt_stream_tables`, bulk-inserts the rows, and skips the initial full-refresh cycle. `SnapshotSourceNotFound` and `SnapshotSchemaVersionMismatch` error variants. | 2d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| SNAP-3 | **`list_snapshots(name)` + `drop_snapshot(snapshot_table)`.** Monitoring function returning all snapshots for a given ST (name, creation time, row count, frontier, size_bytes). `drop_snapshot` drops the archival table and removes it from the metadata catalog. | 0.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| SNAP-4 | **Tests.** Integration: snapshot → drop ST → restore → verify rows and frontier match; schema-version mismatch returns an error; snapshot on an IMMEDIATE-mode ST. E2E: fresh-replica bootstrap via snapshot completes in < 5 s for a 1M-row ST. | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| SNAP-5 | **Documentation.** SQL_REFERENCE.md: snapshot/restore API. PATTERNS.md: "Replica Bootstrap & PITR Alignment" section. BACKUP_AND_RESTORE.md: updated to cover the snapshot path alongside `pg_dump`. | 0.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |

> **Snapshot/PITR subtotal: ~6 days**

### Predictive Maintenance Window Planner

> **In plain terms:** `recommend_schedule(name)` analyses the per-ST
> cost-model history (accumulated since v0.22) and returns a recommended
> `refresh_interval`, peak-window `cron` expression, and confidence score.
> A longer-term extension can flag expected cost spikes in advance so
> operators can act before SLA breaches occur.
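As a rough illustration of the snapshot and planner APIs sketched above: the function names follow the item tables, but the stream-table name `orders_summary`, the snapshot target `orders_summary_snap`, and the result columns shown are hypothetical until the items ship.

```sql
-- Hedged usage sketch; all functions below are planned (SNAP-1..3, PLAN-2),
-- and 'orders_summary' / 'orders_summary_snap' are made-up example names.

-- Export the current ST state (frontier, content hash, rows) to an archival table.
SELECT pgtrickle.snapshot_stream_table('orders_summary', 'orders_summary_snap');

-- On a fresh replica: rehydrate in seconds instead of re-running the defining query.
SELECT pgtrickle.restore_from_snapshot('orders_summary', 'orders_summary_snap');

-- Planner: surface the most mis-tuned stream tables first.
SELECT name, current_interval_seconds, recommended_interval_seconds, delta_pct
FROM pgtrickle.schedule_recommendations()
WHERE confidence >= 0.5
ORDER BY delta_pct DESC;
```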
| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| PLAN-1 | **`pgtrickle.recommend_schedule(name)` SQL function.** Returns a single JSONB row with `recommended_interval_seconds`, `peak_window_cron`, `confidence` (0–1), and `reasoning` (text). Uses the per-ST `last_full_ms`/`last_diff_ms` history and the v0.25.0 median+MAD model. Confidence is `0.0` if fewer than `pg_trickle.schedule_recommendation_min_samples` (default 20) observations are available. | 2d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| PLAN-2 | **`pgtrickle.schedule_recommendations()` set-returning function.** Returns one row per registered ST with `name`, `current_interval_seconds`, `recommended_interval_seconds`, `delta_pct`, `confidence`, `reasoning`. Sortable by `delta_pct DESC` so operators can quickly find the most mis-tuned STs. | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| PLAN-3 | **Spike-forecast alert.** Post-tick hook: if the cost model predicts the next refresh will exceed the ST's SLA by > 20 %, emit a `pg_trickle_alert` event `predicted_sla_breach` with `stream_table`, `predicted_ms`, and `sla_ms`. The alert is debounced — at most one per `pg_trickle.schedule_alert_cooldown_seconds` (default 300 s). | 1.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |
| PLAN-4 | **Tests + documentation.** Unit: `recommend_schedule` returns `confidence = 0.0` before `min_samples`; returns a non-trivial recommendation after synthetic history injection; the spike-forecast alert fires exactly once per cooldown window. SQL_REFERENCE.md: `recommend_schedule` + `schedule_recommendations` API. CONFIGURATION.md: two new GUCs. | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §7 |

> **Predictive planner subtotal: ~5.5 days**

### Cluster-Wide Observability

> **In plain terms:** The v0.25.0 `worker_allocation_status()` view covers
> per-database quota usage but only from within a single database connection.
> This adds a postmaster-level cluster summary visible from any database, tags
> every Prometheus metric with `db_oid` (enabling per-DB Grafana panels across
> a cluster), and publishes a multi-tenant deployment guide.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| CLUS-1 | **`pgtrickle.cluster_worker_summary()` SQL function.** Reads the worker-pool shared-memory block (already populated by all DB bgworkers) and returns one row per database: `db_oid`, `db_name`, `workers_active`, `workers_queued`, `quota`, `quota_utilization_pct`. Accessible from any database in the cluster without cross-DB SPI. | 2d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §4 |
| CLUS-2 | **Per-DB Prometheus metric labels.** Tag all metrics emitted by `src/metrics_server.rs` with `db_oid` and `db_name` labels. Enables per-DB Grafana panels and per-DB alerting rules without separate endpoints. | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §4, §9 |
| CLUS-3 | **`docs/integrations/multi-tenant.md` (new page).** Documents recommended multi-DB deployment patterns: quota allocation formula (`ceil(total_workers / N_databases)`), GUC configuration, Grafana dashboard snippets using `db_name` labels, and `cluster_worker_summary()` usage. | 0.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §4 |
| CLUS-4 | **`docs/SCALING.md` update.** Add a "Cluster-wide worker fairness" section cross-referencing `cluster_worker_summary()`, the new quota GUC documentation, and the multi-tenant integration page. | 0.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §4 |

> **Cluster observability subtotal: ~4 days**

### OpenMetrics Conformance & Metrics Hardening

> **In plain terms:** The `src/metrics_server.rs` endpoint introduced in
> v0.20 has zero unit tests and no validation that its output conforms to
> the OpenMetrics text format. This item adds a conformance test, port-conflict
> and timeout handling, and a cluster-wide aggregation view.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| METR-1 | **OpenMetrics conformance test.** Parse the `/metrics` output with the `openmetrics_parser` crate (or equivalent) and assert no validation errors. Run as a unit test in `src/metrics_server.rs` using a mock request. | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §9 |
| METR-2 | **Port-conflict and timeout unit tests.** Test that `metrics_server::start()` returns a typed `MetricsServerError::PortInUse` when the port is occupied, and `MetricsServerError::Timeout` when the request handler exceeds `pg_trickle.metrics_request_timeout_ms` (new GUC, default 5000 ms). | 1d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §9 |
| METR-3 | **`pgtrickle.metrics_summary()` cluster-wide aggregation view.** Set-returning function that aggregates key counters across all databases visible in `pg_stat_activity` (refresh count, error count, worker utilization). Feeds the cluster-level Grafana overview dashboard. | 1.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §9 |
| METR-4 | **Malformed-HTTP handler.** Catch malformed HTTP requests to the metrics endpoint; return 400 Bad Request with a plain-text error body rather than panicking. Add a unit test. | 0.5d | [PLAN_OVERALL_ASSESSMENT_2.md](plans/PLAN_OVERALL_ASSESSMENT_2.md) §9 |

> **Metrics hardening subtotal: ~4 days**

### Dependency Upgrades

> **In plain terms:** pgrx 0.18.0 updates the proc-macro and SPI interfaces.
> This item upgrades the dependency, audits all `pg_sys::*` call sites for
> breaking changes, and validates the full test suite under the new version.

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DEP-1 | **pgrx 0.17.0 → 0.18.0 upgrade.** Bump `pgrx` and `pgrx-tests` in `Cargo.toml`; run `cargo pgrx init` for the target PG 18 version; resolve any API breakage in proc-macro annotations, SPI call sites, memory-context helpers, and `pg_sys::*` usages. | 1–2d | — |
| DEP-2 | **Full test suite validation.** Run `just test-all` under pgrx 0.18.0; fix any regressions. Update the `AGENTS.md` pgrx version reference. | 0.5d | — |

> **Dependency upgrades subtotal: ~1.5–2.5 days**

### Implementation Phases

| Phase | Description | Duration |
|-------|-------------|----------|
| Phase 1 | Snapshot/PITR: catalog, SQL functions, tests, documentation | Days 1–6 |
| Phase 2 | Predictive planner: `recommend_schedule`, `schedule_recommendations`, spike-forecast alert, tests | Days 6–11.5 |
| Phase 3 | Cluster observability: `cluster_worker_summary`, per-DB labels, multi-tenant docs, SCALING.md | Days 11.5–15.5 |
| Phase 4 | Metrics hardening: OpenMetrics conformance, port-conflict tests, aggregation view, malformed-HTTP handler | Days 15.5–19.5 |
| Phase 5 | Dependency upgrades: pgrx 0.18.0, full test-suite validation | Days 19.5–21.5 |
| Phase 6 | Integration testing, upgrade script, documentation review | Days 21.5–24 |

> **v0.27.0 total: ~3–4 weeks** (~24 person-days solo)

**Exit criteria:**

- [ ] SNAP-1: `snapshot_stream_table()` creates an archival table with the correct frontier and row data
- [ ] SNAP-2: `restore_from_snapshot()` rehydrates the ST; the first refresh cycle after restore is DIFFERENTIAL (not FULL)
- [ ] SNAP-3: `list_snapshots()` lists all snapshots for an ST; `drop_snapshot()` removes the archival table and catalog row
- [ ] SNAP-4: Fresh-replica bootstrap via snapshot completes in < 5 s for a 1M-row ST
- [ ] SNAP-5: BACKUP_AND_RESTORE.md updated; PATTERNS.md "Replica Bootstrap & PITR Alignment" section added
- [ ] PLAN-1: `recommend_schedule()` returns `confidence = 0.0` before `min_samples`; returns a non-trivial recommendation with synthetic history
- [ ] PLAN-2: `schedule_recommendations()` returns one row per ST; sortable by `delta_pct`
- [ ] PLAN-3: `predicted_sla_breach` alert fires once per cooldown window; no duplicate alerts
- [ ] PLAN-4: All unit tests for the planner pass; two new GUCs documented
- [ ] CLUS-1: `cluster_worker_summary()` returns accurate per-DB worker counts from any database in the cluster
- [ ] CLUS-2: All Prometheus metrics carry `db_oid` and `db_name` labels; existing Grafana dashboard templates updated
- [ ] CLUS-3: `docs/integrations/multi-tenant.md` published with quota formula and Grafana snippets
- [ ] CLUS-4: `docs/SCALING.md` cluster-wide fairness section added
- [ ] METR-1: OpenMetrics conformance test passes; zero parse errors on live `/metrics` output
- [ ] METR-2: Port-conflict test returns `MetricsServerError::PortInUse`; timeout test returns `MetricsServerError::Timeout`
- [ ] METR-3: `metrics_summary()` returns aggregated counters; Grafana cluster-overview query documented
- [ ] METR-4: Malformed HTTP request returns 400 Bad Request; no panic
- [ ] DEP-1: pgrx bumped to 0.18.0; all API breakage resolved; extension builds clean
- [ ] DEP-2: `just test-all` passes under pgrx 0.18.0; `AGENTS.md` pgrx version reference updated
- [ ] Extension upgrade path tested (`0.26.0 → 0.27.0`)
- [ ] `just check-version-sync` passes

---