> **Plain-language companion:** [v0.20.0.md](v0.20.0.md)

## v0.20.0 — Self-Monitoring (pg_trickle Monitors Itself)

**Status: Released (2026-04-15).** All 62 items implemented, 1 skipped (PERF-6 already shipped in v0.19.0).

> **Release Theme**
> This release implements *self-monitoring*: pg_trickle uses its own stream
> tables to maintain reactive analytics over its internal catalog and
> refresh-history tables. Five self-monitoring stream tables (`df_efficiency_rolling`,
> `df_anomaly_signals`, `df_threshold_advice`, `df_cdc_buffer_trends`,
> `df_scheduling_interference`) replace repeated full-scan diagnostic
> functions with continuously-maintained incremental views, enable
> multi-cycle trend detection for threshold tuning, and surface anomalies
> reactively. An optional auto-apply policy layer can automatically adjust
> `auto_threshold` when confidence is high. This validates pg_trickle on
> its own non-trivial workload and demonstrates the incremental analytics
> value proposition to users.
>
> See [plans/PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) for the full
> design, architecture, and risk analysis.

### Phase 1 — Foundation

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DF-F1 | **Verify CDC on `pgt_refresh_history`.** Confirm that `create_stream_table()` installs INSERT triggers on `pgt_refresh_history`. Fix schema-exclusion logic if the `pgtrickle` schema is skipped. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 1 |
| DF-F2 | **Create `df_efficiency_rolling` (DF-1).** Maintained rolling-window aggregates over `pgt_refresh_history`. Replaces `refresh_efficiency()` full scans. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §5 DF-1 |
| DF-F3 | **E2E test: DF-1 output matches `refresh_efficiency()`.** Insert synthetic history rows, refresh DF-1, assert aggregates agree. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |
| DF-F4 | **`pgtrickle.setup_self_monitoring()` helper.** Single SQL call that creates all five `df_*` stream tables. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 4 |
| DF-F5 | **`pgtrickle.teardown_self_monitoring()` helper.** Drops all `df_*` stream tables cleanly. | 1h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 4 |

### Phase 2 — Anomaly Detection

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DF-A1 | **Create `df_anomaly_signals` (DF-2).** Detects duration spikes, error bursts, and mode oscillation by comparing recent behavior against DF-1 baselines. | 3–5h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §5 DF-2 |
| DF-A2 | **Create `df_threshold_advice` (DF-3).** Multi-cycle threshold recommendation replacing the single-step `compute_adaptive_threshold()` convergence. | 3–5h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §5 DF-3 |
| DF-A3 | **Verify DAG ordering.** DF-1 refreshes before DF-2 and DF-3. | 1–2h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 2 |
| DF-A4 | **E2E test: threshold spike detection.** Inject synthetic history making DIFF consistently fast; assert DF-3 recommends raising the threshold. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |
| DF-A5 | **E2E test: anomaly duration spike.** Inject a 3× duration spike; assert DF-2 detects it. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |

### Phase 3 — CDC Buffer & Interference

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DF-C1 | **Create `df_cdc_buffer_trends` (DF-4).** Tracks change-buffer growth rates per source table. May require `pgtrickle.cdc_buffer_row_counts()` helper for dynamic table names. | 4–8h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §5 DF-4 |
| DF-C2 | **Create `df_scheduling_interference` (DF-5).** Detects concurrent refresh overlap. FULL-refresh mode initially (bounded 1-hour window). | 3–5h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §5 DF-5 |
| DF-C3 | **E2E test: scheduling overlap detection.** Create 3 STs with overlapping schedules; verify DF-5 detects overlap. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |

### Phase 4 — GUC & Auto-Apply

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| DF-G1 | **`pg_trickle.self_monitoring_auto_apply` GUC.** Values: `off` (default) / `threshold_only` / `full`. Registered in `src/config.rs`. | 1–2h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §6.2 |
| DF-G2 | **Auto-apply worker (threshold_only).** Post-tick hook reads `df_threshold_advice`; applies `ALTER STREAM TABLE ... SET auto_threshold = ` when confidence is HIGH and delta > 5%. Rate-limited to 1 change per ST per 10 minutes. | 4–8h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 5 |
| DF-G3 | **`initiated_by = 'SELF_MONITOR'` audit trail.** Log auto-apply changes to `pgt_refresh_history`. | 1–2h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §7 Phase 5 |
| DF-G4 | **E2E test: auto-apply threshold.** Enable `threshold_only`, inject history making DIFF consistently faster, verify threshold increases automatically. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |
| DF-G5 | **E2E test: rate limiting.** Verify no more than 1 threshold change per ST per 10 minutes. | 1–2h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 |

### Phase 5 — Operational Diagnostics

| Item | Description | Effort | Ref |
|------|-------------|--------|-----|
| OPS-1 | **`pgtrickle.recommend_refresh_mode(st_name)`** Reads `df_threshold_advice` to return a structured recommendation `{ mode, confidence, reason }` rather than computing on demand. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §10.6 |
| OPS-2 | **`check_cdc_health()` spill-risk enrichment.** Query `df_cdc_buffer_trends` growth rate; emit a `spill_risk` alert when buffer growth will breach `spill_threshold_blocks` within 2 cycles. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §10.3 |
| OPS-3 | **`pgtrickle.scheduler_overhead()` diagnostic function.** Returns busy-time ratio, queue depth, avg dispatch latency, and fraction of CPU spent on DF STs vs user STs. | 2–4h | — |
| OPS-4 | **`pgtrickle.explain_dag()` — Mermaid/DOT output.** Returns DAG as Mermaid markdown with node colours: user=blue, self-monitoring=green, suspended=red. | 3–4h | — |
| OPS-5 | **`sql/self_monitoring_setup.sql` quick-start template.** Runnable script: call `setup_self_monitoring()`, set `self_monitoring_auto_apply = 'threshold_only'`, configure LISTEN, query initial recommendations. | 1h | — |
| OPS-6 | **Workload-aware poll intervals via DF-5 signal.** Replace `compute_adaptive_poll_ms()` exponential backoff with pre-emptive dispatch interval widening when `df_scheduling_interference` detects contention. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §10.2 |
| DASH-1 | **Grafana Self-Monitoring Dashboard.** New `monitoring/grafana/dashboards/pg_trickle_self_monitoring.json` — 5 panels reading from DF-1 through DF-5.
| 4–6h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §10.5 | | DBT-1 | **dbt `pgtrickle_enable_monitoring` post-hook macro.** Calls `setup_self_monitoring()` automatically after a successful `dbt run`; documented in `dbt-pgtrickle/`. | 2h | — | **OPS-1 — `pgtrickle.recommend_refresh_mode(st_name text)`** > Reads directly from `df_threshold_advice` instead of computing a > single-cycle cost comparison on demand (PLAN_SELF_MONITORING.md §10.6). Returns > `TABLE(mode text, confidence text, reason text)`. When confidence is LOW > (< 10 history rows), emits a fallback with mode=`'AUTO'` and a reason > explaining insufficient data. Integrates with `explain_st()` output. > > Verify: call on an ST with ≥ 20 history cycles; assert `mode` ∈ > `{'DIFFERENTIAL','FULL','AUTO'}` and `confidence` ∈ `{'HIGH','MEDIUM','LOW'}`. > Dependencies: DF-A2. Schema change: No. **OPS-2 — `check_cdc_health()` spill-risk enrichment** > Currently `check_cdc_health()` performs full-table scans to detect anomalies. > When DF-C1 is active, query `df_cdc_buffer_trends` growth rate instead. > Emit a `spill_risk = 'IMMINENT'` row when the 1-cycle growth rate extrapolated > 2 cycles ahead exceeds `spill_threshold_blocks`. Falls back to full scan > when self-monitoring is not set up. > > Verify: inject 80% of `spill_threshold_blocks` worth of buffer rows with a > steep growth rate; assert `check_cdc_health()` returns a spill-risk alert. > Dependencies: DF-C1. Schema change: No. **OPS-3 — `pgtrickle.scheduler_overhead()` diagnostic function** > Returns a snapshot of scheduler efficiency: `scheduler_busy_ratio` (fraction > of wall-clock time spent executing refreshes), `queue_depth` (STs waiting > to be dispatched), `avg_dispatch_latency_ms`, `df_refresh_fraction` (fraction > of busy time attributable to DF STs). This makes PERF-3's < 1% CPU target > observable in production without custom monitoring. 
> > Verify: function returns non-NULL values after 5+ refresh cycles; assert > `df_refresh_fraction < 0.01` in the soak test context. > Dependencies: DF-D4. Schema change: No (new function only). **OPS-4 — `pgtrickle.explain_dag()` — Mermaid / DOT graph output** > Returns the full refresh DAG as a Mermaid markdown string (default) or > Graphviz DOT (via `format => 'dot'` argument). Node labels show ST name, > current mode, and refresh interval. Node colours: user STs = blue, > self-monitoring STs = green, suspended = red, fused = orange. Edges show > dependency direction. Validates that DF-1 → DF-2 → DF-3 ordering is > correct post-setup. > > Verify: `SELECT pgtrickle.explain_dag()` after `setup_self_monitoring()` returns > a string containing all five `df_` nodes in green with correct edges. > Dependencies: None. Schema change: No (new function only). **OPS-5 — `sql/self_monitoring_setup.sql` quick-start template** > A standalone SQL script in `sql/` that an operator can run with > `psql -f sql/self_monitoring_setup.sql`. Contents: calls `setup_self_monitoring()`, > sets `pg_trickle.self_monitoring_auto_apply = 'threshold_only'`, runs > `LISTEN pg_trickle_alert`, queries `self_monitoring_status()` for a status > summary, and queries `df_threshold_advice` for initial recommendations > with a warm-up note. Referenced from GETTING_STARTED.md Day 2 operations > section (UX-4). > > Verify: script executes without errors on a fresh install; produces visible > output showing 5 active DF STs. Dependencies: DF-F4, DF-G1, UX-4. > Schema change: No. **OPS-6 — Workload-aware poll intervals via DF-5 signal** > Currently `compute_adaptive_poll_ms()` uses pure exponential backoff that > reacts to contention only after it occurs. 
Replace this with a pre-emptive > signal: after each scheduler tick, read the latest `overlap_count` from > `df_scheduling_interference`; if `overlap_count >= 2`, increase the dispatch > interval for the next tick by 20% before dispatching (capped at > `pg_trickle.max_poll_interval_ms`). This closes the self-monitoring feedback loop > by letting the analytics directly influence scheduling policy, reducing > contention on write-heavy deployments without waiting for timeouts. > > Verify: soak test with known-contending STs shows lower `overlap_count` in > DF-5 with signal enabled vs disabled. `scheduler_overhead()` shows reduced > busy-time ratio. Dependencies: DF-C2, OPS-3. Schema change: No. **DASH-1 — Grafana Self-Monitoring Dashboard** > Add `monitoring/grafana/dashboards/pg_trickle_self_monitoring.json` alongside > the existing `pg_trickle_overview.json`. Five panels: (1) Refresh throughput > timeline (DF-1 `avg_diff_ms` over time), (2) Anomaly heatmap (DF-2 per-ST > anomaly type grid), (3) Threshold calibration scatter (DF-3 current vs > recommended threshold), (4) CDC buffer growth sparklines (DF-4 per-source > growth rate), (5) Interference matrix (DF-5 overlap heatmap). Provisioned > automatically in `monitoring/grafana/provisioning/`. > > Verify: `docker compose up` in `monitoring/` loads both dashboards; > all five panels resolve without `No data` errors using the postgres-exporter > queries. Dependencies: DF-F2, DF-A1, DF-A2, DF-C1, DF-C2. Schema change: No. **DBT-1 — `pgtrickle_enable_monitoring` dbt post-hook macro** > Add a `pgtrickle_enable_monitoring` macro to `dbt-pgtrickle/macros/` that > calls `{{ pgtrickle.setup_self_monitoring() }}` and emits a `log()` message > confirming activation. Documented in `dbt-pgtrickle/README.md`. Users add > `+post-hook: "{{ pgtrickle_enable_monitoring() }}"` to `dbt_project.yml` > to auto-enable monitoring after any `dbt run`. 
Idempotent — safe to call on > every run because `setup_self_monitoring()` is already idempotent (STAB-1). > > Verify: `just test-dbt` includes a test case that runs the macro twice; > asserts `self_monitoring_status()` shows 5 active STs after both calls. > Dependencies: DF-F4, STAB-1. Schema change: No. ### Documentation & Safety | Item | Description | Effort | Ref | |------|-------------|--------|-----| | DF-D1 | **SQL_REFERENCE.md: self-monitoring quick start.** Document `setup_self_monitoring()`, `teardown_self_monitoring()`, all five `df_*` stream tables, and the auto-apply GUC. | 2–4h | — | | DF-D2 | **CONFIGURATION.md: `pg_trickle.self_monitoring_auto_apply` GUC.** | 1h | — | | DF-D3 | **E2E test: control plane survives DF ST suspension.** Drop or suspend all `df_*` STs; verify the scheduler and refresh logic operate identically. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 | | DF-D4 | **Soak test addition.** Add self-monitoring STs to the existing soak test; verify no memory growth or scheduler stalls under 1-hour sustained load. | 2–4h | [PLAN_SELF_MONITORING.md](plans/PLAN_SELF_MONITORING.md) §8 | ### Correctness | ID | Title | Effort | Priority | |----|-------|--------|----------| | CORR-1 | `df_threshold_advice` output always within \[0.01, 0.80\] | S | P0 | | CORR-2 | DF-2 suppresses false-positive spike on first-ever refresh | S | P0 | | CORR-3 | `avg_change_ratio` never NaN/Inf on zero-delta streams | S | P0 | | CORR-4 | CDC INSERT-only invariant verified on `pgt_refresh_history` | XS | P1 | | CORR-5 | DF-1 historical window boundary is exclusive, not inclusive | XS | P1 | **CORR-1 — `df_threshold_advice` output always within \[0.01, 0.80\]** > The `LEAST(0.80, GREATEST(0.01, …))` expression in DF-3 must hold for all > input combinations including NULL `avg_diff_ms`, zero `avg_full_ms`, and > extreme ratios. 
Add a property-based test (proptest) that generates random > `(avg_diff_ms, avg_full_ms, current_threshold)` triples and asserts the > output is always in the valid range. Any value outside [0.01, 0.80] that > reaches auto-apply would corrupt stream table configuration. > > Verify: proptest with 10,000 iterations; zero out-of-range results. > Dependencies: DF-A2. Schema change: No. **CORR-2 — DF-2 suppresses false-positive spike on first-ever refresh** > `df_anomaly_signals` compares `latest.duration_ms` against `eff.avg_diff_ms`. > On the very first refresh of a stream table there is no rolling average yet > (`eff.avg_diff_ms IS NULL`), so the `CASE WHEN` would produce no anomaly. > Confirm the LATERAL subquery returns NULL (not 0) when history is empty, > and that the `CASE` guard is `> 3.0 * NULLIF(eff.avg_diff_ms, 0)` so a > NULL baseline never triggers a spike. > > Verify: E2E test creating a brand-new ST; assert `duration_anomaly IS NULL` > on first DF-2 refresh. Dependencies: DF-A1. Schema change: No. **CORR-3 — `avg_change_ratio` never NaN/Inf on zero-delta streams** > DF-1 computes `avg(h.delta_row_count::float / NULLIF(h.rows_inserted + > h.rows_deleted, 0))`. If a stream table runs only FULL refreshes (no DIFF > cycles) the divisor is always NULL and `avg()` returns NULL — correct. But > if DIFF runs with exactly zero rows inserted and zero deleted (CDC buffer was > empty), `NULLIF` must prevent a divide-by-zero NaN. Verify the guard holds > and that `avg_change_ratio` is either a valid float in [0, 1] or NULL. > > Verify: E2E test triggering a DIFF refresh on a quiescent source; assert > `avg_change_ratio IS NULL OR avg_change_ratio BETWEEN 0 AND 1`. > Dependencies: DF-F2. Schema change: No. **CORR-4 — CDC INSERT-only invariant verified on `pgt_refresh_history`** > `pgt_refresh_history` is semantically append-only: rows are only ever > INSERTed (one per refresh). 
The CDC trigger installed by DF-F1 must be > an INSERT-only trigger (no UPDATE/DELETE triggers). If the trigger were > registered as `FOR EACH ROW AFTER INSERT OR UPDATE`, a future catalog UPDATE > would generate spurious change-buffer rows and corrupt DF-1 aggregates. > Inspect `pg_trigger` to confirm only an `INSERT` trigger exists. > > Verify: `SELECT tgtype FROM pg_trigger WHERE tgrelid = 'pgtrickle.pgt_refresh_history'::regclass` > returns only INSERT-event triggers. Dependencies: DF-F1. Schema change: No. **CORR-5 — DF-1 historical window boundary is exclusive, not inclusive** > The `WHERE h.start_time > now() - interval '1 hour'` clause uses a > strict `>` comparison. This ensures a row with `start_time` exactly > equal to the boundary is excluded on each pass, preventing double-counting > in rolling aggregates. Confirm the query plan uses the index on > `(pgt_id, start_time)` (see PERF-2) and that the boundary is consistent > across DF-1, DF-2, and DF-4 (all use the same 1-hour lookback). > > Verify: unit test comparing aggregate output with a row at the exact boundary; > assert it is excluded. Dependencies: DF-F2. Schema change: No. --- ### Stability | ID | Title | Effort | Priority | |----|-------|--------|----------| | STAB-1 | `setup_self_monitoring()` is fully idempotent | S | P0 | | STAB-2 | Auto-apply handles `ALTER STREAM TABLE` failure gracefully | S | P0 | | STAB-3 | DF STs survive `DROP EXTENSION` + `CREATE EXTENSION` cycle | S | P1 | | STAB-4 | Auto-apply worker checks ST still exists before applying | XS | P1 | | STAB-5 | `teardown_self_monitoring()` is safe when some DF STs already removed | XS | P1 | **STAB-1 — `setup_self_monitoring()` is fully idempotent** > Calling `setup_self_monitoring()` a second time while DF STs already exist must > not raise an error. Use `IF NOT EXISTS` semantics internally (or check catalog > before creating). The function must also be safe to call concurrently from > two sessions. 
Idempotency is critical for upgrade scripts and Terraform-style > declarative deployment workflows. > > Verify: call `setup_self_monitoring()` three times in a row; no errors, no > duplicate stream tables. Dependencies: DF-F4. Schema change: No. **STAB-2 — Auto-apply handles `ALTER STREAM TABLE` failure gracefully** > The auto-apply post-tick hook reads `df_threshold_advice` and issues > `ALTER STREAM TABLE … SET auto_threshold = `. If the stream > table was dropped between the advice read and the apply (a TOCTOU race), > the ALTER will error. Catch SQL errors in the post-tick hook with an > appropriate `match` on `PgTrickleError` and log a WARNING rather than > crashing the background worker. > > Verify: unit test with a mocked `ALTER` that returns `ERROR: relation does > not exist`; assert the worker logs a warning and continues to the next > advice row. Dependencies: DF-G2. Schema change: No. **STAB-3 — DF STs survive `DROP EXTENSION` + `CREATE EXTENSION` cycle** > `DROP EXTENSION pg_trickle CASCADE` drops all extension-owned objects. > After `CREATE EXTENSION pg_trickle`, `setup_self_monitoring()` should recreate > the DF STs cleanly. There must be no leftover triggers, orphaned change > buffer tables, or stale catalog rows from the previous installation. This > is the most likely failure mode after an emergency rollback + reinstall. > > Verify: E2E test: `setup_self_monitoring()` → `DROP EXTENSION CASCADE` → > `CREATE EXTENSION` → `setup_self_monitoring()` → insert history → refresh DF-1; > assert correct aggregates. Dependencies: DF-F4, DF-F5. Schema change: No. **STAB-4 — Auto-apply worker checks ST still exists before applying** > Before issuing `ALTER STREAM TABLE`, the worker should confirm the ST is > still in `pgt_stream_tables` and is not in SUSPENDED or FUSED state. Applying > a threshold change to a SUSPENDED ST is harmless but wasteful; applying to a > FUSED ST is wrong (the fuse exists for a reason). 
Add a pre-apply guard in the Rust post-tick hook.
>
> Verify: E2E test suspending an ST manually while auto-apply is enabled;
> assert no threshold change is applied to a suspended stream table.
> Dependencies: DF-G2. Schema change: No.

**STAB-5 — `teardown_self_monitoring()` is safe when some DF STs already removed**

> If a user manually drops `df_anomaly_signals` before calling
> `teardown_self_monitoring()`, the teardown function must not error on `DROP
> STREAM TABLE df_anomaly_signals`. Use `drop_stream_table(name, if_exists
> => true)` semantics for each DF table in the teardown. Otherwise a partial
> teardown leaves the system in an inconsistent state.
>
> Verify: drop two DF STs manually, then call `teardown_self_monitoring()`; assert
> no errors and remaining DF STs are gone. Dependencies: DF-F5. Schema change: No.

---

### Performance

| ID | Title | Effort | Priority |
|----|-------|--------|----------|
| PERF-1 | Index on `pgt_refresh_history(pgt_id, start_time)` for DF queries | XS | P0 |
| PERF-2 | Benchmark DF-1 vs `refresh_efficiency()` on 10 K history rows | S | P0 |
| PERF-3 | Dog-feeding scheduler overhead target: < 1% of total CPU | S | P1 |
| PERF-4 | DF-5 self-join uses bounded index scan, not seq-scan | S | P1 |
| PERF-5 | History pruning batch-DELETE with short transactions (no CDC lock contention) | S | P1 |
| PERF-6 | Columnar change tracking Phase 1 — CDC bitmask (deferred from v0.17/v0.18) | M | P1 |

**PERF-1 — Index on `pgt_refresh_history(pgt_id, start_time)` for DF queries**

> All five DF stream tables filter `pgt_refresh_history` on `(pgt_id,
> start_time)`. Without a composite index on these columns the rolling-window
> WHERE clause forces a sequential scan of the growing history table. Verify
> the index was created during extension install (check the upgrade migration);
> if missing, add it as part of the 0.19.0 → 0.20.0 migration script.
> > Verify: `EXPLAIN (FORMAT TEXT) SELECT … FROM pgtrickle.pgt_refresh_history > WHERE pgt_id = 1 AND start_time > now() - interval '1 hour'` shows an index > scan. Schema change: Yes (index addition in migration script). **PERF-2 — Benchmark DF-1 vs `refresh_efficiency()` on 10 K history rows** > The primary performance claim of self-monitoring is that a maintained DIFFERENTIAL > stream table is cheaper than scanning the full history table on every > diagnostic call. Establish a Criterion micro-benchmark that seeds 10 K history > rows, then compares: (a) a full `SELECT * FROM pgtrickle.refresh_efficiency()` > call vs (b) a `SELECT * FROM pgtrickle.df_efficiency_rolling` read after one > incremental refresh. The benchmark documents the win concretely. > > Verify: Criterion benchmark shows DF-1 read is at least 5× faster than > `refresh_efficiency()` at 10 K rows. Included in `benches/` and run in CI. > Dependencies: DF-F2. Schema change: No. **PERF-3 — Dog-feeding scheduler overhead target: < 1% of total CPU** > Five DF STs at 48–96 s schedules add background refresh work. Under a > realistic load (20 user STs, 10 K history rows), the total time spent > refreshing DF STs should be < 1% of total scheduler CPU. Measure in the > E2E soak test by comparing scheduler loop busy-time with and without DF STs. > If overhead exceeds 1%, relax schedules to 120 s or move DF STs to > `refresh_tier = 'cold'`. > > Verify: soak test reports DF refresh overhead as a fraction of total > scheduler CPU; assert < 1%. Dependencies: DF-D4. Schema change: No. **PERF-4 — DF-5 self-join uses bounded index scan, not seq-scan** > `df_scheduling_interference` joins `pgt_refresh_history` to itself on an > overlap condition with a 1-hour bound. Without the index from PERF-1 this > double-scan is O(N²) in history rows. Verify EXPLAIN shows nested-loop > index scans (not hash or merge join over full table) for both sides of the > self-join. 
If the planner chooses a seq-scan, add `enable_seqscan = off` > for the DF-5 query or restructure with a CTE. > > Verify: EXPLAIN of DF-5 query shows index scans on both sides of the JOIN. > Dependencies: PERF-1, DF-C2. Schema change: No. **PERF-5 — History pruning batch-DELETE with short transactions** > `pg_trickle.history_retention_days` cleanup (shipped in v0.19.0) currently > deletes rows in a single long transaction. Under self-monitoring, that transaction > holds a lock on `pgt_refresh_history` that can delay CDC trigger INSERTs. > Rewrite the purge as batched DELETEs: delete at most 500 rows per > transaction, commit between batches, sleep 50 ms between batches. The index > from PERF-1 ensures each batch is an index-range scan, not a seq-scan. > > Verify: soak test running history purge concurrently with DF CDC trigger > INSERTs; no lock wait timeout observed. Batch size configurable via > `pg_trickle.history_purge_batch_size` GUC (default 500). > Dependencies: PERF-1. Schema change: No. **PERF-6 — Columnar change tracking Phase 1 — CDC bitmask** > *Deferred from v0.17.0 (twice) and v0.18.0.* Dog-feeding now provides > concrete internal workload data that justifies the schema change. Phase 1 > only: compute `changed_columns` bitmask (`old.col IS DISTINCT FROM new.col`) > in the CDC trigger for UPDATE rows; store as `int8` in the change buffer. > Phase 2 (delta-scan filtering using the bitmask) deferred to v0.22.0. > Gate behind `pg_trickle.columnar_tracking` GUC (default `off`). This is the > foundation for 50–90% delta volume reduction on wide-table UPDATE workloads. > > Verify: UPDATE a 20-column row, changing 2 columns; assert `changed_columns` > bitmask has exactly 2 bits set. `just check-upgrade-all` passes. > Dependencies: None. Schema change: Yes (change buffer schema addition + migration script). 
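PERF-6's Phase 1 deliverable is just the bitmask computation. A minimal Rust sketch of the intended semantics, assuming `IS DISTINCT FROM` (NULL-aware) comparison and at most 64 columns per table; the types and function name here are illustrative, not the extension's actual CDC trigger code:

```rust
// Illustrative sketch of the PERF-6 Phase 1 bitmask; not the extension's
// actual trigger code. Columns are modeled as Option<&str> so that
// NULL-aware `IS DISTINCT FROM` comparison falls out of Option equality.
fn changed_columns_bitmask(old: &[Option<&str>], new: &[Option<&str>]) -> i64 {
    assert!(old.len() == new.len() && old.len() <= 64, "int8 mask covers at most 64 columns");
    let mut mask: i64 = 0;
    for (i, (o, n)) in old.iter().zip(new.iter()).enumerate() {
        // IS DISTINCT FROM: NULL vs NULL is not distinct; NULL vs value is.
        if o != n {
            mask |= 1i64 << i;
        }
    }
    mask
}

fn main() {
    // A 20-column row where only columns 3 and 17 change (one of them NULL -> value).
    let mut old: Vec<Option<&str>> = vec![Some("x"); 20];
    let mut new = old.clone();
    new[3] = Some("y");
    old[17] = None;
    let mask = changed_columns_bitmask(&old, &new);
    assert_eq!(mask.count_ones(), 2); // the PERF-6 verify condition
}
```

Because the mask is positional, the deferred Phase 2 delta-scan filtering could later test `mask & (1 << col)` to skip unchanged columns without re-reading the row.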
---

### Scalability

| ID | Title | Effort | Priority |
|----|-------|--------|----------|
| SCAL-1 | DF STs refresh within window at 100 user stream tables | S | P1 |
| SCAL-2 | `pgt_refresh_history` retention interacts correctly with self-monitoring | S | P1 |
| SCAL-3 | 1-hour rolling window doesn't over-aggregate when history is sparse | XS | P2 |

**SCAL-1 — DF STs refresh within window at 100 user stream tables**

> With 100 user STs generating up to 100 history rows per 48 s window, DF-1
> processes up to ~7,500 rows/hour. Verify that the DIFFERENTIAL refresh of
> DF-1 completes within its 48 s schedule interval at this load, leaving
> margin for DF-2 and DF-3. If DF-1 duration exceeds 10 s, investigate query
> plan and index usage. Run as part of the soak test at high table count.
>
> Verify: soak test with 100 STs; DF-1 refresh duration < 10 s throughout.
> Dependencies: PERF-1. Schema change: No.

**SCAL-2 — `pgt_refresh_history` retention interacts correctly with self-monitoring**

> `pg_trickle.history_retention_days` (shipped in v0.19.0, default 90 days)
> purges old history rows. DF-1 only looks back 1 hour, so retention does
> not affect correctness. However, the purge job must not hold a long-running
> lock that delays CDC trigger firing on concurrent INSERTs into the history
> table. Verify that the cleanup job uses a DELETE … RETURNING batch strategy
> with short transactions to avoid blocking DF CDC triggers.
>
> Verify: E2E test running the history purge job while DF-1 is being refreshed;
> no lock wait timeout, no CDC trigger delay. Dependencies: DF-F1. Schema change: No.

**SCAL-3 — 1-hour rolling window doesn't over-aggregate when history is sparse**

> For a stream table that refreshes every 30 minutes (2 refreshes/hour), the
> DF-1 1-hour window contains at most 2 rows. The `AVG()` aggregate is still
> meaningful, but `percentile_cont(0.95)` over 2 rows is misleading. Document
> the minimum sample size (in the `confidence` column of DF-3) and add a note
> in SQL_REFERENCE.md that DF stats are most meaningful for STs refreshing
> every 60 s or faster.
>
> Verify: SQL_REFERENCE.md updated; `confidence = 'LOW'` for STs with
> `total_refreshes < 10`. Dependencies: DF-A2. Schema change: No.

---

### Ease of Use

| ID | Title | Effort | Priority |
|----|-------|--------|----------|
| UX-1 | `pgtrickle.self_monitoring_status()` diagnostic function | S | P0 |
| UX-2 | `setup_self_monitoring()` warm-up hint when history is sparse | XS | P1 |
| UX-3 | NOTIFY on anomaly via `pg_trickle_alert` channel | S | P1 |
| UX-4 | GETTING_STARTED.md: "Day 2 operations" section | S | P1 |
| UX-5 | `explain_st()` shows if a DF ST covers the queried stream table | XS | P2 |
| UX-6 | `recommend_refresh_mode()` exposed in `explain_st()` JSON output | XS | P2 |
| UX-7 | `scheduler_overhead()` output included in TUI diagnostics panel | XS | P2 |
| UX-8 | `df_threshold_advice` extended with SLA headroom column | S | P2 |

**UX-1 — `pgtrickle.self_monitoring_status()` diagnostic function**

> A single-query overview of the self-monitoring analytics plane: name, last
> refresh timestamp, row count, and whether the DF ST is ACTIVE / SUSPENDED /
> NOT_CREATED. This is the first function an operator should run
> to check that self-monitoring is working. Return type: `TABLE(df_name text,
> status text, last_refresh timestamptz, row_count bigint, note text)`.
>
> Verify: function returns 5 rows when all DF STs are active; returns rows with
> `status = 'NOT_CREATED'` when `setup_self_monitoring()` has not been called.
> Schema change: No (new function only).

**UX-2 — `setup_self_monitoring()` warm-up hint when history is sparse**

> If `pgt_refresh_history` has fewer than 50 rows when `setup_self_monitoring()`
> is called, emit a NOTICE: `"Dog-feeding stream tables created. DF analytics
> will populate as refresh history accumulates (currently N rows; recommend
> ≥ 50 before consulting df_threshold_advice)."` This prevents operators from
> acting on meaningless LOW-confidence advice immediately after setup.
>
> Verify: call `setup_self_monitoring()` on a fresh install; assert NOTICE contains
> the row count and the ≥ 50 recommendation. Dependencies: DF-F4. Schema change: No.

**UX-3 — NOTIFY on anomaly via `pg_trickle_alert` channel**

> When `df_anomaly_signals` detects `duration_anomaly IS NOT NULL` or
> `recent_failures >= 2` after a refresh, emit a `pg_notify('pg_trickle_alert',
> payload::text)` with `event = 'self_monitor_anomaly'`, the stream table name,
> anomaly type, last duration, baseline, and a plain-English recommendation.
> This integrates with existing alert pipelines without requiring a new channel.
> Fires from a post-refresh trigger on `df_anomaly_signals` or from the
> auto-apply post-tick hook.
>
> Verify: E2E test LISTEN on `pg_trickle_alert`; inject a 3× duration spike;
> assert NOTIFY payload arrives with correct anomaly type. Dependencies:
> DF-A1. Schema change: No.

**UX-4 — GETTING_STARTED.md: "Day 2 operations" section**

> Add a new section to `docs/GETTING_STARTED.md` covering the first steps
> after initial deployment: (1) enable self-monitoring with `setup_self_monitoring()`,
> (2) check status with `self_monitoring_status()`, (3) query `df_threshold_advice`
> to tune thresholds, (4) set up anomaly alerting via LISTEN. This gives new
> users a clear post-install checklist and demonstrates the self-monitoring value
> proposition immediately.
>
> Verify: documentation PR reviewed; code examples in GETTING_STARTED.md
> execute without modification. Dependencies: UX-1, UX-2. Schema change: No.
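The condition UX-3 alerts on is the DF-2 spike rule. A minimal sketch of that guard, assuming the rule is "latest duration exceeds 3× the DF-1 rolling average, with a missing or zero baseline suppressed" (mirroring the `> 3.0 * NULLIF(eff.avg_diff_ms, 0)` guard from CORR-2); the function name and anomaly label are illustrative:

```rust
// Sketch of the DF-2 duration-spike guard that UX-3 alerts on. A refresh is
// anomalous when it exceeds 3x the rolling baseline; a missing baseline
// (first-ever refresh, per CORR-2) must never trigger a false positive.
fn duration_anomaly(latest_ms: f64, baseline_avg_ms: Option<f64>) -> Option<&'static str> {
    match baseline_avg_ms {
        // NULL baseline: no history yet, suppress (CORR-2).
        None => None,
        // Zero baseline: mirrors NULLIF(avg_diff_ms, 0), also suppressed.
        Some(avg) if avg <= 0.0 => None,
        Some(avg) if latest_ms > 3.0 * avg => Some("duration_spike"),
        Some(_) => None,
    }
}

fn main() {
    assert_eq!(duration_anomaly(900.0, Some(100.0)), Some("duration_spike"));
    assert_eq!(duration_anomaly(900.0, None), None); // first refresh: no false positive
    assert_eq!(duration_anomaly(250.0, Some(100.0)), None); // within 3x of baseline
}
```

Keeping the guard total over `Option` is the point: the NULL-handling branch is exactly what the CORR-2 E2E test pins down.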
**UX-5 — `explain_st()` shows if a DF ST covers the queried stream table** > When a user calls `pgtrickle.explain_st('my_table')`, append a line > `"Dog-feeding coverage: df_efficiency_rolling ✓, df_threshold_advice ✓"` (or > `"Not set up — run setup_self_monitoring()"`) to the output. This surfaces the > analytics plane to users who might not know self-monitoring exists, without > requiring a separate function call. > > Verify: `SELECT explain_st('any_table')` output includes a `self_monitoring` > field in the JSON output. Dependencies: UX-1. Schema change: No. **UX-8 — `df_threshold_advice` extended with SLA headroom column** > Extend the DF-3 defining query to include a computed `sla_headroom_ms` > column: `freshness_deadline_ms - avg_diff_ms` from `pgt_refresh_history`. > When `sla_headroom_ms < 0`, add a boolean `sla_breach_risk = true` flag so > operators can see at a glance which STs risk missing their freshness SLA on > the next DIFFERENTIAL cycle. The `freshness_deadline` column already exists > in `pgt_refresh_history` (since v0.2.3). No schema change required. > > Verify: create an ST with a tight `freshness_deadline`; run slow synthetic > refreshes; assert `df_threshold_advice.sla_breach_risk = true`. > Dependencies: DF-A2. Schema change: No (view column addition only). **UX-6 — `recommend_refresh_mode()` exposed in `explain_st()` JSON output** > `explain_st()` already shows self-monitoring coverage (UX-5). Extend its JSON > output with a `recommended_mode` field reading from `df_threshold_advice` > (OPS-1). If OPS-1 is not available (no DF setup), fall back to `null` with > a `setup_self_monitoring()` hint. Keeps the single-function diagnostic surface > comprehensive without requiring separate calls. > > Verify: `SELECT explain_st('any_table')` JSON includes `recommended_mode` > and `mode_confidence` fields. Dependencies: OPS-1. Schema change: No. 
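The UX-8 arithmetic is small enough to pin down in a sketch, assuming the column semantics described above (`sla_headroom_ms = freshness_deadline_ms - avg_diff_ms`, breach risk when negative); the struct and helper are illustrative, not the view definition:

```rust
// Illustrative model of the two UX-8 columns added to df_threshold_advice.
// Field names mirror the doc's column names; the struct itself is a sketch.
struct SlaAdvice {
    sla_headroom_ms: f64,
    sla_breach_risk: bool,
}

fn sla_advice(freshness_deadline_ms: f64, avg_diff_ms: f64) -> SlaAdvice {
    // Headroom is how much of the freshness budget the average DIFF leaves unused.
    let headroom = freshness_deadline_ms - avg_diff_ms;
    SlaAdvice { sla_headroom_ms: headroom, sla_breach_risk: headroom < 0.0 }
}

fn main() {
    let ok = sla_advice(5_000.0, 1_200.0);
    assert!(!ok.sla_breach_risk);
    let risky = sla_advice(1_000.0, 1_500.0); // avg DIFF already exceeds the deadline
    assert!(risky.sla_breach_risk && risky.sla_headroom_ms == -500.0);
}
```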
**UX-7 — `scheduler_overhead()` output included in TUI diagnostics panel**

> The TUI (`pgtrickle-tui`) already shows refresh latency sparklines and ST
> status. Add a diagnostics panel (toggle key `D`) showing the fields from
> `scheduler_overhead()`: busy ratio, queue depth, and DF fraction as a
> percentage. Gives operators hands-on observability without needing psql.
>
> Verify: TUI diagnostics panel shows all three scheduler overhead fields;
> `df_refresh_fraction` updates after each DF refresh cycle.
> Dependencies: OPS-3. Schema change: No.

---

### Test Coverage

| ID | Title | Effort | Priority |
|----|-------|--------|----------|
| TEST-1 | Property test: DF-3 recommended threshold always ∈ \[0.01, 0.80\] | S | P0 |
| TEST-2 | Light E2E: self-monitoring create/refresh/teardown full cycle | S | P0 |
| TEST-3 | Upgrade test: `pgt_refresh_history` rows survive `0.19.0 → 0.20.0` | S | P0 |
| TEST-4 | Regression test: DF STs absent from `check_cdc_health()` anomaly list | XS | P1 |
| TEST-5 | Stability test: self-monitoring under 1-h soak with 50 user STs | M | P1 |
| TEST-6 | Light E2E: `setup_self_monitoring()` idempotency (3× call) | XS | P1 |

**TEST-1 — Property test: DF-3 recommended threshold always ∈ \[0.01, 0.80\]**

> Implements CORR-1 as a `proptest` unit test. Generate random
> `(avg_diff_ms: 0.0–100_000.0, avg_full_ms: 0.0–100_000.0, current: 0.01–0.80)`
> triples, compute the DF-3 CASE expression in Rust, assert output ∈ [0.01, 0.80].
> Can be a pure Rust unit test in `src/refresh.rs` alongside the existing
> `compute_adaptive_threshold` tests — no database required.
>
> Verify: `just test-unit` passes; 10,000 proptest iterations with zero failures.
> Dependencies: CORR-1. Schema change: No.
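Assuming the DF-3 recommendation ends in a hard clamp (the heuristic below is illustrative; only the [0.01, 0.80] bound is specified by CORR-1), the TEST-1 property can be sketched in plain Rust as a deterministic sweep:

```rust
// Sketch of the TEST-1 bounds property as a plain-Rust sweep. The real item
// uses `proptest`; the recommendation formula here is a hypothetical stand-in,
// and only the final [0.01, 0.80] clamp reflects the CORR-1 invariant.
fn recommend_threshold(avg_diff_ms: f64, avg_full_ms: f64, current: f64) -> f64 {
    // Hypothetical heuristic: raise the threshold when DIFF is cheap relative
    // to FULL, lower it when DIFF is nearly as expensive as FULL.
    let proposed = if avg_full_ms > 0.0 && avg_diff_ms < 0.25 * avg_full_ms {
        current * 1.5
    } else if avg_full_ms > 0.0 && avg_diff_ms > 0.75 * avg_full_ms {
        current * 0.5
    } else {
        current
    };
    proposed.clamp(0.01, 0.80) // the invariant CORR-1 / TEST-1 asserts
}

fn main() {
    // Deterministic grid sweep over the same input ranges as the proptest.
    let mut checked = 0u32;
    for d in 0..=100 {
        for f in 0..=100 {
            for c in 1..=80 {
                let out = recommend_threshold(
                    d as f64 * 1_000.0,
                    f as f64 * 1_000.0,
                    c as f64 / 100.0,
                );
                assert!((0.01..=0.80).contains(&out));
                checked += 1;
            }
        }
    }
    println!("checked {} triples, all within [0.01, 0.80]", checked);
    // prints: checked 816080 triples, all within [0.01, 0.80]
}
```

In the real test a `proptest` strategy over the same ranges replaces the grid sweep, which is what buys the 10,000 randomized iterations.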
**TEST-2 — Light E2E: self-monitoring create/refresh/teardown full cycle**

> A light E2E test (stock `postgres:18.3` container) that: (1) installs the
> extension, (2) creates 3 user STs, (3) runs 5 refresh cycles to populate
> history, (4) calls `setup_self_monitoring()`, (5) refreshes all DF STs once,
> (6) asserts `self_monitoring_status()` shows 5 active STs, (7) calls
> `teardown_self_monitoring()`, (8) asserts all DF STs are gone.
>
> Verify: test passes in `just test-light-e2e` with zero failed assertions.
> Schema change: No.

**TEST-3 — Upgrade test: `pgt_refresh_history` rows survive `0.19.0 → 0.20.0`**

> The 0.19.0 → 0.20.0 migration adds an index to `pgt_refresh_history`
> (PERF-1). The upgrade must not truncate, reorder, or modify existing
> history rows. Write an upgrade E2E test: deploy 0.19.0, run 10 refreshes,
> run `ALTER EXTENSION pg_trickle UPDATE`, then assert all 10 history rows
> are intact and the new index exists.
>
> Verify: upgrade E2E test passes; `SELECT count(*) FROM pgt_refresh_history`
> is unchanged after the upgrade. Schema change: Yes (index).

**TEST-4 — Regression test: DF STs absent from `check_cdc_health()` anomaly list**

> `pgtrickle.check_cdc_health()` scans all stream tables for CDC anomalies.
> After `setup_self_monitoring()`, DF STs must not appear in the anomaly list
> just because they are refreshed at longer intervals (48–96 s). Their
> schedules must be recognised as intentionally relaxed, not as "falling
> behind".
>
> Verify: E2E test: `setup_self_monitoring()` → wait one full DF cycle →
> assert `check_cdc_health()` returns no anomalies for any `df_` table.
> Dependencies: DF-F4. Schema change: No.

**TEST-5 — Stability test: self-monitoring under 1-h soak with 50 user STs**

> Extends DF-D4. Runs 50 user STs + 5 DF STs for 1 hour under steady insert
> load (1 000 rows/min across all sources).
> Assertions: (a) all DF STs remain ACTIVE, (b) no OOM or background-worker
> crash, (c) DF-1 avg refresh duration < 5 s throughout, (d)
> `pgtrickle.self_monitoring_status()` shows 5 active STs at the end of the
> run.
>
> Verify: soak test passes with all four assertions. Dependencies: DF-D4,
> SCAL-1. Schema change: No.

**TEST-6 — Light E2E: `setup_self_monitoring()` idempotency (3× call)**

> Implements STAB-1 as a light E2E test. Call `setup_self_monitoring()` three
> consecutive times in the same session. Assert: no errors, exactly five
> `df_` stream tables in `pgt_stream_tables`, and no duplicate triggers in
> `pg_trigger` for the history table.
>
> Verify: test passes in `just test-light-e2e`;
> `SELECT count(*) FROM pgtrickle.pgt_stream_tables WHERE pgt_name LIKE 'df_%'`
> returns 5 after all three calls. Dependencies: STAB-1. Schema change: No.

---

### Conflicts & Risks

1. **PERF-1 (index addition) requires a migration script change.** Adding
   `CREATE INDEX CONCURRENTLY` to the 0.19.0 → 0.20.0 migration must be
   tested with `just check-upgrade-all`. `CONCURRENTLY` cannot run inside a
   transaction block — the migration must issue it outside the default
   single-transaction DDL wrapper.
2. **UX-3 (NOTIFY on anomaly) fires from a post-refresh path.** If the
   `pg_notify()` call fails (e.g., payload too large), it must not roll back
   the DF-2 refresh. Wrap the notify in a
   `BEGIN … EXCEPTION WHEN OTHERS THEN NULL; END` block, or fire it from a
   deferred trigger.
3. **STAB-3 (DROP EXTENSION cycle) requires DF STs to be extension-owned or
   cleanly unregistered.** If DF STs are not extension-owned objects,
   `DROP EXTENSION CASCADE` will not drop them. Either register them as
   extension members or document that `teardown_self_monitoring()` must be
   called before `DROP EXTENSION`.
4. **TEST-5 (soak test) overlaps with the existing soak test in CI.** Add it
   to the daily `stability-tests.yml` workflow rather than `ci.yml` to avoid
   extending PR CI time.
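The PERF-1 migration constraint in risk 1 amounts to a statement like the following, issued outside any transaction block. The index name is an illustrative assumption; PERF-1 specifies only the `(pgt_id, start_time)` key.

```sql
-- Sketch of the PERF-1 migration step. CREATE INDEX CONCURRENTLY fails with
-- "cannot run inside a transaction block" if wrapped in BEGIN/COMMIT, so the
-- migration runner must emit it as a standalone statement.
CREATE INDEX CONCURRENTLY IF NOT EXISTS pgt_refresh_history_pgt_id_start_idx
    ON pgtrickle.pgt_refresh_history (pgt_id, start_time);
```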
   Mark it with `#[ignore]` and trigger it via `just test-soak`.
5. **CORR-5 / PERF-4 interaction.** The
   `start_time > now() - interval '1 hour'` boundary and the index depend on
   the planner choosing an index range scan. On very busy deployments where
   the cardinality estimate is off, the planner may prefer a seq scan.
   Consider adding `SET enable_seqscan = off` inside the DF stream table
   queries if plan stability is a concern.
6. **PERF-6 (columnar tracking) is a schema change — deferred twice already.**
   The `changed_columns` column addition to all change-buffer tables requires
   a migration script. Gate it strictly behind the default
   `pg_trickle.columnar_tracking = off`. If capacity is tight, PERF-6 can be
   cut from v0.20.0 without affecting any other item — it shares no code
   paths with the DF pipeline.
7. **OPS-2 (`check_cdc_health()` enrichment) has a fallback requirement.**
   When `setup_self_monitoring()` has not been called, the function must fall
   back to the old full-scan path without error. Guard with a catalog check
   for the existence of `df_cdc_buffer_trends` before querying it.
8. **OPS-4 (`explain_dag()`) output size.** At 100+ user STs the Mermaid
   output may exceed typical terminal width. Offer `format => 'dot'` and
   `limit => N` arguments to constrain output. Default to
   `format => 'mermaid'` with a `NOTICE` when the DAG has > 20 nodes.
9. **OPS-6 (workload-aware poll) writes to the scheduler hot path.** The
   `compute_adaptive_poll_ms()` function is called on every scheduler tick.
   The DF-5 read must be a single O(1) catalog lookup (latest row only), not
   a full table scan. Guard with `ORDER BY collected_at DESC LIMIT 1`. If the
   DF-5 table does not exist (self-monitoring not set up), fall back to the
   old backoff logic without error.
10. **DASH-1 (Grafana) depends on postgres-exporter SQL queries.** The
    dashboard panels use custom SQL collectors in the postgres-exporter
    config.
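The risk-9 hot-path guard could look like this sketch. `collected_at` and `overlap_count` follow the item text; treating DF-5 as `df_scheduling_interference` with these exact columns is an assumption.

```sql
-- Sketch of the OPS-6 scheduler-tick read: fetch only the latest DF-5 row.
-- With an index on collected_at this is a single-row lookup, not a scan.
SELECT overlap_count, collected_at
FROM pgtrickle.df_scheduling_interference
ORDER BY collected_at DESC
LIMIT 1;
```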
    Verify that the `monitoring/` docker-compose already mounts the query
    config; if not, add a `pg_trickle_df_queries.yaml` collector file
    alongside the existing exporter config.
11. **DBT-1 macro idempotency.** The `pgtrickle_enable_monitoring` macro
    calls `setup_self_monitoring()` on every `dbt run`. Document that this is
    intentionally safe (STAB-1) and adds < 5 ms overhead per run.

> **v0.20.0 total: ~3–4 weeks**

**Exit criteria:**

- [x] DF-F1: `pgt_refresh_history` receives CDC INSERT triggers when `create_stream_table()` is called
- [x] DF-F2: `df_efficiency_rolling` created and refreshes correctly in DIFFERENTIAL mode
- [x] DF-F3: DF-1 output matches `refresh_efficiency()` results on synthetic history
- [x] DF-F4: `setup_self_monitoring()` creates all five `df_*` stream tables in one call
- [x] DF-F5: `teardown_self_monitoring()` drops all `df_*` tables cleanly with no orphaned triggers
- [x] DF-A1: `df_anomaly_signals` created and detects 3× duration spikes
- [x] DF-A2: `df_threshold_advice` provides HIGH-confidence recommendations after ≥ 20 refresh cycles
- [x] DF-A3: DAG ensures DF-1 refreshes before DF-2 and DF-3 in every scheduler tick
- [x] DF-C1: `df_cdc_buffer_trends` created (FULL or DIFFERENTIAL mode)
- [x] DF-C2: `df_scheduling_interference` detects overlapping concurrent refreshes
- [x] DF-G1: `pg_trickle.self_monitoring_auto_apply` GUC registered with default `off`
- [x] DF-G2: Auto-apply adjusts threshold with ≥ 1 confirmed change in E2E test
- [x] DF-G5: Rate limiting verified — no more than 1 change per ST per 10 minutes
- [x] DF-D3: Suspending all `df_*` STs does not affect control-plane operation
- [x] CORR-1: `df_threshold_advice` output always within [0.01, 0.80] (property test)
- [x] CORR-2: No false-positive DURATION_SPIKE on first-ever refresh of a new ST
- [x] CORR-3: `avg_change_ratio` is NULL or in [0, 1] for zero-delta sources
- [x] CORR-4: Only INSERT triggers (no UPDATE/DELETE) on `pgt_refresh_history`
- [x] STAB-1: `setup_self_monitoring()` called 3× produces no errors and no duplicates
- [x] STAB-2: Auto-apply worker logs WARNING (not panic) when ALTER target disappears
- [x] STAB-3: DROP EXTENSION + CREATE EXTENSION + `setup_self_monitoring()` cycle works cleanly
- [x] PERF-1: `pgt_refresh_history(pgt_id, start_time)` index exists and is used by DF queries
- [x] PERF-2: DF-1 read ≥ 5× faster than `refresh_efficiency()` at 10 K history rows
- [x] UX-1: `pgtrickle.self_monitoring_status()` returns correct status for all five DF STs
- [x] UX-2: `setup_self_monitoring()` emits warm-up NOTICE when history has < 50 rows
- [x] UX-3: `pg_trickle_alert` NOTIFY received within one DF cycle after a 3× duration spike
- [x] TEST-1: Proptest for DF-3 threshold bounds passes 10,000 iterations
- [x] TEST-2: Light E2E full-cycle test passes
- [x] TEST-3: Upgrade E2E: history rows intact and index present after `0.19.0 → 0.20.0`
- [x] TEST-4: `check_cdc_health()` reports no anomalies for `df_*` tables after setup
- [x] OPS-1: `recommend_refresh_mode()` returns `mode` ∈ `{'DIFFERENTIAL','FULL','AUTO'}` and `confidence` ∈ `{'HIGH','MEDIUM','LOW'}`
- [x] OPS-2: `check_cdc_health()` returns a spill-risk alert when buffer growth rate extrapolates to breach the threshold within 2 cycles
- [x] OPS-3: `scheduler_overhead()` returns non-NULL fields after ≥ 5 refresh cycles; `df_refresh_fraction < 0.01` in soak test
- [x] OPS-4: `explain_dag()` output contains all five `df_*` nodes after `setup_self_monitoring()`
- [x] OPS-5: `sql/self_monitoring_setup.sql` executes without errors on a fresh install
- [x] PERF-5: Concurrent history purge + DF CDC INSERT produces no lock-wait timeouts in soak test
- [x] PERF-6: `changed_columns` bitmask stored in change buffer for UPDATE rows when `columnar_tracking = on` (if included)
- [x] OPS-6: Soak test shows lower `overlap_count` in DF-5 with workload-aware poll enabled vs disabled
- [x] DASH-1: `docker compose up` in `monitoring/` loads the pg_trickle_self_monitoring dashboard; all 5 panels show data
- [x] DBT-1: `pgtrickle_enable_monitoring` macro runs twice without error; `self_monitoring_status()` shows 5 active STs after both calls
- [x] UX-8: `df_threshold_advice.sla_breach_risk = true` when `avg_diff_ms > freshness_deadline_ms` on synthetic data
- [x] Extension upgrade path tested (`0.19.0 → 0.20.0`)
- [x] `just check-version-sync` passes

---