[← Back to Blog Index](README.md)

# pg_trickle Monitors Itself

## How the extension eats its own cooking

---

Since v0.20, pg_trickle's internal health metrics are maintained as stream tables. The extension uses itself to monitor itself.

This isn't a gimmick. It's a practical architecture decision: pg_trickle already has an engine for maintaining derived data from source tables. The extension's own operational data lives in PostgreSQL tables. Why build a separate monitoring system when you can point the same engine at its own catalog?

---

## What Self-Monitoring Tracks

When you enable self-monitoring, pg_trickle creates a set of internal stream tables that aggregate operational data:

```sql
-- Enable self-monitoring
SELECT pgtrickle.enable_self_monitoring();
```

This creates stream tables over pg_trickle's own catalog and operational tables:

### Refresh performance by stream table

```sql
-- Automatically created:
-- pgtrickle._self_refresh_stats
-- Query:
SELECT
    st.pgt_name,
    st.refresh_mode,
    COUNT(*) AS refresh_count,
    AVG(h.duration_ms) AS avg_refresh_ms,
    MAX(h.duration_ms) AS max_refresh_ms,
    SUM(h.rows_affected) AS total_rows_affected,
    MAX(h.refreshed_at) AS last_refresh
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.pgt_refresh_history h ON h.pgt_id = st.pgt_id
WHERE h.refreshed_at >= now() - interval '1 hour'
GROUP BY st.pgt_name, st.refresh_mode
```

This stream table tells you, in near real time, how each stream table is performing: average refresh time, max refresh time, total rows processed, and when it last refreshed.

### Change buffer depth

```sql
-- pgtrickle._self_buffer_depth
-- Tracks how much unprocessed change data is queued per source table
SELECT
    source_table,
    COUNT(*) AS pending_changes,
    MIN(captured_at) AS oldest_change,
    now() - MIN(captured_at) AS max_latency
FROM pgtrickle_changes.changes_summary
GROUP BY source_table
```

If `pending_changes` is growing faster than the scheduler can drain it, this stream table surfaces the problem before it becomes a production incident.

### Error rates

```sql
-- pgtrickle._self_error_rates
-- Tracks consecutive errors and error patterns
SELECT
    st.pgt_name,
    st.consecutive_errors,
    st.status,
    h.error_message,
    h.refreshed_at AS last_error_at
FROM pgtrickle.pgt_stream_tables st
JOIN pgtrickle.pgt_refresh_history h ON h.pgt_id = st.pgt_id
WHERE h.success = false
  AND h.refreshed_at >= now() - interval '1 hour'
```

---

## Why This Matters

The typical monitoring setup for a database extension involves:

1. A Prometheus exporter that polls the extension's views
2. Grafana dashboards that visualize the metrics
3. Alertmanager rules that fire when thresholds are breached

This works, but there's a lag: Prometheus scrapes every 15–30 seconds. By the time the alert fires, the problem might have been happening for a minute. Self-monitoring stream tables are maintained continuously, refreshing every 2–5 seconds.
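To make the difference concrete, here's the kind of ad-hoc check you can run directly against the `_self_buffer_depth` table shown above. A sketch only: the 30-second threshold is an arbitrary number for illustration, not anything the extension defines.

```sql
-- Ad-hoc query against a self-monitoring table: find source tables whose
-- oldest unprocessed change has been waiting more than 30 seconds
-- (threshold chosen for illustration). No exporter, no scrape interval,
-- just SQL against a regular table.
SELECT source_table, pending_changes, max_latency
FROM pgtrickle._self_buffer_depth
WHERE max_latency > interval '30 seconds'
ORDER BY max_latency DESC;
```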
And because they're just PostgreSQL tables, you can:

- Query them with arbitrary SQL
- Build other stream tables on top of them (meta-monitoring; see the sketch at the end of this post)
- Subscribe to them via the outbox for real-time alerts
- Expose them to any BI tool that speaks PostgreSQL

### Alerts on monitoring data

```sql
-- Alert when any stream table's refresh time exceeds 500ms
SELECT pgtrickle.create_stream_table(
    'pgtrickle._self_slow_refreshes',
    $$SELECT pgt_name, avg_refresh_ms, max_refresh_ms
      FROM pgtrickle._self_refresh_stats
      WHERE avg_refresh_ms > 500$$,
    schedule => '5s',
    refresh_mode => 'DIFFERENTIAL'
);

-- Enable outbox so the relay can push alerts to Slack/PagerDuty
SELECT pgtrickle.enable_outbox('pgtrickle._self_slow_refreshes');
```

Now when a stream table starts refreshing slowly, the monitoring stream table picks it up within 5 seconds, and the outbox delivers a notification to your alerting system.

---

## The Recursion Question

"If pg_trickle monitors itself with stream tables, who monitors the monitoring stream tables?"

Fair question. The self-monitoring stream tables are maintained by the same scheduler as user-defined stream tables. If the scheduler is completely dead, the monitoring tables aren't updated either.

pg_trickle handles this with a separate code path:

1. **The scheduler's heartbeat** is written directly to a catalog table (not via a stream table). External monitoring (Prometheus, health check endpoints) reads this heartbeat.
2. **Self-monitoring stream tables** track performance and trends — they're for observability, not liveness. The liveness check is the heartbeat.
3. **`pgtrickle.health_check()`** returns a summary that combines both: the heartbeat (is the scheduler running?) and the self-monitoring data (how is it performing?).

```sql
SELECT * FROM pgtrickle.health_check();
-- Returns: scheduler_running, total_stream_tables, active_count,
--          error_count, avg_refresh_ms, max_staleness, ...
```

---

## Integration with Prometheus and Grafana

Self-monitoring doesn't replace Prometheus — it complements it. pg_trickle also exposes metrics in Prometheus format:

```
# Prometheus endpoint (built-in or via pgtrickle-relay)
pgtrickle_refresh_duration_seconds{stream_table="order_totals",mode="DIFFERENTIAL"} 0.012
pgtrickle_refresh_rows_affected{stream_table="order_totals"} 47
pgtrickle_change_buffer_depth{source_table="orders"} 123
pgtrickle_scheduler_lag_seconds 0.3
pgtrickle_stream_table_staleness_seconds{stream_table="order_totals"} 2.1
```

The difference is granularity: Prometheus gives you time-series data at scrape resolution (typically 15s). Self-monitoring stream tables give you real-time aggregates at refresh resolution (1–5s). Use both.

---

## Disabling Self-Monitoring

If you don't need it — or if you're running in a resource-constrained environment — you can disable it:

```sql
SELECT pgtrickle.disable_self_monitoring();
```

This drops the internal stream tables and frees the scheduler slots they were using. The Prometheus metrics and health check endpoints continue working — they don't depend on self-monitoring.

---

## The Dogfooding Effect

Building self-monitoring on top of the same engine forces the pg_trickle team to care about edge cases that only surface under real load. If the monitoring stream tables are slow, that's a bug in the engine. If they produce incorrect results, that's a correctness issue. If they consume too many resources, that's a scheduler efficiency problem.

Every improvement to the self-monitoring stream tables is an improvement to the engine itself.
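One closing sketch, to make the meta-monitoring idea from earlier concrete: the self-monitoring tables are themselves valid stream-table sources, so you can point the engine at its own watchers. The `_self_monitor_health` name and the 100ms threshold below are invented for this example; they aren't part of the extension.

```sql
-- Hypothetical meta-monitoring stream table: watch the watchers by
-- flagging self-monitoring tables whose own refreshes run slow.
-- The name and 100ms threshold are illustrative, not shipped defaults.
SELECT pgtrickle.create_stream_table(
    'pgtrickle._self_monitor_health',
    $$SELECT pgt_name, avg_refresh_ms, last_refresh
      FROM pgtrickle._self_refresh_stats
      WHERE pgt_name LIKE '%\_self\_%'
        AND avg_refresh_ms > 100$$,
    schedule => '30s',
    refresh_mode => 'DIFFERENTIAL'
);
```

And if even this table goes stale, the scheduler heartbeat from the recursion section is still the backstop.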