[← Back to Blog Index](README.md)

# Error Budgets for Stream Tables

## SRE-style freshness monitoring with p50/p99 latency, staleness tracking, and budget consumption

---

Your team has an SLA: "the dashboard must reflect data no older than 30 seconds." You've set the stream table schedule to 5 seconds. That should leave plenty of headroom. But does it?

What if the stream table occasionally takes 20 seconds to refresh because of a complex join? What if it fails 3 times in a row and the scheduler suspends it? What if the change buffer grows large enough to cause a FULL fallback?

You won't know unless you measure. pg_trickle's `sla_summary()` function provides SRE-style metrics: percentile latencies, freshness lag, error counts, and an error budget that tells you how much headroom you have before your SLA is violated.

---

## The SLA Summary API

```sql
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');
```

```
   stream_table    | p50_refresh_ms | p99_refresh_ms | freshness_lag_s | error_count | error_budget_pct | window
-------------------+----------------+----------------+-----------------+-------------+------------------+--------
 dashboard_metrics |            4.8 |           23.1 |             2.3 |           1 |             94.2 | 1h
```

### What Each Metric Means

**p50_refresh_ms:** The median refresh duration over the measurement window. If this is close to your schedule interval, you're refreshing slower than you're scheduling.

**p99_refresh_ms:** The 99th-percentile refresh duration. This is your worst-case performance (excluding true outliers). If your schedule is 5 seconds and p99 is 23 seconds, 1% of your refreshes are taking nearly 5× longer than expected.

**freshness_lag_s:** The current staleness — time since the last successful refresh completed. If this exceeds your SLA, the data is stale right now.

**error_count:** The number of failed refreshes in the measurement window. Each failure means one missed refresh cycle.

**error_budget_pct:** The remaining error budget as a percentage. This is the key metric:

$$\text{error\_budget\_pct} = 100 \times \left(1 - \frac{\text{actual\_staleness\_violations}}{\text{allowed\_staleness\_violations}}\right)$$

At 100%, you've had zero violations. At 0%, you've exhausted your budget. Below 0%, you're in breach.

---

## Defining SLAs

pg_trickle doesn't enforce SLAs — it measures compliance. The SLA is defined by your team's freshness requirement. The `sla_summary()` function accepts an optional window parameter:

```sql
-- Last hour (default)
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics');

-- Last 24 hours
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', window => '24h');

-- Last 7 days
SELECT * FROM pgtrickle.sla_summary('dashboard_metrics', window => '7d');
```

For a 30-second freshness SLA measured over 1 hour:

- There are 120 expected refresh opportunities (one per 30-second interval).
- If 6 of those opportunities were missed (the refresh took too long or failed), that's a 5% violation rate → 95% error budget remaining.

In this example the allowed-violation denominator in the formula above is the full 120 opportunities; a stricter SLO that permits only a fraction of them as violations burns the budget correspondingly faster.
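To make the arithmetic concrete, here's a sketch that reproduces the calculation by hand from the refresh history. It assumes the `get_refresh_history()` columns shown later in this post (`status`, `duration_ms`, `created_at`) and counts an opportunity as missed if the refresh failed or overran the 30-second target; `sla_summary()`'s internal definition of a violation may differ.

```sql
-- Hand-rolled error budget for a 30 s freshness SLA over the last hour.
-- Column names come from the get_refresh_history() examples below;
-- the "failed or too slow" violation rule is an assumption.
WITH missed AS (
    SELECT COUNT(*) AS n
    FROM pgtrickle.get_refresh_history('dashboard_metrics')
    WHERE created_at > NOW() - INTERVAL '1 hour'
      AND (status != 'success' OR duration_ms > 30000)  -- failed or overran 30 s
)
SELECT
    n                                          AS missed_opportunities,
    3600 / 30                                  AS total_opportunities,  -- 120
    ROUND(100.0 * (1 - n / (3600 / 30.0)), 1)  AS error_budget_pct      -- 6 missed → 95.0
FROM missed;
```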
---

## Alerting on Error Budget

Combine `sla_summary()` with reactive subscriptions or pg_cron to alert when the budget is low:

```sql
-- pg_cron: check the error budget every 5 minutes.
-- The job body and the nested DO block need distinct dollar-quote tags
-- ($job$ vs. $check$); reusing $$ for both would end the string early.
SELECT cron.schedule('sla-check', '*/5 * * * *', $job$
    DO $check$
    DECLARE
        budget NUMERIC;
    BEGIN
        SELECT error_budget_pct INTO budget
        FROM pgtrickle.sla_summary('dashboard_metrics');

        -- Check the critical threshold first so each run fires one alert.
        IF budget < 10 THEN
            PERFORM pg_notify('sla_critical',
                format('dashboard_metrics error budget at %s%%', budget));
        ELSIF budget < 50 THEN
            PERFORM pg_notify('sla_warning',
                format('dashboard_metrics error budget at %s%%', budget));
        END IF;
    END;
    $check$;
$job$);
```

A listener picks up the NOTIFY and routes it to PagerDuty, Slack, or your alerting system.

---

## Setting Meaningful SLAs

Different stream tables serve different purposes. A real-time fraud detection table and a weekly summary report shouldn't have the same SLA. Recommended tiers:

| Tier | Freshness Target | Example |
|------|------------------|---------|
| Critical | < 5 seconds | Fraud detection, inventory levels, live pricing |
| Standard | < 30 seconds | Dashboards, operational metrics, search indexes |
| Background | < 5 minutes | Daily reports, batch analytics, archival |

For each tier, set the stream table schedule to ~⅓ of the freshness target. This gives 3 refresh attempts before the SLA is breached:

| Tier | Freshness Target | Schedule |
|------|------------------|----------|
| Critical | 5s | 1–2s |
| Standard | 30s | 10s |
| Background | 5m | 1–2m |

---

## Diagnosing Budget Consumption

When the error budget is dropping, diagnose the cause:

### 1. Check refresh history

```sql
SELECT refresh_id, refresh_mode, duration_ms, rows_changed, status, error_message
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE status != 'success' OR duration_ms > 10000
ORDER BY refresh_id DESC
LIMIT 20;
```

Look for:

- `status = 'error'` — failed refreshes. Check `error_message`.
- `duration_ms` spikes — slow refreshes that may exceed the SLA window.
- `refresh_mode = 'FULL'` — fallbacks from DIFFERENTIAL to FULL, which are usually slower.

### 2. Check for mode fallbacks

```sql
SELECT
    COUNT(*) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS differential_count,
    COUNT(*) FILTER (WHERE refresh_mode = 'FULL')         AS full_count,
    AVG(duration_ms) FILTER (WHERE refresh_mode = 'DIFFERENTIAL') AS avg_diff_ms,
    AVG(duration_ms) FILTER (WHERE refresh_mode = 'FULL')         AS avg_full_ms
FROM pgtrickle.get_refresh_history('dashboard_metrics')
WHERE created_at > NOW() - INTERVAL '1 hour';
```

If FULL refreshes are frequent and slow, investigate why DIFFERENTIAL is falling back (change ratio too high, spilling, etc.).

### 3. Check change buffer sizes

```sql
SELECT * FROM pgtrickle.pgt_cdc_status WHERE pgt_name = 'dashboard_metrics';
```

Large change buffers → larger deltas → slower DIFFERENTIAL refreshes → potential SLA violations.

---

## Error Budget Across All Stream Tables

For a system-wide view:

```sql
SELECT
    s.name,
    s.schedule,
    sla.p50_refresh_ms,
    sla.p99_refresh_ms,
    sla.freshness_lag_s,
    sla.error_budget_pct,
    CASE
        WHEN sla.error_budget_pct >= 95 THEN 'healthy'
        WHEN sla.error_budget_pct >= 50 THEN 'warning'
        ELSE 'critical'
    END AS status
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla
ORDER BY sla.error_budget_pct ASC;
```

This ranks all stream tables by error budget health. Tables at the top of the list need attention.
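If you check this regularly, the ranking query is worth freezing into a view so dashboards and ad-hoc sessions share one entry point. A minimal sketch; the view name `pgt_sla_overview` is illustrative, not part of pg_trickle:

```sql
-- Hypothetical convenience view over the system-wide ranking query above.
CREATE VIEW pgt_sla_overview AS
SELECT
    s.name,
    s.schedule,
    sla.error_budget_pct,
    sla.freshness_lag_s,
    CASE
        WHEN sla.error_budget_pct >= 95 THEN 'healthy'
        WHEN sla.error_budget_pct >= 50 THEN 'warning'
        ELSE 'critical'
    END AS status
FROM pgtrickle.pgt_stream_tables s
CROSS JOIN LATERAL pgtrickle.sla_summary(s.name) sla;

-- Anything not healthy, worst first:
SELECT name, error_budget_pct
FROM pgt_sla_overview
WHERE status != 'healthy'
ORDER BY error_budget_pct;
```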
---

## Prometheus Integration

pg_trickle exports SLA metrics as Prometheus gauges:

```
pg_trickle_refresh_p50_ms{stream_table="dashboard_metrics"} 4.8
pg_trickle_refresh_p99_ms{stream_table="dashboard_metrics"} 23.1
pg_trickle_freshness_lag_seconds{stream_table="dashboard_metrics"} 2.3
pg_trickle_error_budget_pct{stream_table="dashboard_metrics"} 94.2
```

Use these in Grafana dashboards and alerting rules:

```yaml
# Prometheus alert rule
- alert: StreamTableSLABreach
  expr: pg_trickle_error_budget_pct < 20
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Stream table {{ $labels.stream_table }} error budget below 20%"
```

---

## Summary

`sla_summary()` gives you SRE-style visibility into stream table health: percentile latencies, freshness lag, error counts, and error budget.

Set SLAs per tier (critical/standard/background). Monitor the error budget. Alert when it drops below a threshold. Diagnose with refresh history, mode fallback analysis, and change buffer checks.

The goal: know you're meeting your freshness SLA *before* a customer asks why the dashboard is stale. The error budget tells you exactly how much headroom you have left.
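And if you want to watch the budget alerts from the pg_cron job above before any external routing exists, a plain `LISTEN` in a `psql` session is enough:

```sql
-- Subscribe to the channels the sla-check job notifies on.
LISTEN sla_warning;
LISTEN sla_critical;

-- psql prints notifications as they arrive, e.g. (payload illustrative):
--   Asynchronous notification "sla_warning" with payload
--   "dashboard_metrics error budget at 42.0%" received from server process with PID 12345.
```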