{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "name": "provsql-studio",
   "display_name": "ProvSQL (SQL)",
   "language": "sql"
  },
  "language_info": {
   "name": "sql"
  },
  "provsql": {
   "scheme": "semiring",
   "database": "cs6",
   "generated_from": "doc/source/user/casestudy6.rst"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case Study: City Air-Quality Sensor Network\n",
    "\n",
    "This case study demonstrates ProvSQL's continuous-distribution surface (see the chapter on continuous distributions) end-to-end through ProvSQL Studio (see the Studio chapter). It is the first case study driven primarily by Studio rather than `psql`: random variables benefit far more from interactive visualisation – PDFs, CDFs, mixture DAG layouts, conditional histograms, simplifier before-vs-after – than from text-mode output, and the workflow below makes the rewriter, the simplifier, the analytic and Monte-Carlo paths, and conditional inference all visible in the canvas.\n",
    "\n",
    "## The Scenario\n",
    "\n",
    "A municipal observatory operates a small air-quality sensor network. Sensors of three different vendors report a $`PM_{2.5}`$ concentration (*fine particulate matter*, i.e. airborne particles with aerodynamic diameter at most 2.5 μm, expressed in micrograms per cubic metre) on a fixed schedule. The sensors differ in calibration and noise characteristics:\n",
    "\n",
    "- high-end units report `Normal(μ, σ)` with small σ;\n",
    "- low-cost units report `Uniform[μ−δ, μ+δ]` over a small window;\n",
    "- a drift-prone unit reports `Exponential(λ)` while its internal hardware self-tests cycle;\n",
    "- a multi-pass aggregating unit reports `Erlang(k, λ)` over the pass count.\n",
    "\n",
    "A reference station with a calibrated lab-grade instrument contributes deterministic readings.\n",
    "\n",
    "Regulatory categories partition the value axis: *Good* below 12, *Moderate* between 12.1 and 35, *Unhealthy* above 35.1 (loosely following the US EPA AQI breakpoints for PM2.5 in their pre-2024 form, simplified to three tiers). Each station has a Bernoulli probability of being in calibration on a given day. A separate batch table of *historical* readings carries the same shape so cross-batch queries via `UNION ALL` are meaningful.\n",
    "\n",
    "Your tasks:\n",
    "\n",
    "- inspect the per-row distributions and the rewriter's effect on threshold queries;\n",
    "- compute the probability that each station's reading exceeds an *Unhealthy* threshold, exercising the planner-hook rewrite for `WHERE reading > 35`;\n",
    "- model calibration uncertainty as a Bernoulli mixture and inspect the resulting `gate_mixture` shape;\n",
    "- aggregate per-district readings and watch the simplifier fold the mixture cascade;\n",
    "- run conditional inference (`E[reading | reading > 35]`) and see the closed-form truncated-distribution mean against the unconditional one;\n",
    "- filter on the expected value of an aggregated random variable, combine today's and yesterday's batches with `UNION ALL`, and compare probability methods (`'independent'` vs `'monte-carlo'` vs `'tree-decomposition'`) side by side.\n",
    "\n",
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE IF EXISTS readings CASCADE;\n",
    "DROP TABLE IF EXISTS historical_readings CASCADE;\n",
    "DROP TABLE IF EXISTS stations CASCADE;\n",
    "DROP TABLE IF EXISTS calibration_status CASCADE;\n",
    "DROP TABLE IF EXISTS categories CASCADE;"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Four monitoring stations across two districts.\n",
    "DROP TABLE IF EXISTS stations CASCADE;\n",
    "CREATE TABLE stations (\n",
    "  id        TEXT PRIMARY KEY,\n",
    "  name      TEXT NOT NULL,\n",
    "  district  TEXT NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE stations;\n",
    "INSERT INTO stations VALUES\n",
    "  ('s1', 'City Centre',         'centre'),\n",
    "  ('s2', 'Riverside Park',      'centre'),\n",
    "  ('s3', 'Industrial Estate',   'east'),\n",
    "  ('s4', 'Suburban Reference',  'east');"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT add_provenance('stations');"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Per-station calibration probability (Bernoulli).  The high-end and\n",
    "-- multi-pass units are well calibrated; the low-cost and drift-prone\n",
    "-- ones less so; the reference station is always in-spec.\n",
    "DROP TABLE IF EXISTS calibration_status CASCADE;\n",
    "CREATE TABLE calibration_status (\n",
    "  station_id  TEXT PRIMARY KEY REFERENCES stations(id),\n",
    "  p           DOUBLE PRECISION NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE calibration_status;\n",
    "INSERT INTO calibration_status VALUES\n",
    "  ('s1', 0.95),\n",
    "  ('s2', 0.70),\n",
    "  ('s3', 0.60),\n",
    "  ('s4', 1.00);"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT add_provenance('calibration_status');"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Regulatory categories (deterministic, no provenance).\n",
    "DROP TABLE IF EXISTS categories CASCADE;\n",
    "CREATE TABLE categories (\n",
    "  name  TEXT PRIMARY KEY,\n",
    "  lo    DOUBLE PRECISION NOT NULL,\n",
    "  hi    DOUBLE PRECISION NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE categories;\n",
    "INSERT INTO categories VALUES\n",
    "  ('Good',       0.0,    12.0),\n",
    "  ('Moderate',   12.1,   35.0),\n",
    "  ('Unhealthy',  35.1,   1000.0);"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Today's readings.  Five rows of pm25, one per (station, sample),\n",
    "-- exercising every distribution family plus the implicit numeric cast.\n",
    "DROP TABLE IF EXISTS readings CASCADE;\n",
    "CREATE TABLE readings (\n",
    "  id          INTEGER PRIMARY KEY,\n",
    "  station_id  TEXT NOT NULL REFERENCES stations(id),\n",
    "  ts          TIMESTAMP NOT NULL,\n",
    "  pm25        provsql.random_variable NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE readings;\n",
    "INSERT INTO readings (id, station_id, ts, pm25) VALUES\n",
    "  (1, 's1', '2026-05-12 08:00', provsql.normal(28.0, 2.0)),       -- high-end Gaussian\n",
    "  (2, 's2', '2026-05-12 08:00', provsql.uniform(10.0, 22.0)),     -- low-cost uniform window\n",
    "  (3, 's3', '2026-05-12 08:00', provsql.exponential(0.04)),       -- drift-prone, mean = 25\n",
    "  (4, 's4', '2026-05-12 08:00', 15.0),                            -- reference (implicit cast)\n",
    "  (5, 's1', '2026-05-12 09:00', provsql.normal(40.0, 4.0)),       -- high-end, into Unhealthy\n",
    "  (6, 's2', '2026-05-12 09:00', provsql.uniform(12.0, 24.0)),     -- low-cost, into Moderate\n",
    "  (7, 's3', '2026-05-12 09:00', provsql.erlang(3, 0.1)),          -- multi-pass Erlang, mean = 30\n",
    "  (8, 's4', '2026-05-12 09:00', 16.5);                            -- reference"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT add_provenance('readings');"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Yesterday's batch, used for the UNION ALL step.  Same shape, slightly\n",
    "-- different distributions (a heatwave bumped the means).\n",
    "DROP TABLE IF EXISTS historical_readings CASCADE;\n",
    "CREATE TABLE historical_readings (\n",
    "  id          INTEGER PRIMARY KEY,\n",
    "  station_id  TEXT NOT NULL REFERENCES stations(id),\n",
    "  ts          TIMESTAMP NOT NULL,\n",
    "  pm25        provsql.random_variable NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE historical_readings;\n",
    "INSERT INTO historical_readings (id, station_id, ts, pm25) VALUES\n",
    "  (1, 's1', '2026-05-11 08:00', provsql.normal(34.0, 2.5)),\n",
    "  (2, 's2', '2026-05-11 08:00', provsql.uniform(15.0, 28.0)),\n",
    "  (3, 's3', '2026-05-11 08:00', provsql.exponential(0.03)),\n",
    "  (4, 's4', '2026-05-11 08:00', 18.0),\n",
    "  (5, 's1', '2026-05-11 09:00', provsql.normal(42.0, 3.0)),\n",
    "  (6, 's2', '2026-05-11 09:00', provsql.uniform(20.0, 35.0)),\n",
    "  (7, 's3', '2026-05-11 09:00', provsql.erlang(3, 0.08)),\n",
    "  (8, 's4', '2026-05-11 09:00', 19.5);"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT add_provenance('historical_readings');"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- A provenance mapping so the Studio eval-strip's sr_formula and\n",
    "-- PROV-XML export can label leaves with station names rather than\n",
    "-- raw UUIDs.\n",
    "DROP TABLE IF EXISTS station_mapping;\n",
    "DROP TABLE IF EXISTS station_mapping CASCADE;\n",
    "CREATE TABLE station_mapping AS\n",
    "  SELECT s.name AS value, r.provsql AS provenance\n",
    "  FROM readings r JOIN stations s ON s.id = r.station_id;\n",
    "SELECT remove_provenance('station_mapping');\n",
    "CREATE INDEX ON station_mapping(provenance);"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The script creates the schema below and seeds the random-variable readings via the constructors documented in the chapter on continuous distributions. It is five tables:\n",
    "\n",
    "- `stations(id, name, district)` – four monitoring stations across two districts, provenance-tracked.\n",
    "- `readings(station_id, ts, pm25 random_variable)` – one `pm25` reading per station per timestamp; the `random_variable` carries the per-station noise model (normal, uniform, exponential, erlang, or a deterministic lifted from the reference station).\n",
    "- `calibration_status(station_id, p)` – Bernoulli probability that each station is in calibration on the day of interest.\n",
    "- `categories(name, lo, hi)` – three regulatory categories (*Good* / *Moderate* / *Unhealthy*) keyed by their interval bounds.\n",
    "- `historical_readings(...)` – same shape as `readings`, populated from yesterday's batch.\n",
    "\n",
    "The schema panel lists the fixture's six relations: the four provenance-tracked tables (`stations`, `calibration_status`, `readings`, `historical_readings`) carry the purple `prov` pill, `categories` is plain, and `station_mapping` is tagged `mapping`. The `pm25` column on `readings` and `historical_readings` is flagged with a terracotta `rv` pill: a heads-up that comparison and arithmetic operators on this column are intercepted by the planner hook and lifted into provenance gates, so a query like `pm25 > 35` produces a circuit rather than a Boolean.\n",
    "\n",
    "## Step 1: Inspect a Noisy Reading\n",
    "\n",
    "In the Studio query box:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT id, ts, pm25\n",
    "FROM readings\n",
    "WHERE station_id = 's1'\n",
    "ORDER BY ts"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result table renders `pm25` as a clickable `random_variable` cell carrying the underlying gate UUID. Click into a row's `pm25`: Studio switches to Circuit mode and renders the `gate_rv` leaf with the distribution-kind initial in the circle (*N* for a Normal, *U* for Uniform, *Exp* for Exponential, *Erl* for Erlang). Pick *Distribution profile* from the *Distribution* group of the eval strip and click `Run`: the panel returns $`\\mu`$ and $`\\sigma^2`$ headline stats and an inline histogram with a PDF/CDF toggle.\n",
    "\n",
    "The histogram is backed server-side by `rv_histogram`; pinning `provsql.monte_carlo_seed` in the Config panel (under *Provenance*) makes the shape reproducible across re-runs.\n",
    "\n",
    "## Step 2: A First Probabilistic Threshold\n",
    "\n",
    "The $`PM_{2.5}`$ *Unhealthy* category begins at 35.1. Find the rows whose reading might cross it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT id, station_id, ts\n",
    "FROM readings\n",
    "WHERE pm25 > 35"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `pm25` is a random variable, the comparison is not a yes/no test: it stands for the *event* \"this reading exceeds 35\", and ProvSQL attaches that event to each row's provenance.\n",
    "\n",
    "Click into a result row's auto-added `provsql` cell. Circuit mode shows the Boolean wrapper (a `gate_times` over the row's input token and the `gate_cmp`); the cmp's child link reaches into the `gate_rv` from Step 1.\n",
    "\n",
    "The eval strip's `probability_evaluate` entry exposes the five compiled methods (see the chapter on probabilities). Pick `monte-carlo` and set `n = 10000`; the panel returns the probability with a Hoeffding confidence band. Pin `provsql.monte_carlo_seed = 42` in the Config panel and re-run: the result is now identical across runs. Toggle the seed back to `-1` and re-run to see the band shift between runs.\n",
    "\n",
    "## Step 3: The Simplifier in Action\n",
    "\n",
    "The planner hook emits the comparator as a raw `gate_cmp` regardless of what its operands look like. A *simplifier* pass then folds comparators whose answer can be decided from the operand support alone, for example, `U(10, 22) > 35` is universally false because the uniform's upper bound is below the threshold. The fold is controlled by `provsql.simplify_on_load` (default on), which the Config panel exposes under *Provenance*.\n",
    "\n",
    "Click into row 2's auto-added `provsql` cell from the Step 2 result (station `s2`, `pm25 ~ U(10, 22)`). With `provsql.simplify_on_load` on, the canvas shows a single `𝟘` (zero) gate: the simplifier resolved the comparator to a constant-false leaf and dropped the whole subtree. Toggle the GUC off in the Config panel and click the cell again: the canvas now shows the raw construction shape, a `gate_times` (`⊗`) over the row's input token `ι` and a `gate_cmp` (`>`) whose children are the `U(10, 22)` leaf and the constant `35`. Both views are semantically identical; the simplified view is what the semiring evaluators and the Monte-Carlo sampler actually consume.\n",
    "\n",
    "## Step 4: Calibration via Mixtures\n",
    "\n",
    "Each station has a probability of being mis-calibrated; a mis-calibrated unit over-reports by 20% (the *reading* it records is `1.2` times the true value). The corrected estimate of the true reading is therefore `pm25` with probability `p` (the station is in spec) and `pm25 / 1.2` with probability `1 - p` (the report needs to be scaled back). Express this as a Bernoulli mixture:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT r.id, r.station_id,\n",
    "       provsql.mixture(cs.p, r.pm25, r.pm25 / 1.2) AS pm25_calibrated\n",
    "FROM readings r JOIN calibration_status cs USING (station_id)\n",
    "WHERE r.station_id = 's1'"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click into a result row's `pm25_calibrated` cell. Circuit mode renders the `gate_mixture` as a `Mix` node with three labelled outgoing edges (`p` / `x` / `y`) matching the SQL constructor's argument order: `p` points to the Bernoulli mixing probability, `x` to the in-spec arm, and `y` to the correction arm.\n",
    "\n",
    "The same node-inspector panel exposes `Distribution profile` on the mixture root. Because station `s1` is in spec 95% of the time, the histogram is dominated by the `N(28, 2)` arm and the out-of-spec `N(23.33, 1.667)` contributes only a small left shoulder rather than a visually distinct second mode; the panel headline reflects this with a mixture mean slightly below 28. To see clear bimodality, re-run the query with a larger calibration error, e.g. replace `r.pm25 / 1.2` with `r.pm25 / 2.0` so the out-of-spec arm folds to `N(14, 1)`, well separated from the in-spec `N(28, 2)`; the two peaks then show up distinctly on the histogram even at the 95%/5% weighting.\n",
    "\n",
    "## Step 5: Aggregation Over Random Variables\n",
    "\n",
    "Compute average $`PM_{2.5}`$ per district:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT s.district,\n",
    "       avg(r.pm25)  AS avg_pm25,\n",
    "       sum(r.pm25)  AS total_pm25\n",
    "FROM readings r JOIN stations s ON s.id = r.station_id\n",
    "GROUP BY s.district"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click into a row's `avg_pm25` cell. Circuit mode shows the `avg` lowering: a `gate_arith(DIV, num, denom)` over two `gate_arith(PLUS, …)` subtrees, each child a per-row `gate_mixture` produced by `rv_aggregate_semimod`. The right child of the outer division is the count of *included* rows under their per-row provenance: rows whose provenance is false contribute the additive identity to both numerator and denominator. Run *Distribution profile* on the root: the panel shows the per-district average as a tight distribution centred at the inclusion-weighted mean.\n",
    "\n",
    "## Step 6: Conditional Inference\n",
    "\n",
    "Re-open the filtered query from Step 2:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT id, station_id, ts, pm25\n",
    "FROM readings\n",
    "WHERE pm25 > 35\n",
    "  AND station_id = 's1'"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Click a result row's `pm25` cell. The eval strip's `Condition on` text input auto-presets to the row's provenance UUID, and the `Conditioned by:` badge underneath the input is active. Pick *Distribution profile* and run: the histogram now shows the *truncated* shape, restricted to the tail above `35`. Pick *Moment* with `k = 1` and `raw`: the panel returns the closed-form Mills-ratio mean of the [truncated normal](https://en.wikipedia.org/wiki/Truncated_normal_distribution), exactly $`\\mu + \\sigma \\cdot\n",
    "\\frac{\\phi(\\alpha)}{1 - \\Phi(\\alpha)}`$ with $`\\alpha = (35 - \\mu)/\\sigma`$. Click the active badge to clear the conditioning; the panel reverts to the unconditional mean $`\\mu`$. Click the muted badge to restore the row provenance.\n",
    "\n",
    "The closed-form truncation table covers Normal (Mills ratio), Uniform (intersected support), and Exponential (memorylessness on a lower bound or finite-interval truncation). For other shapes, the joint circuit between `pm25` and the row's provenance is loaded with shared `gate_rv` leaves correctly coupled, and the conditional moment is estimated by rejection sampling at budget `provsql.rv_mc_samples`.\n",
    "\n",
    "## Step 7: Diagnostic Sampling\n",
    "\n",
    "For raw inspection or downstream analytics, draw samples from the conditional distribution. With the same row pinned and the *Conditioned by* badge active, pick *Sample* from the *Distribution* group; set `n = 200` and run. The result panel shows a six-value inline preview with a \"show full list\" expander; clicking it dumps all 200 samples.\n",
    "\n",
    "For shapes that fall outside the closed-form table the sampler falls back to rejection sampling at the `provsql.rv_mc_samples` budget; if the conditioning event is so unlikely that fewer than `n` samples land inside that budget, the panel surfaces a hint pointing at the GUC, e.g. *Only 47 samples accepted within budget 10000; widen* `provsql.rv_mc_samples` *or loosen the conditioning*. Re-running with a larger budget (set `rv_mc_samples = 50000` in the Config panel) recovers the full batch.\n",
    "\n",
    "## Step 8: Combining Batches via UNION\n",
    "\n",
    "Both batches share the same id space (rows `1` through `8`, one per `(station, timestamp)` slot), so a `UNION` (without `ALL`) over `(station_id, id)` deduplicates a slot to a single result row whose provenance combines today's reading and yesterday's reading via the semiring addition. With `WHERE pm25 > 35` lifted on each branch, each contributing row carries a `gate_cmp(pm25 > 35)` of its own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "(SELECT station_id, id FROM readings            WHERE pm25 > 35)\n",
    "UNION\n",
    "(SELECT station_id, id FROM historical_readings WHERE pm25 > 35)\n",
    "ORDER BY station_id, id"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pick a result row's `provsql` cell: Circuit mode shows a `gate_plus` (`⊕`) over the two contributing inputs (today's row and yesterday's row), each carrying its own `gate_cmp(pm25 > 35)` from the lifted `WHERE`. `probability_evaluate(provenance())` on the result gives the probability that *at least one* of the two days produced an Unhealthy reading for that slot. We deliberately keep the `random_variable` `pm25` column out of the `SELECT`: there is no duplicate-elimination semantics for `random_variable`, so a `UNION` over an RV column would have no well-defined meaning.\n",
    "\n",
    "## Step 9: Filtering Grouped Random Variables by Expected Value\n",
    "\n",
    "Filter the per-district aggregates from Step 5 by their expected average. Because `avg` over a `random_variable` column returns a `random_variable` (not an `agg_token`), and `expected` collapses it to a plain `double`, the HAVING qual is deterministic from the planner-hook's perspective; the rewrite leaves it for PostgreSQL to evaluate natively while still adding a `delta(gate_agg)` wrapper to each surviving group's provenance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT s.district, avg(r.pm25) AS avg_pm25\n",
    "FROM readings r JOIN stations s ON s.id = r.station_id\n",
    "GROUP BY s.district\n",
    "HAVING expected(avg(r.pm25)) > 20"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The inner `avg` is recognised as a `random_variable` aggregate (gate_arith DIV over per-row gate_mixture children, as in Step 5); `expected` collapses the distribution to its mean (Monte Carlo here, since the DIV gate has no closed-form evaluator); the `> 20` is a plain comparison on a `double`, so the row survives iff its expected average exceeds the threshold. For the case-study fixture both districts pass (centre at ≈ 25.5, east at ≈ 21.6); clicking either result row's `provsql` cell shows the `delta(gate_agg)` shape, identical to the no-HAVING aggregate from Step 5 but filtered to the surviving groups.\n",
    "\n",
    "## Step 10: Independent vs Monte Carlo\n",
    "\n",
    "For threshold queries whose contributing rows have structurally independent provenance, the `'independent'` probability method (see the chapter on probabilities) is *exact* and far cheaper than Monte Carlo. Compare the three available exact methods against `monte-carlo` on the Step 2 query:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT id,\n",
    "  probability_evaluate(provenance(), 'independent')         AS p_ind,\n",
    "  probability_evaluate(provenance(), 'monte-carlo', '10000') AS p_mc,\n",
    "  probability_evaluate(provenance(), 'tree-decomposition')  AS p_td\n",
    "FROM readings WHERE pm25 > 35"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Studio's eval strip exposes these methods directly; running each method against the same pinned subnode shows the analytic `independent` and `tree-decomposition` returning the same value to full precision, while `monte-carlo` returns a Hoeffding-bounded estimate that tightens as `n` grows.\n",
    "\n",
    "See the chapter on continuous distributions for the full surface and the Studio chapter for the Studio reference."
   ]
  }
 ]
}