{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs6", "generated_from": "doc/source/user/casestudy6.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: City Air-Quality Sensor Network\n", "\n", "This case study demonstrates ProvSQL's continuous-distribution surface (see the chapter on continuous distributions) end-to-end through ProvSQL Studio (see the Studio chapter). It is the first case study driven primarily by Studio rather than `psql`: random variables benefit far more from interactive visualisation – PDFs, CDFs, mixture DAG layouts, conditional histograms, simplifier before-vs-after – than from text-mode output, and the workflow below makes the rewriter, the simplifier, the analytic and Monte-Carlo paths, and conditional inference all visible in the canvas.\n", "\n", "## The Scenario\n", "\n", "A municipal observatory operates a small air-quality sensor network. Sensors of three different vendors report a $`PM_{2.5}`$ concentration (*fine particulate matter*, i.e. airborne particles with aerodynamic diameter at most 2.5 μm, expressed in micrograms per cubic metre) on a fixed schedule. The sensors differ in calibration and noise characteristics:\n", "\n", "- high-end units report `Normal(μ, σ)` with small σ;\n", "- low-cost units report `Uniform[μ−δ, μ+δ]` over a small window;\n", "- a drift-prone unit reports `Exponential(λ)` while its internal hardware self-tests cycle;\n", "- a multi-pass aggregating unit reports `Erlang(k, λ)` over the pass count.\n", "\n", "A reference station with a calibrated lab-grade instrument contributes deterministic readings.\n", "\n", "Regulatory categories partition the value axis: *Good* below 12, *Moderate* between 12.1 and 35, *Unhealthy* above 35.1 (loosely following the US EPA AQI breakpoints for PM2.5 in their pre-2024 form, simplified to three tiers). Each station has a Bernoulli probability of being in calibration on a given day. A separate batch table of *historical* readings carries the same shape so cross-batch queries via `UNION ALL` are meaningful.\n", "\n", "Your tasks:\n", "\n", "- inspect the per-row distributions and the rewriter's effect on threshold queries;\n", "- compute the probability that each station's reading exceeds an *Unhealthy* threshold, exercising the planner-hook rewrite for `WHERE reading > 35`;\n", "- model calibration uncertainty as a Bernoulli mixture and inspect the resulting `gate_mixture` shape;\n", "- aggregate per-district readings and watch the simplifier fold the mixture cascade;\n", "- run conditional inference (`E[reading | reading > 35]`) and see the closed-form truncated-distribution mean against the unconditional one;\n", "- filter on the expected value of an aggregated random variable, combine today's and yesterday's batches with `UNION ALL`, and compare probability methods (`'independent'` vs `'monte-carlo'` vs `'tree-decomposition'`) side by side.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS readings CASCADE;\n", "DROP TABLE IF EXISTS historical_readings CASCADE;\n", "DROP TABLE IF EXISTS stations CASCADE;\n", "DROP TABLE IF EXISTS calibration_status CASCADE;\n", "DROP TABLE IF EXISTS categories CASCADE;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Four monitoring stations across two districts.\n", "DROP TABLE IF EXISTS stations CASCADE;\n", "CREATE TABLE stations (\n", " id TEXT PRIMARY KEY,\n", " name TEXT NOT NULL,\n", " district TEXT NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE stations;\n", "INSERT INTO stations VALUES\n", " ('s1', 'City Centre', 'centre'),\n", " ('s2', 'Riverside Park', 'centre'),\n", " ('s3', 'Industrial Estate', 'east'),\n", " ('s4', 'Suburban Reference', 'east');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('stations');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Per-station calibration probability (Bernoulli). The high-end and\n", "-- multi-pass units are well calibrated; the low-cost and drift-prone\n", "-- ones less so; the reference station is always in-spec.\n", "DROP TABLE IF EXISTS calibration_status CASCADE;\n", "CREATE TABLE calibration_status (\n", " station_id TEXT PRIMARY KEY REFERENCES stations(id),\n", " p DOUBLE PRECISION NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE calibration_status;\n", "INSERT INTO calibration_status VALUES\n", " ('s1', 0.95),\n", " ('s2', 0.70),\n", " ('s3', 0.60),\n", " ('s4', 1.00);" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('calibration_status');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Regulatory categories (deterministic, no provenance).\n", "DROP TABLE IF EXISTS categories CASCADE;\n", "CREATE TABLE categories (\n", " name TEXT PRIMARY KEY,\n", " lo DOUBLE PRECISION NOT NULL,\n", " hi DOUBLE PRECISION NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE categories;\n", "INSERT INTO categories VALUES\n", " ('Good', 0.0, 12.0),\n", " ('Moderate', 12.1, 35.0),\n", " ('Unhealthy', 35.1, 1000.0);" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Today's readings. Five rows of pm25, one per (station, sample),\n", "-- exercising every distribution family plus the implicit numeric cast.\n", "DROP TABLE IF EXISTS readings CASCADE;\n", "CREATE TABLE readings (\n", " id INTEGER PRIMARY KEY,\n", " station_id TEXT NOT NULL REFERENCES stations(id),\n", " ts TIMESTAMP NOT NULL,\n", " pm25 provsql.random_variable NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE readings;\n", "INSERT INTO readings (id, station_id, ts, pm25) VALUES\n", " (1, 's1', '2026-05-12 08:00', provsql.normal(28.0, 2.0)), -- high-end Gaussian\n", " (2, 's2', '2026-05-12 08:00', provsql.uniform(10.0, 22.0)), -- low-cost uniform window\n", " (3, 's3', '2026-05-12 08:00', provsql.exponential(0.04)), -- drift-prone, mean = 25\n", " (4, 's4', '2026-05-12 08:00', 15.0), -- reference (implicit cast)\n", " (5, 's1', '2026-05-12 09:00', provsql.normal(40.0, 4.0)), -- high-end, into Unhealthy\n", " (6, 's2', '2026-05-12 09:00', provsql.uniform(12.0, 24.0)), -- low-cost, into Moderate\n", " (7, 's3', '2026-05-12 09:00', provsql.erlang(3, 0.1)), -- multi-pass Erlang, mean = 30\n", " (8, 's4', '2026-05-12 09:00', 16.5); -- reference" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('readings');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Yesterday's batch, used for the UNION ALL step. Same shape, slightly\n", "-- different distributions (a heatwave bumped the means).\n", "DROP TABLE IF EXISTS historical_readings CASCADE;\n", "CREATE TABLE historical_readings (\n", " id INTEGER PRIMARY KEY,\n", " station_id TEXT NOT NULL REFERENCES stations(id),\n", " ts TIMESTAMP NOT NULL,\n", " pm25 provsql.random_variable NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE historical_readings;\n", "INSERT INTO historical_readings (id, station_id, ts, pm25) VALUES\n", " (1, 's1', '2026-05-11 08:00', provsql.normal(34.0, 2.5)),\n", " (2, 's2', '2026-05-11 08:00', provsql.uniform(15.0, 28.0)),\n", " (3, 's3', '2026-05-11 08:00', provsql.exponential(0.03)),\n", " (4, 's4', '2026-05-11 08:00', 18.0),\n", " (5, 's1', '2026-05-11 09:00', provsql.normal(42.0, 3.0)),\n", " (6, 's2', '2026-05-11 09:00', provsql.uniform(20.0, 35.0)),\n", " (7, 's3', '2026-05-11 09:00', provsql.erlang(3, 0.08)),\n", " (8, 's4', '2026-05-11 09:00', 19.5);" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('historical_readings');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- A provenance mapping so the Studio eval-strip's sr_formula and\n", "-- PROV-XML export can label leaves with station names rather than\n", "-- raw UUIDs.\n", "DROP TABLE IF EXISTS station_mapping;\n", "DROP TABLE IF EXISTS station_mapping CASCADE;\n", "CREATE TABLE station_mapping AS\n", " SELECT s.name AS value, r.provsql AS provenance\n", " FROM readings r JOIN stations s ON s.id = r.station_id;\n", "SELECT remove_provenance('station_mapping');\n", "CREATE INDEX ON station_mapping(provenance);" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script creates the schema below and seeds the random-variable readings via the constructors documented in the chapter on continuous distributions. It is five tables:\n", "\n", "- `stations(id, name, district)` – four monitoring stations across two districts, provenance-tracked.\n", "- `readings(station_id, ts, pm25 random_variable)` – one `pm25` reading per station per timestamp; the `random_variable` carries the per-station noise model (normal, uniform, exponential, erlang, or a deterministic lifted from the reference station).\n", "- `calibration_status(station_id, p)` – Bernoulli probability that each station is in calibration on the day of interest.\n", "- `categories(name, lo, hi)` – three regulatory categories (*Good* / *Moderate* / *Unhealthy*) keyed by their interval bounds.\n", "- `historical_readings(...)` – same shape as `readings`, populated from yesterday's batch.\n", "\n", "The schema panel lists the fixture's six relations: the four provenance-tracked tables (`stations`, `calibration_status`, `readings`, `historical_readings`) carry the purple `prov` pill, `categories` is plain, and `station_mapping` is tagged `mapping`. The `pm25` column on `readings` and `historical_readings` is flagged with a terracotta `rv` pill: a heads-up that comparison and arithmetic operators on this column are intercepted by the planner hook and lifted into provenance gates, so a query like `pm25 > 35` produces a circuit rather than a Boolean.\n", "\n", "## Step 1: Inspect a Noisy Reading\n", "\n", "In the Studio query box:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT id, ts, pm25\n", "FROM readings\n", "WHERE station_id = 's1'\n", "ORDER BY ts" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result table renders `pm25` as a clickable `random_variable` cell carrying the underlying gate UUID. Click into a row's `pm25`: Studio switches to Circuit mode and renders the `gate_rv` leaf with the distribution-kind initial in the circle (*N* for a Normal, *U* for Uniform, *Exp* for Exponential, *Erl* for Erlang). Pick *Distribution profile* from the *Distribution* group of the eval strip and click `Run`: the panel returns $`\\mu`$ and $`\\sigma^2`$ headline stats and an inline histogram with a PDF/CDF toggle.\n", "\n", "The histogram is backed server-side by `rv_histogram`; pinning `provsql.monte_carlo_seed` in the Config panel (under *Provenance*) makes the shape reproducible across re-runs.\n", "\n", "## Step 2: A First Probabilistic Threshold\n", "\n", "The $`PM_{2.5}`$ *Unhealthy* category begins at 35.1. Find the rows whose reading might cross it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT id, station_id, ts\n", "FROM readings\n", "WHERE pm25 > 35" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because `pm25` is a random variable, the comparison is not a yes/no test: it stands for the *event* \"this reading exceeds 35\", and ProvSQL attaches that event to each row's provenance.\n", "\n", "Click into a result row's auto-added `provsql` cell. Circuit mode shows the Boolean wrapper (a `gate_times` over the row's input token and the `gate_cmp`); the cmp's child link reaches into the `gate_rv` from Step 1.\n", "\n", "The eval strip's `probability_evaluate` entry exposes the five compiled methods (see the chapter on probabilities). Pick `monte-carlo` and set `n = 10000`; the panel returns the probability with a Hoeffding confidence band. Pin `provsql.monte_carlo_seed = 42` in the Config panel and re-run: the result is now identical across runs. Toggle the seed back to `-1` and re-run to see the band shift between runs.\n", "\n", "## Step 3: The Simplifier in Action\n", "\n", "The planner hook emits the comparator as a raw `gate_cmp` regardless of what its operands look like. A *simplifier* pass then folds comparators whose answer can be decided from the operand support alone, for example, `U(10, 22) > 35` is universally false because the uniform's upper bound is below the threshold. The fold is controlled by `provsql.simplify_on_load` (default on), which the Config panel exposes under *Provenance*.\n", "\n", "Click into row 2's auto-added `provsql` cell from the Step 2 result (station `s2`, `pm25 ~ U(10, 22)`). With `provsql.simplify_on_load` on, the canvas shows a single `𝟘` (zero) gate: the simplifier resolved the comparator to a constant-false leaf and dropped the whole subtree. Toggle the GUC off in the Config panel and click the cell again: the canvas now shows the raw construction shape, a `gate_times` (`⊗`) over the row's input token `ι` and a `gate_cmp` (`>`) whose children are the `U(10, 22)` leaf and the constant `35`. Both views are semantically identical; the simplified view is what the semiring evaluators and the Monte-Carlo sampler actually consume.\n", "\n", "## Step 4: Calibration via Mixtures\n", "\n", "Each station has a probability of being mis-calibrated; a mis-calibrated unit over-reports by 20% (the *reading* it records is `1.2` times the true value). The corrected estimate of the true reading is therefore `pm25` with probability `p` (the station is in spec) and `pm25 / 1.2` with probability `1 - p` (the report needs to be scaled back). Express this as a Bernoulli mixture:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT r.id, r.station_id,\n", " provsql.mixture(cs.p, r.pm25, r.pm25 / 1.2) AS pm25_calibrated\n", "FROM readings r JOIN calibration_status cs USING (station_id)\n", "WHERE r.station_id = 's1'" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click into a result row's `pm25_calibrated` cell. Circuit mode renders the `gate_mixture` as a `Mix` node with three labelled outgoing edges (`p` / `x` / `y`) matching the SQL constructor's argument order: `p` points to the Bernoulli mixing probability, `x` to the in-spec arm, and `y` to the correction arm.\n", "\n", "The same node-inspector panel exposes `Distribution profile` on the mixture root. Because station `s1` is in spec 95% of the time, the histogram is dominated by the `N(28, 2)` arm and the out-of-spec `N(23.33, 1.667)` contributes only a small left shoulder rather than a visually distinct second mode; the panel headline reflects this with a mixture mean slightly below 28. To see clear bimodality, re-run the query with a larger calibration error, e.g. replace `r.pm25 / 1.2` with `r.pm25 / 2.0` so the out-of-spec arm folds to `N(14, 1)`, well separated from the in-spec `N(28, 2)`; the two peaks then show up distinctly on the histogram even at the 95%/5% weighting.\n", "\n", "## Step 5: Aggregation Over Random Variables\n", "\n", "Compute average $`PM_{2.5}`$ per district:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT s.district,\n", " avg(r.pm25) AS avg_pm25,\n", " sum(r.pm25) AS total_pm25\n", "FROM readings r JOIN stations s ON s.id = r.station_id\n", "GROUP BY s.district" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click into a row's `avg_pm25` cell. Circuit mode shows the `avg` lowering: a `gate_arith(DIV, num, denom)` over two `gate_arith(PLUS, …)` subtrees, each child a per-row `gate_mixture` produced by `rv_aggregate_semimod`. The right child of the outer division is the count of *included* rows under their per-row provenance: rows whose provenance is false contribute the additive identity to both numerator and denominator. Run *Distribution profile* on the root: the panel shows the per-district average as a tight distribution centred at the inclusion-weighted mean.\n", "\n", "## Step 6: Conditional Inference\n", "\n", "Re-open the filtered query from Step 2:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT id, station_id, ts, pm25\n", "FROM readings\n", "WHERE pm25 > 35\n", " AND station_id = 's1'" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click a result row's `pm25` cell. The eval strip's `Condition on` text input auto-presets to the row's provenance UUID, and the `Conditioned by:` badge underneath the input is active. Pick *Distribution profile* and run: the histogram now shows the *truncated* shape, restricted to the tail above `35`. Pick *Moment* with `k = 1` and `raw`: the panel returns the closed-form Mills-ratio mean of the [truncated normal](https://en.wikipedia.org/wiki/Truncated_normal_distribution), exactly $`\\mu + \\sigma \\cdot\n", "\\frac{\\phi(\\alpha)}{1 - \\Phi(\\alpha)}`$ with $`\\alpha = (35 - \\mu)/\\sigma`$. Click the active badge to clear the conditioning; the panel reverts to the unconditional mean $`\\mu`$. Click the muted badge to restore the row provenance.\n", "\n", "The closed-form truncation table covers Normal (Mills ratio), Uniform (intersected support), and Exponential (memorylessness on a lower bound or finite-interval truncation). For other shapes, the joint circuit between `pm25` and the row's provenance is loaded with shared `gate_rv` leaves correctly coupled, and the conditional moment is estimated by rejection sampling at budget `provsql.rv_mc_samples`.\n", "\n", "## Step 7: Diagnostic Sampling\n", "\n", "For raw inspection or downstream analytics, draw samples from the conditional distribution. With the same row pinned and the *Conditioned by* badge active, pick *Sample* from the *Distribution* group; set `n = 200` and run. The result panel shows a six-value inline preview with a \"show full list\" expander; clicking it dumps all 200 samples.\n", "\n", "For shapes that fall outside the closed-form table the sampler falls back to rejection sampling at the `provsql.rv_mc_samples` budget; if the conditioning event is so unlikely that fewer than `n` samples land inside that budget, the panel surfaces a hint pointing at the GUC, e.g. *Only 47 samples accepted within budget 10000; widen* `provsql.rv_mc_samples` *or loosen the conditioning*. Re-running with a larger budget (set `rv_mc_samples = 50000` in the Config panel) recovers the full batch.\n", "\n", "## Step 8: Combining Batches via UNION\n", "\n", "Both batches share the same id space (rows `1` through `8`, one per `(station, timestamp)` slot), so a `UNION` (without `ALL`) over `(station_id, id)` deduplicates a slot to a single result row whose provenance combines today's reading and yesterday's reading via the semiring addition. With `WHERE pm25 > 35` lifted on each branch, each contributing row carries a `gate_cmp(pm25 > 35)` of its own:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "(SELECT station_id, id FROM readings WHERE pm25 > 35)\n", "UNION\n", "(SELECT station_id, id FROM historical_readings WHERE pm25 > 35)\n", "ORDER BY station_id, id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pick a result row's `provsql` cell: Circuit mode shows a `gate_plus` (`⊕`) over the two contributing inputs (today's row and yesterday's row), each carrying its own `gate_cmp(pm25 > 35)` from the lifted `WHERE`. `probability_evaluate(provenance())` on the result gives the probability that *at least one* of the two days produced an Unhealthy reading for that slot. We deliberately keep the `random_variable` `pm25` column out of the `SELECT`: there is no duplicate-elimination semantics for `random_variable`, so a `UNION` over an RV column would have no well-defined meaning.\n", "\n", "## Step 9: Filtering Grouped Random Variables by Expected Value\n", "\n", "Filter the per-district aggregates from Step 5 by their expected average. Because `avg` over a `random_variable` column returns a `random_variable` (not an `agg_token`), and `expected` collapses it to a plain `double`, the HAVING qual is deterministic from the planner-hook's perspective; the rewrite leaves it for PostgreSQL to evaluate natively while still adding a `delta(gate_agg)` wrapper to each surviving group's provenance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT s.district, avg(r.pm25) AS avg_pm25\n", "FROM readings r JOIN stations s ON s.id = r.station_id\n", "GROUP BY s.district\n", "HAVING expected(avg(r.pm25)) > 20" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The inner `avg` is recognised as a `random_variable` aggregate (gate_arith DIV over per-row gate_mixture children, as in Step 5); `expected` collapses the distribution to its mean (Monte Carlo here, since the DIV gate has no closed-form evaluator); the `> 20` is a plain comparison on a `double`, so the row survives iff its expected average exceeds the threshold. For the case-study fixture both districts pass (centre at ≈ 25.5, east at ≈ 21.6); clicking either result row's `provsql` cell shows the `delta(gate_agg)` shape, identical to the no-HAVING aggregate from Step 5 but filtered to the surviving groups.\n", "\n", "## Step 10: Independent vs Monte Carlo\n", "\n", "For threshold queries whose contributing rows have structurally independent provenance, the `'independent'` probability method (see the chapter on probabilities) is *exact* and far cheaper than Monte Carlo. Compare the three available exact methods against `monte-carlo` on the Step 2 query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT id,\n", " probability_evaluate(provenance(), 'independent') AS p_ind,\n", " probability_evaluate(provenance(), 'monte-carlo', '10000') AS p_mc,\n", " probability_evaluate(provenance(), 'tree-decomposition') AS p_td\n", "FROM readings WHERE pm25 > 35" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Studio's eval strip exposes these methods directly; running each method against the same pinned subnode shows the analytic `independent` and `tree-decomposition` returning the same value to full precision, while `monte-carlo` returns a Hoeffding-bounded estimate that tightens as `n` grows.\n", "\n", "See the chapter on continuous distributions for the full surface and the Studio chapter for the Studio reference." ] } ] }