Case Study: The City Air-Quality Sensor Network
=================================================
This case study demonstrates ProvSQL's continuous-distribution
surface (see :doc:`continuous-distributions`) end-to-end through
ProvSQL Studio (see :doc:`studio`). It is the first case study
driven primarily by Studio rather than ``psql``: random variables
benefit far more from interactive visualisation – PDFs, CDFs,
mixture DAG layouts, conditional histograms, simplifier
before-vs-after views – than from text-mode output, and the workflow
below makes the rewriter, the simplifier, the analytic and
Monte-Carlo paths, and conditional inference all visible in the
canvas.
The Scenario
------------
A municipal observatory operates a small air-quality sensor
network. Sensors of three different vendors report a
:math:`PM_{2.5}` concentration (*fine particulate matter*, i.e.
airborne particles with aerodynamic diameter at most 2.5 μm,
expressed in micrograms per cubic metre) on a fixed schedule. The sensors differ in calibration and noise
characteristics:
* high-end units report ``Normal(μ, σ)`` with small σ;
* low-cost units report ``Uniform[μ−δ, μ+δ]`` over a small window;
* a drift-prone unit reports ``Exponential(λ)`` while its
internal hardware self-tests cycle;
* a multi-pass aggregating unit reports ``Erlang(k, λ)`` over the
pass count.
A reference station with a calibrated lab-grade instrument
contributes deterministic readings.
Regulatory categories partition the value axis: *Good* below 12,
*Moderate* between 12.1 and 35, *Unhealthy* from 35.1 upward (loosely
following the US EPA AQI breakpoints for PM2.5 in their pre-2024
form, simplified to three tiers). Each station has a Bernoulli
probability of being in calibration on a given day. A separate batch table of *historical* readings carries
the same shape so cross-batch queries via ``UNION`` are
meaningful.
Your tasks:
* inspect the per-row distributions and the rewriter's effect on
threshold queries;
* compute the probability that each station's reading exceeds an
*Unhealthy* threshold, exercising the planner-hook rewrite for
``WHERE reading > 35``;
* model calibration uncertainty as a Bernoulli mixture and inspect
the resulting ``gate_mixture`` shape;
* aggregate per-district readings and watch the simplifier fold
the mixture cascade;
* run conditional inference (``E[reading | reading > 35]``) and
see the closed-form truncated-distribution mean against the
unconditional one;
* filter on the expected value of an aggregated random variable,
combine today's and yesterday's batches with ``UNION``, and
compare probability methods (``'independent'`` vs
``'monte-carlo'`` vs ``'tree-decomposition'``) side by side.
Setup
-----
This case study assumes a working ProvSQL installation
(see :doc:`getting-provsql`) and a running ProvSQL Studio
session pointed at it (see :doc:`studio`). Download
:download:`setup.sql <../../casestudy6/setup.sql>` and load it
into a fresh PostgreSQL database::
createdb air_quality_demo
psql -d air_quality_demo -f setup.sql
The script creates the schema below and seeds the random-variable
readings via the constructors documented in
:doc:`continuous-distributions`. It creates five tables:
* ``stations(id, name, district)`` – four monitoring stations
across two districts, provenance-tracked.
* ``readings(station_id, ts, pm25 random_variable)`` – one
``pm25`` reading per station per timestamp; the
``random_variable`` carries the per-station noise model
(normal, uniform, exponential, erlang, or a deterministic
lifted from the reference station).
* ``calibration_status(station_id, p)`` – Bernoulli probability
that each station is in calibration on the day of interest.
* ``categories(name, lo, hi)`` – three regulatory categories
(*Good* / *Moderate* / *Unhealthy*) keyed by their interval
bounds.
* ``historical_readings(...)`` – same shape as ``readings``,
populated from yesterday's batch.
Connect Studio to the fixture::
provsql-studio --dsn postgresql:///air_quality_demo
and open http://127.0.0.1:8000/ in a
browser. The schema panel lists the fixture's six relations: the
four provenance-tracked tables (``stations``,
``calibration_status``, ``readings``, ``historical_readings``)
carry the purple :sc:`prov` pill, ``categories`` is plain, and
``station_mapping`` is tagged :sc:`mapping`. The ``pm25`` column
on ``readings`` and ``historical_readings`` is flagged with a
terracotta :sc:`rv` pill: a heads-up that comparison and
arithmetic operators on this column are intercepted by the
planner hook and lifted into provenance gates, so a query like
``pm25 > 35`` produces a circuit rather than a Boolean.
.. figure:: /_static/casestudy6/schema-panel.png
:alt: Studio schema panel listing readings, historical_readings,
calibration_status, stations (all PROV-tagged), categories
(no provenance), and the station_mapping table, with the
pm25 column on readings and historical_readings flagged
with a small RV pill.
The schema panel opened from the top nav. The four
provenance-tracked tables carry the purple :sc:`prov` pill;
``readings`` and ``historical_readings`` list ``pm25`` with a
terracotta :sc:`rv` pill marking it as a ``random_variable``
column.
Step 1: Inspect a Noisy Reading
--------------------------------
In the Studio query box:
.. code-block:: postgresql
SELECT id, ts, pm25
FROM readings
WHERE station_id = 's1'
ORDER BY ts
The result table renders ``pm25`` as a clickable ``random_variable``
cell carrying the underlying gate UUID. Click into a row's
``pm25``: Studio switches to Circuit mode and renders the
``gate_rv`` leaf with the distribution-kind initial in the circle
(*N* for a Normal, *U* for Uniform, *Exp* for Exponential, *Erl*
for Erlang).
Pick *Distribution profile* from the *Distribution* group of the
eval strip and click :guilabel:`Run`: the panel returns
:math:`\mu` and :math:`\sigma^2` headline stats and an inline
histogram with a PDF/CDF toggle.
The histogram is backed server-side by :sqlfunc:`rv_histogram`;
pinning ``provsql.monte_carlo_seed`` in the Config panel (under
*Provenance*) makes the shape reproducible across re-runs.
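The headline stats are easy to reproduce offline. A minimal
sketch, assuming the gate draws i.i.d. samples from the station's
``N(28, 2)`` noise model (the exact histogram is Studio's, not
this script's):

.. code-block:: python

   import math
   import random

   random.seed(42)  # analogous to pinning provsql.monte_carlo_seed

   mu, sigma = 28.0, 2.0
   samples = [random.gauss(mu, sigma) for _ in range(10_000)]

   mean = sum(samples) / len(samples)
   var = sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)

   # The sampled headline stats should sit close to the analytic mu
   # and sigma^2, just as the histogram envelope rides the PDF curve.
   print(round(mean, 1), round(var, 1))

With the seed pinned, re-running prints the same pair every time,
which is exactly the reproducibility the Config panel buys you.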
.. figure:: /_static/casestudy6/gate-rv-distribution-profile.png
:alt: Studio Circuit mode showing the gate_rv N(28,2) leaf at
the top of the canvas and the Distribution profile
eval-strip panel below, with mu=28, sigma=2, support
(-infinity, +infinity) and an inline histogram overlaid
by a terracotta analytical PDF curve.
The ``gate_rv`` leaf for ``pm25`` on row 1 is a ``N(28, 2)``
circle; the eval-strip *Distribution profile* panel shows the
:math:`\mu`, :math:`\sigma`, support, and an inline histogram.
When the gate has a closed-form family (here a bare Normal),
Studio overlays the analytical PDF on top of the bars; the
curve rides the histogram envelope so any mismatch between the
sampled histogram and the closed-form shape is immediately
visible.
Step 2: A First Probabilistic Threshold
----------------------------------------
The :math:`PM_{2.5}` *Unhealthy* category begins at 35.1. Find
the rows whose reading might cross it:
.. code-block:: postgresql
SELECT id, station_id, ts
FROM readings
WHERE pm25 > 35
Because ``pm25`` is a random variable, the comparison is not a
yes/no test: it stands for the *event* "this reading exceeds 35",
and ProvSQL attaches that event to each row's provenance.
Click into a result row's auto-added ``provsql`` cell. Circuit mode shows the
Boolean wrapper (a ``gate_times`` over the row's input token and
the ``gate_cmp``); the cmp's child link reaches into the
``gate_rv`` from Step 1.
The eval strip's :sqlfunc:`probability_evaluate` entry exposes the
five compiled methods (see :doc:`probabilities`). Pick
``monte-carlo`` and set ``n = 10000``; the panel returns the
probability with a Hoeffding confidence band. Pin
``provsql.monte_carlo_seed = 42`` in the Config panel and re-run:
the result is now identical across runs. Toggle the seed back to
``-1`` and re-run to see the band shift between runs.
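The numbers the panel returns can be sanity-checked by hand. A
minimal sketch, assuming the row's reading is the fixture's
``N(40, 4)`` station and a 95% two-sided Hoeffding band:

.. code-block:: python

   import math
   import random

   def tail_prob_exact(mu, sigma, t):
       """P(N(mu, sigma) > t) from the normal CDF."""
       return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2)))

   def tail_prob_mc(mu, sigma, t, n, seed):
       random.seed(seed)  # a pinned seed makes the estimate reproducible
       hits = sum(1 for _ in range(n) if random.gauss(mu, sigma) > t)
       # two-sided Hoeffding half-width at confidence 1 - 0.05
       eps = math.sqrt(math.log(2 / 0.05) / (2 * n))
       return hits / n, eps

   p = tail_prob_exact(40, 4, 35)                     # ~ 0.8944
   p_hat, eps = tail_prob_mc(40, 4, 35, n=10_000, seed=42)
   print(f"exact {p:.4f}  mc {p_hat:.4f} ± {eps:.4f}")

The Monte-Carlo estimate lands inside the band around the exact
tail probability; unpinning the seed makes ``p_hat`` wander within
it from run to run.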
.. figure:: /_static/casestudy6/cmp-circuit-mc-eval.png
:alt: Studio Circuit mode showing the gate_times wrapper above
the iota input gate (left) and the > gate_cmp (right);
the cmp's children are the N(28,2) gate_rv and the
constant 35.
The provenance of one row from ``WHERE pm25 > 35``: a
``gate_times`` (``⊗``) wraps the row's input token ``ι`` and a
``gate_cmp`` ``>`` whose children are the ``N(28, 2)`` leaf and
the constant ``35``. The eval strip below switches to
``probability_evaluate`` and exposes the method picker.
Step 3: The Simplifier in Action
---------------------------------
The planner hook emits the comparator as a raw ``gate_cmp``
regardless of what its operands look like. A *simplifier* pass
then folds comparators whose answer can be decided from the
operand support alone — for example, ``U(10, 22) > 35`` is
universally false because the uniform's upper bound is below the
threshold. The fold is controlled by ``provsql.simplify_on_load``
(default on), which the Config panel exposes under *Provenance*.
Click into row 2's auto-added ``provsql`` cell from the Step 2
result (station ``s2``, ``pm25 ~ U(10, 22)``). With
``provsql.simplify_on_load`` on, the canvas shows a single
``𝟘`` (zero) gate: the simplifier resolved the comparator to a
constant-false leaf and dropped the whole subtree. Toggle the
GUC off in the Config panel and click the cell again: the canvas
now shows the raw construction shape, a ``gate_times`` (``⊗``)
over the row's input token ``ι`` and a ``gate_cmp`` (``>``)
whose children are the ``U(10, 22)`` leaf and the constant
``35``. Both views are semantically identical; the simplified
view is what the semiring evaluators and the Monte-Carlo sampler
actually consume.
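The fold itself is a one-look support test. A minimal sketch of
the idea (a hypothetical helper, not ProvSQL's internal
representation):

.. code-block:: python

   def fold_gt(lo, hi, threshold):
       """Fold `X > threshold` when X's support [lo, hi] decides it."""
       if lo > threshold:
           return "one"       # universally true: multiplicative identity
       if hi <= threshold:
           return "zero"      # universally false: additive identity
       return "gate_cmp"      # undecided: keep the raw comparator gate

   inf = float("inf")
   print(fold_gt(10, 22, 35))     # -> zero   (U(10, 22): upper bound < 35)
   print(fold_gt(36, 44, 35))     # -> one    (support entirely above 35)
   print(fold_gt(-inf, inf, 35))  # -> gate_cmp  (a Normal decides nothing)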
.. figure:: /_static/casestudy6/simplify-before-after.png
:alt: Side-by-side: on the left, the raw circuit for row 2's
provenance with provsql.simplify_on_load off, showing a
gate_times over an input token and a > gate_cmp whose
children are U(10,22) and 35; on the right, the same
circuit with the GUC on collapsed to a single zero gate.
Row 2's provenance for ``pm25 > 35`` with
``provsql.simplify_on_load`` toggled off (left) vs on (right).
The simplifier recognised that the upper bound of
``U(10, 22)`` is below the threshold, so the comparator is
universally false and the whole subtree collapses to the
additive identity ``𝟘``.
Step 4: Calibration via Mixtures
---------------------------------
Each station has a probability of being mis-calibrated; a
mis-calibrated unit over-reports by 20% (the *reading* it
records is ``1.2`` times the true value). The corrected estimate
of the true reading is therefore ``pm25`` with probability ``p``
(the station is in spec) and ``pm25 / 1.2`` with probability
``1 - p`` (the report needs to be scaled back). Express this as
a Bernoulli mixture:
.. code-block:: postgresql
SELECT r.id, r.station_id,
provsql.mixture(cs.p, r.pm25, r.pm25 / 1.2) AS pm25_calibrated
FROM readings r JOIN calibration_status cs USING (station_id)
WHERE r.station_id = 's1'
Click into a result row's ``pm25_calibrated`` cell. Circuit mode
renders the ``gate_mixture`` as a ``Mix`` node with three
labelled outgoing edges (``p`` / ``x`` / ``y``) matching the SQL
constructor's argument order: ``p`` points to the Bernoulli
mixing probability, ``x`` to the in-spec arm, and ``y`` to the
correction arm.
The same node-inspector panel exposes ``Distribution profile``
on the mixture root. Because station ``s1`` is in spec 95% of
the time, the histogram is dominated by the ``N(28, 2)`` arm and
the out-of-spec ``N(23.33, 1.667)`` contributes only a small
left shoulder rather than a visually distinct second mode; the
panel headline reflects this with a mixture mean slightly below
28. To see clear bimodality, re-run the query with a larger
calibration error — e.g. replace ``r.pm25 / 1.2`` with
``r.pm25 / 2.0`` so the out-of-spec arm folds to ``N(14, 1)``,
well separated from the in-spec ``N(28, 2)``; the two peaks
then show up distinctly on the histogram even at the 95%/5%
weighting.
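The panel's headline is quick to verify with the law of total
variance. A sketch using ``s1``'s parameters (the panel's exact
formatting is Studio's own):

.. code-block:: python

   p = 0.95
   mu_x, var_x = 28.0, 2.0 ** 2                 # in-spec arm N(28, 2)
   mu_y, var_y = 28.0 / 1.2, (2.0 / 1.2) ** 2   # back-scaled ~ N(23.33, 1.667)

   mix_mean = p * mu_x + (1 - p) * mu_y
   # law of total variance: mean of arm variances + variance of arm means
   mix_var = (p * var_x + (1 - p) * var_y
              + p * (mu_x - mix_mean) ** 2 + (1 - p) * (mu_y - mix_mean) ** 2)

   print(round(mix_mean, 2), round(mix_var, 2))   # -> 27.77 4.97

The mixture mean sits just below 28, as the dominant in-spec arm
predicts, and the variance picks up a small between-arm term.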
.. figure:: /_static/casestudy6/mixture-node.png
:alt: Mix node with three labelled outgoing edges (p, x, y);
the p child is a 95% Bernoulli, the x child is the
N(28,2) reading, and the y child is the back-scaled
N(23.33,1.667) folded by the simplifier.
The ``gate_mixture`` for the calibrated reading. The
``95%`` child is the Bernoulli probability that station ``s1``
is in spec; the ``x`` arm is the raw reading ``N(28, 2)``; the
``y`` arm is the back-scaled estimate ``pm25 / 1.2``, which the
simplifier folded through the Normal affine-shift rule into a
single ``gate_rv`` ``N(23.33, 1.667)``. Circle labels show four
significant figures; the inspector pinned to either child
surfaces the full-precision parameters.
Step 5: Aggregation Over Random Variables
------------------------------------------
Compute average :math:`PM_{2.5}` per district:
.. code-block:: postgresql
SELECT s.district,
avg(r.pm25) AS avg_pm25,
sum(r.pm25) AS total_pm25
FROM readings r JOIN stations s ON s.id = r.station_id
GROUP BY s.district
Click into a row's ``avg_pm25`` cell. Circuit mode shows the
:sqlfunc:`avg` lowering: a ``gate_arith(DIV, num, denom)``
over two ``gate_arith(PLUS, …)`` subtrees, each child a per-row
``gate_mixture`` produced by ``rv_aggregate_semimod``.
The right child of the outer division is the count of *included*
rows under their per-row provenance: rows whose provenance is
false contribute the additive identity to both numerator and
denominator. Run *Distribution profile* on the root: the panel
shows the per-district average as a tight distribution centred at
the inclusion-weighted mean.
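The numerator/denominator lowering can be mimicked by sampling
each row's Boolean provenance alongside its reading. A minimal
sketch with illustrative inclusion probabilities (not the
fixture's actual provenance):

.. code-block:: python

   import random

   random.seed(7)

   # per-row (inclusion probability, reading sampler) -- illustrative
   rows = [
       (1.0, lambda: random.gauss(28, 2)),
       (1.0, lambda: random.gauss(40, 4)),
       (0.5, lambda: random.uniform(12, 24)),
   ]

   def sample_avg():
       num = den = 0.0
       for q, draw in rows:
           if random.random() < q:  # the row's provenance fires
               num += draw()        # numerator: sum of included readings
               den += 1.0           # denominator: count of included rows
       return num / den if den else 0.0

   est = sum(sample_avg() for _ in range(20_000)) / 20_000
   # E[avg] = 0.5 * (28 + 40) / 2 + 0.5 * (28 + 40 + 18) / 3 ~= 31.33
   print(round(est, 1))

Excluded rows contribute the additive identity to both sides of
the division, so the average is taken over the rows that are
actually in.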
.. figure:: /_static/casestudy6/sum-of-mixtures.png
:alt: DAG with a single root division node, two PLUS subtrees
under it, eight Mix nodes at the next level, and the
gate_rv leaves (N(28,2), N(40,4), U(12,24), U(10,22))
plus iota input gates at the bottom.
The ``avg(pm25)`` cell for the *centre* district lowers to
``gate_arith(DIV, gate_arith(PLUS, mixtures), gate_arith(PLUS,
one-mixtures))``. The eight mixtures correspond to the four
stations × two timestamps that fall in the district; the
``gate_rv`` leaves at the bottom are the per-reading
distributions; the ``ι`` leaves anchor each row's provenance.
Step 6: Conditional Inference
------------------------------
Re-open the filtered query from Step 2:
.. code-block:: postgresql
SELECT id, station_id, ts, pm25
FROM readings
WHERE pm25 > 35
AND station_id = 's1'
Click a result row's ``pm25`` cell. The eval strip's
:guilabel:`Condition on` text input auto-presets to the row's
provenance UUID, and the :guilabel:`Conditioned by:` badge
underneath the input is active. Pick *Distribution profile* and
run: the histogram now shows the *truncated* shape, restricted to
the tail above ``35``. Pick *Moment* with ``k = 1`` and ``raw``:
the panel returns the closed-form Mills-ratio mean of the
truncated normal distribution,
exactly :math:`\mu + \sigma \cdot
\frac{\phi(\alpha)}{1 - \Phi(\alpha)}` with
:math:`\alpha = (35 - \mu)/\sigma`. Click the active badge to
clear the conditioning; the panel reverts to the unconditional
mean :math:`\mu`. Click the muted badge to restore the row
provenance.
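The Mills-ratio mean needs only the standard library to check. A
sketch using row 5's ``N(40, 4)`` and the threshold ``35``:

.. code-block:: python

   import math

   def phi(x):
       """Standard normal PDF."""
       return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

   def Phi(x):
       """Standard normal CDF."""
       return 0.5 * (1 + math.erf(x / math.sqrt(2)))

   def truncated_mean(mu, sigma, t):
       """E[X | X > t] for X ~ N(mu, sigma): mu + sigma*phi(a)/(1 - Phi(a))."""
       a = (t - mu) / sigma
       return mu + sigma * phi(a) / (1 - Phi(a))

   print(round(truncated_mean(40, 4, 35), 2))    # -> 40.82 (conditional mean)
   print(round(truncated_mean(40, 4, -1e9), 2))  # no effective truncation: mu

Clearing the conditioning corresponds to the second call: with the
threshold far below the support, the Mills-ratio correction
vanishes and the mean reverts to :math:`\mu`.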
The closed-form truncation table covers Normal (Mills ratio),
Uniform (intersected support), and Exponential (memorylessness on
a lower bound or finite-interval truncation). For other shapes,
the joint circuit between ``pm25`` and the row's provenance is
loaded with shared ``gate_rv`` leaves correctly coupled, and the
conditional moment is estimated by rejection sampling at budget
``provsql.rv_mc_samples``.
.. figure:: /_static/casestudy6/condition-on-active.png
:alt: The Distribution profile panel showing supp [35, +infinity],
mu approximately 40.82, sigma approximately 3.35; the
Condition on input is populated with the row's provenance
UUID and the Conditioned by badge is active.
The conditional distribution profile for row 5 (``pm25 ∼
N(40, 4)``) under the event ``pm25 > 35``. Studio auto-presets
the *Condition on* input with the row's provenance UUID and
activates the *Conditioned by* badge; the panel's header
reflects the truncated support ``[35, +∞]`` and the
Mills-ratio mean ``μ ≈ 40.82``, ``σ ≈ 3.35`` (closed form on
the truncated normal).
Step 7: Diagnostic Sampling
----------------------------
For raw inspection or downstream analytics, draw samples from the
conditional distribution. With the same row pinned and the
*Conditioned by* badge active, pick *Sample* from the
*Distribution* group; set ``n = 200`` and run. The result panel
shows a six-value inline preview with a "show full list"
expander; clicking it dumps all 200 samples.
For shapes that fall outside the closed-form table the sampler
falls back to rejection sampling at the
``provsql.rv_mc_samples`` budget; if the conditioning event is
so unlikely that fewer than ``n`` samples land inside that
budget, the panel surfaces a hint pointing at the GUC, e.g.
*Only 47 samples accepted within budget 10000; widen*
``provsql.rv_mc_samples`` *or loosen the conditioning*.
Re-running with a larger budget (set ``rv_mc_samples = 50000``
in the Config panel) recovers the full batch.
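The fallback path is a plain accept/reject loop against the
budget. A sketch under that assumption (names and the hint text
are illustrative, not ProvSQL's internals); with ``N(28, 2)`` the
event ``> 35`` is a deep tail, so the budget runs out almost
immediately:

.. code-block:: python

   import random

   random.seed(42)

   def conditional_sample(n, budget, mu=28.0, sigma=2.0, threshold=35.0):
       """Draw up to n samples of pm25 | pm25 > threshold within a budget."""
       accepted = []
       for _ in range(budget):
           x = random.gauss(mu, sigma)
           if x > threshold:            # keep only draws inside the event
               accepted.append(x)
               if len(accepted) == n:
                   break
       return accepted

   got = conditional_sample(n=200, budget=10_000)
   if len(got) < 200:
       # the analogue of Studio's "widen provsql.rv_mc_samples" hint
       print(f"only {len(got)} samples accepted within budget 10000")

P(N(28, 2) > 35) is about 0.0002, so roughly two of the ten
thousand draws land in the event: either widen the budget or
loosen the conditioning, exactly as the panel hint says.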
Step 8: Combining Batches via UNION
------------------------------------
Both batches share the same id space (rows ``1`` through ``8``,
one per ``(station, timestamp)`` slot), so a ``UNION`` (without
``ALL``) over ``(station_id, id)`` deduplicates a slot to a
single result row whose provenance combines today's reading and
yesterday's reading via the semiring addition. With
``WHERE pm25 > 35`` lifted on each branch, each contributing row
carries a ``gate_cmp(pm25 > 35)`` of its own:
.. code-block:: postgresql
(SELECT station_id, id FROM readings WHERE pm25 > 35)
UNION
(SELECT station_id, id FROM historical_readings WHERE pm25 > 35)
ORDER BY station_id, id
Pick a result row's ``provsql`` cell: Circuit mode shows a
``gate_plus`` (``⊕``) over the two contributing inputs (today's
row and yesterday's row), each carrying its own
``gate_cmp(pm25 > 35)`` from the lifted ``WHERE``.
``probability_evaluate(provenance())`` on the result gives the
probability that *at least one* of the two days produced an
Unhealthy reading for that slot. We deliberately keep the
``random_variable`` ``pm25`` column out of the ``SELECT``: there
is no duplicate-elimination semantics for ``random_variable``,
so a ``UNION`` over an RV column would have no well-defined
meaning.
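With independent days, the ``gate_plus`` evaluates as a noisy-or
over the two per-day events. A sketch with assumed per-day
parameters for one slot (today ``N(40, 4)``, yesterday
``N(38, 4)``; neither value is taken from the fixture):

.. code-block:: python

   import math

   def tail(mu, sigma, t):
       """P(N(mu, sigma) > t)."""
       return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2)))

   p_today = tail(40, 4, 35)       # ~ 0.894
   p_yesterday = tail(38, 4, 35)   # ~ 0.773
   # P(at least one Unhealthy day) = 1 - P(neither day exceeds 35)
   p_either = 1 - (1 - p_today) * (1 - p_yesterday)
   print(round(p_either, 3))       # -> 0.976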
Step 9: Filtering Grouped Random Variables by Expected Value
-------------------------------------------------------------
Filter the per-district aggregates from Step 5 by their expected
average. Because :sqlfunc:`avg` over a ``random_variable`` column
returns a ``random_variable`` (not an ``agg_token``), and
:sqlfunc:`expected` collapses it to a plain ``double``, the
HAVING qual is deterministic from the planner-hook's perspective;
the rewrite leaves it for PostgreSQL to evaluate natively while
still adding a ``delta(gate_agg)`` wrapper to each surviving
group's provenance:
.. code-block:: postgresql
SELECT s.district, avg(r.pm25) AS avg_pm25
FROM readings r JOIN stations s ON s.id = r.station_id
GROUP BY s.district
HAVING expected(avg(r.pm25)) > 20
The inner :sqlfunc:`avg` is recognised as a ``random_variable``
aggregate (gate_arith DIV over per-row gate_mixture children, as
in Step 5); :sqlfunc:`expected` collapses the distribution to its
mean (Monte Carlo here, since the DIV gate has no closed-form
evaluator); the ``> 20`` is a plain comparison on a ``double``,
so the row survives iff its expected average exceeds the
threshold. For the case-study fixture both districts pass
(centre at ≈ 25.5, east at ≈ 21.6); clicking either result row's
``provsql`` cell shows the ``delta(gate_agg)`` shape, identical
to the no-HAVING aggregate from Step 5 but filtered to the
surviving groups.
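The ``expected`` collapse can be mimicked by Monte Carlo over the
per-row distributions. A sketch using the four distributions shown
in the *centre*-district figure from Step 5, whose means average
to 25.5:

.. code-block:: python

   import random

   random.seed(1)

   # the four per-reading distributions from the centre-district figure
   samplers = [
       lambda: random.gauss(28, 2),
       lambda: random.gauss(40, 4),
       lambda: random.uniform(12, 24),
       lambda: random.uniform(10, 22),
   ]

   n = 20_000
   est = sum(sum(s() for s in samplers) / len(samplers) for _ in range(n)) / n
   # expected average ~= (28 + 40 + 18 + 16) / 4 = 25.5, so the group passes
   print(round(est, 1), est > 20)

Once the distribution collapses to this plain double, the ``> 20``
in the HAVING qual is an ordinary deterministic comparison.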
Step 10: Independent vs Monte Carlo
------------------------------------
For threshold queries whose contributing rows have structurally
independent provenance, the ``'independent'`` probability method
(see :doc:`probabilities`) is *exact* and far cheaper than Monte
Carlo. Compare the two exact methods against ``monte-carlo`` on
the Step 2 query:
.. code-block:: postgresql
SELECT id,
probability_evaluate(provenance(), 'independent') AS p_ind,
probability_evaluate(provenance(), 'monte-carlo', '10000') AS p_mc,
probability_evaluate(provenance(), 'tree-decomposition') AS p_td
FROM readings WHERE pm25 > 35
Studio's eval strip exposes these methods directly; running each
method against the same pinned subnode shows the analytic
``independent`` and ``tree-decomposition`` returning the same
value to full precision, while ``monte-carlo`` returns a
Hoeffding-bounded estimate that tightens as ``n`` grows.
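The tightening is quantifiable: the Hoeffding half-width shrinks
as :math:`\sqrt{\ln(2/\delta)/(2n)}`. A quick check at 95%
confidence:

.. code-block:: python

   import math

   def hoeffding_eps(n, delta=0.05):
       """Two-sided Hoeffding half-width at confidence 1 - delta."""
       return math.sqrt(math.log(2 / delta) / (2 * n))

   # each 10x increase in n shrinks the band by a factor of sqrt(10)
   for n in (1_000, 10_000, 100_000):
       print(n, round(hoeffding_eps(n), 4))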
See :doc:`continuous-distributions` for the full surface and
:doc:`studio` for the Studio reference.