Case Study: The City Air-Quality Sensor Network
===============================================

This case study demonstrates ProvSQL's continuous-distribution surface (see
:doc:`continuous-distributions`) end-to-end through ProvSQL Studio (see
:doc:`studio`). It is the first case study driven primarily by Studio rather
than ``psql``: random variables benefit far more from interactive
visualisation – PDFs, CDFs, mixture DAG layouts, conditional histograms,
simplifier before-vs-after – than from text-mode output, and the workflow
below makes the rewriter, the simplifier, the analytic and Monte-Carlo paths,
and conditional inference all visible in the canvas.

The Scenario
------------

A municipal observatory operates a small air-quality sensor network. Sensors
from three different vendors report a :math:`PM_{2.5}` concentration (*fine
particulate matter*, i.e. airborne particles with aerodynamic diameter at
most 2.5 μm, expressed in micrograms per cubic metre) on a fixed schedule.
The sensors differ in calibration and noise characteristics:

* high-end units report ``Normal(μ, σ)`` with small σ;
* low-cost units report ``Uniform[μ−δ, μ+δ]`` over a small window;
* a drift-prone unit reports ``Exponential(λ)`` while its internal hardware
  self-tests cycle;
* a multi-pass aggregating unit reports ``Erlang(k, λ)`` over the pass count.

A reference station with a calibrated lab-grade instrument contributes
deterministic readings. Regulatory categories partition the value axis:
*Good* below 12, *Moderate* between 12.1 and 35, *Unhealthy* above 35.1
(loosely following the US EPA AQI breakpoints for PM2.5 in their pre-2024
form, simplified to three tiers). Each station has a Bernoulli probability of
being in calibration on a given day. A separate batch table of *historical*
readings carries the same shape, so cross-batch queries via ``UNION ALL`` are
meaningful.
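The four noise models can be sketched as plain-Python samplers. This is an
illustration only: the Normal and Uniform parameters below match the fixture
leaves seen later in this case study (``N(28, 2)``, ``U(10, 22)``), while the
Exponential and Erlang parameters are hypothetical placeholders — the actual
seeding lives in ``setup.sql``. Note in particular that an Erlang(k, λ)
reading is the sum of k Exponential(λ) passes.

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility

def normal_reading(mu=28.0, sigma=2.0):
    """High-end unit: Normal(mu, sigma)."""
    return rng.gauss(mu, sigma)

def uniform_reading(mu=16.0, delta=6.0):
    """Low-cost unit: Uniform[mu - delta, mu + delta] (here U(10, 22))."""
    return rng.uniform(mu - delta, mu + delta)

def exponential_reading(lam=0.05):
    """Drift-prone unit: Exponential(lambda); lambda is a placeholder value."""
    return rng.expovariate(lam)

def erlang_reading(k=4, lam=0.2):
    """Multi-pass unit: Erlang(k, lambda) = sum of k Exponential(lambda) passes."""
    return sum(rng.expovariate(lam) for _ in range(k))

# the uniform unit never leaves its window
samples = [uniform_reading() for _ in range(1000)]
assert all(10.0 <= s <= 22.0 for s in samples)
```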
Your tasks:

* inspect the per-row distributions and the rewriter's effect on threshold
  queries;
* compute the probability that each station's reading exceeds the *Unhealthy*
  threshold, exercising the planner-hook rewrite for ``WHERE reading > 35``;
* model calibration uncertainty as a Bernoulli mixture and inspect the
  resulting ``gate_mixture`` shape;
* aggregate per-district readings and watch the simplifier fold the mixture
  cascade;
* run conditional inference (``E[reading | reading > 35]``) and compare the
  closed-form truncated-distribution mean against the unconditional one;
* filter on the expected value of an aggregated random variable, combine
  today's and yesterday's batches with ``UNION ALL``, and compare probability
  methods (``'independent'`` vs ``'monte-carlo'`` vs ``'tree-decomposition'``)
  side by side.

Setup
-----

This case study assumes a working ProvSQL installation (see
:doc:`getting-provsql`) and a running ProvSQL Studio session pointed at it
(see :doc:`studio`). Download :download:`setup.sql <../../casestudy6/setup.sql>`
and load it into a fresh PostgreSQL database::

   createdb air_quality_demo
   psql -d air_quality_demo -f setup.sql

The script creates the schema below and seeds the random-variable readings
via the constructors documented in :doc:`continuous-distributions`. It
consists of five tables:

* ``stations(id, name, district)`` – four monitoring stations across two
  districts, provenance-tracked.
* ``readings(station_id, ts, pm25 random_variable)`` – one ``pm25`` reading
  per station per timestamp; the ``random_variable`` carries the per-station
  noise model (normal, uniform, exponential, Erlang, or a deterministic value
  lifted from the reference station).
* ``calibration_status(station_id, p)`` – the Bernoulli probability that each
  station is in calibration on the day of interest.
* ``categories(name, lo, hi)`` – three regulatory categories (*Good* /
  *Moderate* / *Unhealthy*) keyed by their interval bounds.
* ``historical_readings(...)`` – same shape as ``readings``, populated from
  yesterday's batch.

Connect Studio to the fixture::

   provsql-studio --dsn postgresql:///air_quality_demo

and open ``http://127.0.0.1:8000/`` in a browser. The schema panel lists the
fixture's six relations: the four provenance-tracked tables (``stations``,
``calibration_status``, ``readings``, ``historical_readings``) carry the
purple :sc:`prov` pill, ``categories`` is plain, and ``station_mapping`` is
tagged :sc:`mapping`. The ``pm25`` column on ``readings`` and
``historical_readings`` is flagged with a terracotta :sc:`rv` pill: a
heads-up that comparison and arithmetic operators on this column are
intercepted by the planner hook and lifted into provenance gates, so a query
like ``pm25 > 35`` produces a circuit rather than a Boolean.

.. figure:: /_static/casestudy6/schema-panel.png
   :alt: Studio schema panel listing readings, historical_readings,
         calibration_status, stations (all PROV-tagged), categories (no
         provenance), and the station_mapping table, with the pm25 column on
         readings and historical_readings flagged with a small RV pill.

   The schema panel opened from the top nav. The four provenance-tracked
   tables carry the purple :sc:`prov` pill; ``readings`` and
   ``historical_readings`` list ``pm25`` with a terracotta :sc:`rv` pill
   marking it as a ``random_variable`` column.

Step 1: Inspect a Noisy Reading
-------------------------------

In the Studio query box:

.. code-block:: postgresql

   SELECT id, ts, pm25
   FROM readings
   WHERE station_id = 's1'
   ORDER BY ts

The result table renders ``pm25`` as a clickable ``random_variable`` cell
carrying the underlying gate UUID. Click into a row's ``pm25``: Studio
switches to Circuit mode and renders the ``gate_rv`` leaf with the
distribution-kind initial in the circle (*N* for a Normal, *U* for Uniform,
*Exp* for Exponential, *Erl* for Erlang).
Pick *Distribution profile* from the *Distribution* group of the eval strip
and click :guilabel:`Run`: the panel returns :math:`\mu` and :math:`\sigma^2`
headline stats and an inline histogram with a PDF/CDF toggle. The histogram
is backed server-side by :sqlfunc:`rv_histogram`; pinning
``provsql.monte_carlo_seed`` in the Config panel (under *Provenance*) makes
the shape reproducible across re-runs.

.. figure:: /_static/casestudy6/gate-rv-distribution-profile.png
   :alt: Studio Circuit mode showing the gate_rv N(28,2) leaf at the top of
         the canvas and the Distribution profile eval-strip panel below,
         with mu=28, sigma=2, support (-infinity, +infinity) and an inline
         histogram overlaid by a terracotta analytical PDF curve.

   The ``gate_rv`` leaf for ``pm25`` on row 1 is a ``N(28, 2)`` circle; the
   eval-strip *Distribution profile* panel shows :math:`\mu`, :math:`\sigma`,
   the support, and an inline histogram.

When the gate has a closed-form family (here a bare Normal), Studio overlays
the analytical PDF on top of the bars; the curve rides the histogram
envelope, so any mismatch between the sampled histogram and the closed-form
shape is immediately visible.

Step 2: A First Probabilistic Threshold
---------------------------------------

The :math:`PM_{2.5}` *Unhealthy* category begins at 35.1. Find the rows
whose reading might cross it:

.. code-block:: postgresql

   SELECT id, station_id, ts
   FROM readings
   WHERE pm25 > 35

Because ``pm25`` is a random variable, the comparison is not a yes/no test:
it stands for the *event* "this reading exceeds 35", and ProvSQL attaches
that event to each row's provenance. Click into a result row's auto-added
``provsql`` cell. Circuit mode shows the Boolean wrapper (a ``gate_times``
over the row's input token and the ``gate_cmp``); the cmp's child link
reaches into the ``gate_rv`` from Step 1. The eval strip's
:sqlfunc:`probability_evaluate` entry exposes the five compiled methods (see
:doc:`probabilities`).
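What any of those methods must return for a Normal leaf can be checked by
hand: for ``N(μ, σ)``, the event "reading > t" has probability
:math:`1 - \Phi((t - \mu)/\sigma)`. A standard-library sketch, using the
``N(28, 2)`` leaf from Step 1 and the ``N(40, 4)`` leaf that appears later in
Step 6:

```python
from statistics import NormalDist

def p_exceeds(mu, sigma, t):
    """P(N(mu, sigma) > t) = 1 - Phi((t - mu) / sigma)."""
    return 1.0 - NormalDist(mu, sigma).cdf(t)

# s1's first reading, N(28, 2): crossing 35 is a ~3.5-sigma event
print(round(p_exceeds(28, 2, 35), 6))   # ≈ 0.000233
# a hotter leaf, N(40, 4): crossing 35 is likely
print(round(p_exceeds(40, 4, 35), 4))   # ≈ 0.8944
```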
Pick ``monte-carlo`` and set ``n = 10000``; the panel returns the probability
with a Hoeffding confidence band. Pin ``provsql.monte_carlo_seed = 42`` in
the Config panel and re-run: the result is now identical across runs. Toggle
the seed back to ``-1`` and re-run to see the band shift between runs.

.. figure:: /_static/casestudy6/cmp-circuit-mc-eval.png
   :alt: Studio Circuit mode showing the gate_times wrapper above the iota
         input gate (left) and the > gate_cmp (right); the cmp's children
         are the N(28,2) gate_rv and the constant 35.

   The provenance of one row from ``WHERE pm25 > 35``: a ``gate_times``
   (``⊗``) wraps the row's input token ``ι`` and a ``gate_cmp`` ``>`` whose
   children are the ``N(28, 2)`` leaf and the constant ``35``. The eval
   strip below switches to ``probability_evaluate`` and exposes the method
   picker.

Step 3: The Simplifier in Action
--------------------------------

The planner hook emits the comparator as a raw ``gate_cmp`` regardless of
what its operands look like. A *simplifier* pass then folds comparators
whose answer can be decided from the operand support alone — for example,
``U(10, 22) > 35`` is universally false because the uniform's upper bound is
below the threshold. The fold is controlled by ``provsql.simplify_on_load``
(default on), which the Config panel exposes under *Provenance*.

Click into row 2's auto-added ``provsql`` cell from the Step 2 result
(station ``s2``, ``pm25 ~ U(10, 22)``). With ``provsql.simplify_on_load``
on, the canvas shows a single ``𝟘`` (zero) gate: the simplifier resolved the
comparator to a constant-false leaf and dropped the whole subtree. Toggle
the GUC off in the Config panel and click the cell again: the canvas now
shows the raw construction shape, a ``gate_times`` (``⊗``) over the row's
input token ``ι`` and a ``gate_cmp`` (``>``) whose children are the
``U(10, 22)`` leaf and the constant ``35``.
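The rule behind this fold is pure interval arithmetic on supports. A minimal
Python sketch of the idea — an illustration of support-based comparator
folding, not ProvSQL's implementation (the function name and support
representation are hypothetical):

```python
def fold_greater(support, threshold):
    """Fold the comparator 'X > threshold' when the operand support decides it.

    support is a (lo, hi) pair; returns False when the whole support lies at
    or below the threshold (the gate collapses to the additive identity),
    True when it lies strictly above (the gate is universally true), and
    None when the threshold falls inside the support and the comparator must
    stay symbolic."""
    lo, hi = support
    if hi <= threshold:
        return False   # e.g. U(10, 22) > 35: collapses to the zero gate
    if lo > threshold:
        return True    # entire support above the threshold
    return None        # undecidable from the support: keep the gate_cmp

assert fold_greater((10, 22), 35) is False                       # row 2's case
assert fold_greater((float("-inf"), float("inf")), 35) is None   # N(28, 2) stays symbolic
```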
Both views are semantically identical; the simplified view is what the
semiring evaluators and the Monte-Carlo sampler actually consume.

.. figure:: /_static/casestudy6/simplify-before-after.png
   :alt: Side-by-side: on the left, the raw circuit for row 2's provenance
         with provsql.simplify_on_load off, showing a gate_times over an
         input token and a > gate_cmp whose children are U(10,22) and 35; on
         the right, the same circuit with the GUC on, collapsed to a single
         zero gate.

   Row 2's provenance for ``pm25 > 35`` with ``provsql.simplify_on_load``
   toggled off (left) vs on (right). The simplifier recognised that the
   upper bound of ``U(10, 22)`` is below the threshold, so the comparator is
   universally false and the whole subtree collapses to the additive
   identity ``𝟘``.

Step 4: Calibration via Mixtures
--------------------------------

Each station has a probability of being mis-calibrated; a mis-calibrated
unit over-reports by 20% (the *reading* it records is ``1.2`` times the true
value). The corrected estimate of the true reading is therefore ``pm25``
with probability ``p`` (the station is in spec) and ``pm25 / 1.2`` with
probability ``1 - p`` (the report needs to be scaled back). Express this as
a Bernoulli mixture:

.. code-block:: postgresql

   SELECT r.id, r.station_id,
          provsql.mixture(cs.p, r.pm25, r.pm25 / 1.2) AS pm25_calibrated
   FROM readings r
   JOIN calibration_status cs USING (station_id)
   WHERE r.station_id = 's1'

Click into a result row's ``pm25_calibrated`` cell. Circuit mode renders the
``gate_mixture`` as a ``Mix`` node with three labelled outgoing edges
(``p`` / ``x`` / ``y``) matching the SQL constructor's argument order: ``p``
points to the Bernoulli mixing probability, ``x`` to the in-spec arm, and
``y`` to the correction arm. The same node-inspector panel exposes
*Distribution profile* on the mixture root.
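The numbers on the ``Mix`` node's arms can be reproduced by hand: dividing a
Normal by a constant is again Normal (the affine rule applied to the ``y``
arm), and the mixture's mean is the probability-weighted mean of the two
arms. A short arithmetic sketch, using the ``N(28, 2)`` reading and the 95%
in-spec probability from this step:

```python
# Affine rule: if X ~ N(mu, sigma), then X / c ~ N(mu / c, sigma / c).
mu, sigma, c = 28.0, 2.0, 1.2
mu_y, sigma_y = mu / c, sigma / c
print(round(mu_y, 2), round(sigma_y, 3))   # 23.33 1.667 -- the folded y arm

# Mixture mean: E[Mix] = p * E[x-arm] + (1 - p) * E[y-arm].
p = 0.95
mix_mean = p * mu + (1 - p) * mu_y
print(round(mix_mean, 3))                  # 27.767 -- slightly below 28
```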
Because station ``s1`` is in spec 95% of the time, the histogram is
dominated by the ``N(28, 2)`` arm, and the out-of-spec ``N(23.33, 1.667)``
contributes only a small left shoulder rather than a visually distinct
second mode; the panel headline reflects this with a mixture mean slightly
below 28. To see clear bimodality, re-run the query with a larger
calibration error — e.g. replace ``r.pm25 / 1.2`` with ``r.pm25 / 2.0`` so
the out-of-spec arm folds to ``N(14, 1)``, well separated from the in-spec
``N(28, 2)``; the two peaks then show up distinctly on the histogram even at
the 95%/5% weighting.

.. figure:: /_static/casestudy6/mixture-node.png
   :alt: Mix node with three labelled outgoing edges (p, x, y); the p child
         is a 95% Bernoulli, the x child is the N(28,2) reading, and the y
         child is the back-scaled N(23.33,1.667) folded by the simplifier.

   The ``gate_mixture`` for the calibrated reading. The ``95%`` child is the
   Bernoulli probability that station ``s1`` is in spec; the ``x`` arm is
   the raw reading ``N(28, 2)``; the ``y`` arm is the back-scaled estimate
   ``pm25 / 1.2``, which the simplifier folded through the Normal
   affine-shift rule into a single ``gate_rv`` ``N(23.33, 1.667)``. Circle
   labels show four significant figures; the inspector pinned to either
   child surfaces the full-precision parameters.

Step 5: Aggregation Over Random Variables
-----------------------------------------

Compute the average :math:`PM_{2.5}` per district:

.. code-block:: postgresql

   SELECT s.district,
          avg(r.pm25) AS avg_pm25,
          sum(r.pm25) AS total_pm25
   FROM readings r
   JOIN stations s ON s.id = r.station_id
   GROUP BY s.district

Click into a row's ``avg_pm25`` cell. Circuit mode shows the :sqlfunc:`avg`
lowering: a ``gate_arith(DIV, num, denom)`` over two ``gate_arith(PLUS, …)``
subtrees, each child a per-row ``gate_mixture`` produced by
``rv_aggregate_semimod``.
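The lowering can be pictured as a Monte-Carlo mock in plain Python: each
replicate draws every row's inclusion event and reading, sums contributions
into a numerator and a denominator, and divides at the end. This is an
illustration of the circuit's shape, not ProvSQL's evaluator; the row set
and inclusion probabilities below are hypothetical, though the leaf
distributions echo the fixture's (``N(28, 2)``, ``U(10, 22)``,
``N(40, 4)``):

```python
import random

rng = random.Random(42)

# One Monte-Carlo replicate of avg(pm25) for a district. Each row
# contributes (reading, 1) when its provenance event fires and (0, 0)
# otherwise -- mirroring the per-row mixtures feeding both PLUS subtrees.
rows = [
    (lambda: rng.gauss(28, 2), 0.95),    # (sampler, P(row included)) -- illustrative
    (lambda: rng.uniform(10, 22), 0.95),
    (lambda: rng.gauss(40, 4), 0.90),
]

def avg_replicate():
    num = den = 0.0
    for draw, p_in in rows:
        if rng.random() < p_in:   # the row's provenance event
            num += draw()         # numerator PLUS subtree of mixtures
            den += 1.0            # denominator PLUS subtree of one-mixtures
    return num / den if den else None  # the outer DIV gate

means = [m for m in (avg_replicate() for _ in range(20000)) if m is not None]
print(round(sum(means) / len(means), 1))  # inclusion-weighted district average
```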
The right child of the outer division is the count of *included* rows under
their per-row provenance: rows whose provenance is false contribute the
additive identity to both numerator and denominator. Run *Distribution
profile* on the root: the panel shows the per-district average as a tight
distribution centred at the inclusion-weighted mean.

.. figure:: /_static/casestudy6/sum-of-mixtures.png
   :alt: DAG with a single root division node, two PLUS subtrees under it,
         eight Mix nodes at the next level, and the gate_rv leaves (N(28,2),
         N(40,4), U(12,24), U(10,22)) plus iota input gates at the bottom.

   The ``avg(pm25)`` cell for the *centre* district lowers to
   ``gate_arith(DIV, gate_arith(PLUS, mixtures), gate_arith(PLUS,
   one-mixtures))``. The eight mixtures correspond to the four stations ×
   two timestamps that fall in the district; the ``gate_rv`` leaves at the
   bottom are the per-reading distributions; the ``ι`` leaves anchor each
   row's provenance.

Step 6: Conditional Inference
-----------------------------

Re-open the filtered query from Step 2:

.. code-block:: postgresql

   SELECT id, station_id, ts, pm25
   FROM readings
   WHERE pm25 > 35 AND station_id = 's1'

Click a result row's ``pm25`` cell. The eval strip's :guilabel:`Condition
on` text input auto-presets to the row's provenance UUID, and the
:guilabel:`Conditioned by:` badge underneath the input is active. Pick
*Distribution profile* and run: the histogram now shows the *truncated*
shape, restricted to the tail above ``35``. Pick *Moment* with ``k = 1`` and
``raw``: the panel returns the closed-form Mills-ratio mean of the truncated
normal, exactly :math:`\mu + \sigma \cdot \frac{\phi(\alpha)}{1 -
\Phi(\alpha)}` with :math:`\alpha = (35 - \mu)/\sigma`. Click the active
badge to clear the conditioning; the panel reverts to the unconditional
mean :math:`\mu`. Click the muted badge to restore the row provenance.
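The closed form is easy to verify directly. For ``X ~ N(μ, σ)`` conditioned
on ``X > t``, with :math:`\alpha = (t - \mu)/\sigma` and
:math:`\lambda(\alpha) = \phi(\alpha)/(1 - \Phi(\alpha))`, the truncated
mean is :math:`\mu + \sigma\lambda(\alpha)` and the truncated variance is
:math:`\sigma^2 (1 + \alpha\lambda(\alpha) - \lambda(\alpha)^2)`. A
standard-library check against the ``N(40, 4)`` row from this step:

```python
from statistics import NormalDist
import math

def truncated_normal_stats(mu, sigma, t):
    """Mean and sd of N(mu, sigma) conditioned on exceeding t (Mills ratio)."""
    std = NormalDist()
    alpha = (t - mu) / sigma
    lam = std.pdf(alpha) / (1.0 - std.cdf(alpha))   # Mills-ratio term
    mean = mu + sigma * lam
    var = sigma**2 * (1.0 + alpha * lam - lam**2)
    return mean, math.sqrt(var)

mean, sd = truncated_normal_stats(40, 4, 35)
print(round(mean, 2), round(sd, 2))   # 40.82 3.35
```

These are exactly the headline numbers Studio reports for the conditioned
*Distribution profile* panel.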
The closed-form truncation table covers Normal (Mills ratio), Uniform
(intersected support), and Exponential (memorylessness on a lower bound, or
finite-interval truncation). For other shapes, the joint circuit between
``pm25`` and the row's provenance is loaded with shared ``gate_rv`` leaves
correctly coupled, and the conditional moment is estimated by rejection
sampling at budget ``provsql.rv_mc_samples``.

.. figure:: /_static/casestudy6/condition-on-active.png
   :alt: The Distribution profile panel showing supp [35, +infinity), mu
         approximately 40.82, sigma approximately 3.35; the Condition on
         input is populated with the row's provenance UUID and the
         Conditioned by badge is active.

   The conditional distribution profile for row 5 (``pm25 ∼ N(40, 4)``)
   under the event ``pm25 > 35``. Studio auto-presets the *Condition on*
   input with the row's provenance UUID and activates the *Conditioned by*
   badge; the panel's header reflects the truncated support ``[35, +∞)`` and
   the Mills-ratio mean ``μ ≈ 40.82``, ``σ ≈ 3.35`` (closed form on the
   truncated normal).

Step 7: Diagnostic Sampling
---------------------------

For raw inspection or downstream analytics, draw samples from the
conditional distribution. With the same row pinned and the *Conditioned by*
badge active, pick *Sample* from the *Distribution* group; set ``n = 200``
and run. The result panel shows a six-value inline preview with a "show full
list" expander; clicking it dumps all 200 samples.

For shapes that fall outside the closed-form table, the sampler falls back
to rejection sampling at the ``provsql.rv_mc_samples`` budget; if the
conditioning event is so unlikely that fewer than ``n`` samples land inside
that budget, the panel surfaces a hint pointing at the GUC, e.g. *Only 47
samples accepted within budget 10000; widen* ``provsql.rv_mc_samples`` *or
loosen the conditioning*. Re-running with a larger budget (set
``rv_mc_samples = 50000`` in the Config panel) recovers the full batch.
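The fallback's behaviour — including the under-budget hint — is easy to
picture as budgeted rejection sampling: draw from the unconditional
distribution, keep only draws where the conditioning event holds, and stop
once the budget of unconditional draws is spent. A plain-Python sketch (an
illustration of the technique, not ProvSQL's code; the function name and
message wording are hypothetical):

```python
import random

def conditional_sample(n, budget, draw, event, rng):
    """Draw up to n samples from `draw` conditioned on `event`, spending at
    most `budget` unconditional draws."""
    accepted = []
    for _ in range(budget):
        x = draw(rng)
        if event(x):
            accepted.append(x)
            if len(accepted) == n:
                break
    if len(accepted) < n:
        print(f"Only {len(accepted)} samples accepted within budget {budget}; "
              f"widen the budget or loosen the conditioning")
    return accepted

rng = random.Random(42)
# Condition N(28, 2) on the rare event 'reading > 35' (~2.3e-4 per draw):
# a 10000-draw budget accepts only a handful, so the hint fires.
samples = conditional_sample(200, 10000, lambda r: r.gauss(28, 2),
                             lambda x: x > 35, rng)
assert all(x > 35 for x in samples)   # every accepted sample satisfies the event
```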
Step 8: Combining Batches via UNION
-----------------------------------

Both batches share the same id space (rows ``1`` through ``8``, one per
``(station, timestamp)`` slot), so a ``UNION`` (without ``ALL``) over
``(station_id, id)`` deduplicates a slot to a single result row whose
provenance combines today's reading and yesterday's reading via the semiring
addition. With ``WHERE pm25 > 35`` lifted on each branch, each contributing
row carries a ``gate_cmp(pm25 > 35)`` of its own:

.. code-block:: postgresql

   (SELECT station_id, id FROM readings WHERE pm25 > 35)
   UNION
   (SELECT station_id, id FROM historical_readings WHERE pm25 > 35)
   ORDER BY station_id, id

Pick a result row's ``provsql`` cell: Circuit mode shows a ``gate_plus``
(``⊕``) over the two contributing inputs (today's row and yesterday's row),
each carrying its own ``gate_cmp(pm25 > 35)`` from the lifted ``WHERE``.
``probability_evaluate(provenance())`` on the result gives the probability
that *at least one* of the two days produced an Unhealthy reading for that
slot. We deliberately keep the ``random_variable`` column ``pm25`` out of
the ``SELECT``: there is no duplicate-elimination semantics for
``random_variable``, so a ``UNION`` over an RV column would have no
well-defined meaning.

Step 9: Filtering Grouped Random Variables by Expected Value
------------------------------------------------------------

Filter the per-district aggregates from Step 5 by their expected average.
Because :sqlfunc:`avg` over a ``random_variable`` column returns a
``random_variable`` (not an ``agg_token``), and :sqlfunc:`expected`
collapses it to a plain ``double``, the ``HAVING`` qual is deterministic
from the planner hook's perspective; the rewrite leaves it for PostgreSQL to
evaluate natively while still adding a ``delta(gate_agg)`` wrapper to each
surviving group's provenance:

.. code-block:: postgresql

   SELECT s.district, avg(r.pm25) AS avg_pm25
   FROM readings r
   JOIN stations s ON s.id = r.station_id
   GROUP BY s.district
   HAVING expected(avg(r.pm25)) > 20

The inner :sqlfunc:`avg` is recognised as a ``random_variable`` aggregate
(``gate_arith(DIV, …)`` over per-row ``gate_mixture`` children, as in Step
5); :sqlfunc:`expected` collapses the distribution to its mean (Monte Carlo
here, since the DIV gate has no closed-form evaluator); the ``> 20`` is a
plain comparison on a ``double``, so the row survives iff its expected
average exceeds the threshold. For the case-study fixture both districts
pass (centre at ≈ 25.5, east at ≈ 21.6); clicking either result row's
``provsql`` cell shows the ``delta(gate_agg)`` shape, identical to the
no-``HAVING`` aggregate from Step 5 but filtered to the surviving groups.

Step 10: Independent vs Monte Carlo
-----------------------------------

For threshold queries whose contributing rows have structurally independent
provenance, the ``'independent'`` probability method (see
:doc:`probabilities`) is *exact* and far cheaper than Monte Carlo. Compare
the two exact methods against ``monte-carlo`` on the Step 2 query:

.. code-block:: postgresql

   SELECT id,
          probability_evaluate(provenance(), 'independent') AS p_ind,
          probability_evaluate(provenance(), 'monte-carlo', '10000') AS p_mc,
          probability_evaluate(provenance(), 'tree-decomposition') AS p_td
   FROM readings
   WHERE pm25 > 35

Studio's eval strip exposes these methods directly; running each method
against the same pinned subnode shows the analytic ``independent`` and
``tree-decomposition`` returning the same value to full precision, while
``monte-carlo`` returns a Hoeffding-bounded estimate that tightens as ``n``
grows.

See :doc:`continuous-distributions` for the full surface and :doc:`studio`
for the Studio reference.
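The exact-vs-Monte-Carlo gap can be reproduced outside Studio. For one row's
event the exact answer is the closed-form Normal tail from Step 2, while a
Monte-Carlo estimate is a sample mean whose two-sided Hoeffding half-width
at confidence :math:`1 - \delta` is :math:`\sqrt{\ln(2/\delta)/(2n)}`. A
standard-library sketch of how the band tightens as ``n`` grows (the
``N(40, 4)`` leaf is the fixture's; the δ = 0.05 confidence level is an
assumption for illustration):

```python
import math
import random
from statistics import NormalDist

mu, sigma, t = 40.0, 4.0, 35.0
p_exact = 1.0 - NormalDist(mu, sigma).cdf(t)   # what 'independent' computes exactly

rng = random.Random(42)

def mc_estimate(n):
    """Naive Monte-Carlo estimate of P(N(mu, sigma) > t)."""
    hits = sum(rng.gauss(mu, sigma) > t for _ in range(n))
    return hits / n

def hoeffding_half_width(n, delta=0.05):
    """Two-sided Hoeffding half-width for a [0, 1]-bounded sample mean."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

for n in (1000, 10000, 100000):
    est, hw = mc_estimate(n), hoeffding_half_width(n)
    print(f"n={n:6d}  p_mc={est:.4f} ± {hw:.4f}  (exact {p_exact:.4f})")
```

The half-width shrinks as :math:`1/\sqrt{n}`, matching the band Studio
displays tightening between ``n = 1000`` and ``n = 100000`` runs.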