Case Study: The Wildlife Photo Archive ======================================== This case study, loosely inspired by a ProvSQL demonstration at EDBT 2025 :cite:`DBLP:conf/edbt/YunusKSAB25`, applies ProvSQL to a database of wildlife photographs annotated by a species-detection model. It demonstrates the ``VALUES`` clause, :sqlfunc:`repair_key` and the ``mulinput`` gate, ranking by probability versus thresholding, ``EXCEPT``, common table expressions, and :sqlfunc:`expected` aggregates. The Scenario ------------ A naturalist organisation maintains a database of wildlife photographs taken at four field stations in the Scottish Highlands. Each photo has been processed by a species-detection model that draws one or more *bounding boxes* around things it thinks are animals, and for each box reports a list of candidate species with a confidence score. A box can therefore appear with several species candidates (e.g. a partly-occluded shape might score 0.40 as red deer and 0.30 as roe deer); a photo can contain several boxes of the same species (e.g. three deer in a meadow shot). Your tasks: * find photos that contain specific combinations of species, * rank results by the probability that the combination is truly present, * compare probabilistic ranking against naive confidence thresholding, * exclude photos that contain unwanted species, * compute expected species counts per photo. Setup ----- This case study assumes a working ProvSQL installation (see :doc:`getting-provsql`). Download :download:`setup.sql <../../casestudy5/setup.sql>` and load it into a fresh PostgreSQL database:: psql -d mydb -f setup.sql This creates three tables: * ``photo`` – 30 wildlife photographs, each tagged with a station name (Loch Torridon, Glen Affric, Rannoch Moor, or Cairngorms) and a date * ``species`` – 13 species across mammals, birds, and reptiles * ``detection`` – about 60 model-produced (bounding-box, species) candidate rows, each linking a photo and a bounding-box index to a candidate species with a confidence score; multiple rows for the same (``photo_id``, ``bbox_id``) pair represent the classifier's alternative species hypotheses for that single bounding box Step 1: Explore the Database ----------------------------- At the start of every session, set the search path: .. code-block:: postgresql SET search_path TO public, provsql; Inspect the tables. Note that ``detection`` is *not* keyed on (``photo_id``, ``bbox_id``): a single bounding box can appear in several rows, one per candidate species the classifier considered. .. code-block:: postgresql SELECT * FROM photo ORDER BY id LIMIT 5; SELECT * FROM species ORDER BY id; SELECT photo_id, bbox_id, species_id, confidence FROM detection WHERE photo_id IN (5, 9, 14, 22) ORDER BY photo_id, bbox_id, confidence DESC; Step 2: Enable Provenance and Create a Name Mapping ---------------------------------------------------- Enable provenance tracking on ``detection``. Each row receives a UUID circuit token that propagates through any downstream query. .. code-block:: postgresql SELECT add_provenance('detection'); To get readable formulas, we want to associate each detection's provenance token with its species name. A *provenance mapping* in ProvSQL is nothing more than a regular table with two columns named ``value`` and ``provenance`` (plus, for performance, an index on ``provenance``). The convenience function :sqlfunc:`create_provenance_mapping` builds such a table from one column of a provenance-enabled relation, but nothing prevents us from constructing the table by hand: .. code-block:: postgresql CREATE TABLE species_mapping AS SELECT s.name AS value, d.provsql AS provenance FROM detection d JOIN species s ON s.id = d.species_id; SELECT remove_provenance('species_mapping'); CREATE INDEX ON species_mapping(provenance); The ``CREATE TABLE AS`` query inherits a ``provsql`` column from ``detection`` via ProvSQL's planner hook; :sqlfunc:`remove_provenance` strips that extra column so only the ``(value, provenance)`` pair remains. Because the schema is fully under our control, we can populate the table from any expression – combine columns, filter rows, derive computed values – and any semiring-evaluation function (``sr_formula``, ``sr_why``, …) will happily consume the result. Step 3: Inline Lookup with ``VALUES`` --------------------------------------- A ``VALUES`` clause defines an inline relation directly inside a query. ProvSQL's planner hook treats it like any other source: rows it produces have no provenance of their own, but they carry through joins so that the result inherits provenance from the joined provenance-enabled rows. Here we use ``VALUES`` to define a small ad-hoc watchlist: a couple of species we want to look up by hand together with a human-readable label, instead of pulling them from the ``species`` table. Suppose we are interested in Red Deer (``species_id`` 1, the dominant grazer whose density we want to track) and Red Fox (``species_id`` 3, a generalist predator), and we want to tag the rows accordingly: .. code-block:: postgresql SELECT v.label, p.id, p.station, p.date, sr_formula(provenance(), 'species_mapping') AS formula FROM (VALUES (1, 'mammal of interest'), (3, 'predator of interest')) AS v(species_id, label), detection d, photo p WHERE d.species_id = v.species_id AND d.photo_id = p.id ORDER BY p.id, v.label; Each output row carries the provenance of the underlying ``detection`` row alone: the formula is a single species token, even though the row also references ``photo`` and ``VALUES``. Tables without provenance – including the ``VALUES`` rows – contribute no tokens. Step 4: Conjunctive Query (Naive) ----------------------------------- Find photos that contain both Red Deer (``species_id`` 1) and Red Fox (``species_id`` 3) using a self-join on ``detection``: .. code-block:: postgresql SELECT p.id, p.station, p.date, sr_formula(provenance(), 'species_mapping') AS formula FROM detection d1 JOIN detection d2 ON d1.photo_id = d2.photo_id JOIN photo p ON p.id = d1.photo_id WHERE d1.species_id = 1 AND d2.species_id = 3 GROUP BY p.id, p.station, p.date ORDER BY p.id; Look at photo 5: the classifier produced three Red Deer candidate rows (in three different bounding boxes) and two Red Fox candidate rows (in two more boxes). Its formula is the ⊕-sum of all six (deer, fox) pair products – every candidate row is an independent input gate. This matches the structure of the underlying ``detection`` table but mis-models the data: each bounding box can correspond to *at most one real animal*, so candidate rows that share a ``(photo_id, bbox_id)`` pair should be mutually exclusive rather than independent. Nothing in the schema enforces that today, and the formula reflects the mismatch. Step 5: Mutually Exclusive Candidates with ``repair_key`` ------------------------------------------------------------ :sqlfunc:`repair_key` rewrites the provenance so that rows sharing a key become alternatives under a single ``mulinput`` (multivalued input) gate – i.e. *exactly one* of them is true. Applied with the key ``(photo_id, bbox_id)``, every bounding box becomes one mulinput variable whose values are the candidate species the classifier considered for that box. :sqlfunc:`repair_key` takes a single key attribute, so add a synthetic key combining photo and bounding-box index first. ``repair_key`` reinstalls the ``provsql`` column itself, so also drop the old mapping (whose tokens are about to become stale): .. code-block:: postgresql DROP TABLE species_mapping; SELECT remove_provenance('detection'); ALTER TABLE detection ADD COLUMN photo_bbox text; UPDATE detection SET photo_bbox = photo_id || '/' || bbox_id; SELECT repair_key('detection', 'photo_bbox'); CREATE TABLE species_mapping AS SELECT s.name AS value, d.provsql AS provenance FROM detection d JOIN species s ON s.id = d.species_id; SELECT remove_provenance('species_mapping'); CREATE INDEX ON species_mapping(provenance); Re-running the conjunctive query from Step 4 with :sqlfunc:`sr_formula` would not be illuminating: mutually exclusive events have no meaningful representation in the symbolic-formula semiring (each ``mulinput`` just collapses to ``𝟙``). To visualize them we use :sqlfunc:`sr_boolexpr` instead, which renders the underlying Boolean formula with internal variable names and exposes each ``mulinput`` explicitly: .. code-block:: postgresql SELECT p.id, sr_boolexpr(provenance()) AS bexpr FROM detection d1 JOIN detection d2 ON d1.photo_id = d2.photo_id JOIN photo p ON p.id = d1.photo_id WHERE d1.species_id = 1 AND d2.species_id = 3 AND p.id IN (2, 5) GROUP BY p.id ORDER BY p.id; Each input now appears as a ``mulinput`` (the ``{i=v}[p]`` notation denotes "variable ``i`` takes value ``v`` with probability ``p``"). In this query every variable happens to have a single value, so the mutually-exclusive structure is not visible yet. We will see a genuine multi-valued ``mulinput`` in the next step, where the deer/roe-deer candidates of one bounding box compose under the mutex constraint. .. note:: :sqlfunc:`view_circuit` cannot render ``mulinput`` gates either and refuses to evaluate. Use :sqlfunc:`sr_boolexpr` to inspect circuits that contain ``mulinput`` gates, and :sqlfunc:`probability_evaluate` to score them. Step 6: Assign Probabilities and Verify Mutual Exclusion ---------------------------------------------------------- Each candidate row's confidence becomes the probability that that classifier candidate is the true species for its bounding box: .. code-block:: postgresql DO $$ BEGIN PERFORM set_prob(provenance(), confidence) FROM detection; END $$; To see that ``repair_key`` made a numerical difference, ask: *what is the probability that bounding box 1 of photo 5 corresponds to a deer-like animal* (``species_id`` 1 = Red Deer or 2 = Roe Deer)? In the data, that bounding box has both candidate species recorded with confidences 0.40 and 0.30: .. code-block:: postgresql SELECT photo_id, bbox_id, sr_boolexpr(provenance()) AS bexpr, ROUND(probability_evaluate(provenance(), 'tree-decomposition')::numeric, 4) AS p FROM detection WHERE photo_id = 5 AND bbox_id = 1 AND species_id IN (1, 2) GROUP BY photo_id, bbox_id; The Boolean expression is now ``({0=1}[0.400000] ∨ {0=2}[0.300000])``: a single mulinput variable (variable 0) with two mutually exclusive values, value 1 (Red Deer) with probability 0.40 and value 2 (Roe Deer) with probability 0.30. Probability evaluation gives ``0.7000``, the sum of the two confidences, since combining mutually exclusive events with ⊕ is just addition. Had we kept the original :sqlfunc:`add_provenance` setup with each row as an independent input gate, the same query would have given ``1 - (1 - 0.40) × (1 - 0.30) = 0.58`` instead. The 0.12 gap is the practical effect of telling the engine "these candidates cannot both be true at once". Step 7: Probabilistic Ranking vs. Threshold Filtering ------------------------------------------------------- Run the conjunctive query under two ranking strategies. First, by probability that *both* species are truly present: .. code-block:: postgresql SELECT p.id, p.station, p.date, ROUND(probability_evaluate(provenance())::numeric, 4) AS prob FROM detection d1 JOIN detection d2 ON d1.photo_id = d2.photo_id JOIN photo p ON p.id = d1.photo_id WHERE d1.species_id = 1 AND d2.species_id = 3 GROUP BY p.id, p.station, p.date ORDER BY prob DESC, p.id; Second, by raw confidence threshold (every contributing detection must score at least 0.5): .. code-block:: postgresql SELECT DISTINCT p.id, p.station, p.date FROM detection d1 JOIN detection d2 ON d1.photo_id = d2.photo_id JOIN photo p ON p.id = d1.photo_id WHERE d1.species_id = 1 AND d2.species_id = 3 AND d1.confidence >= 0.5 AND d2.confidence >= 0.5 ORDER BY p.id; Photo 5 is missed by thresholding (every individual candidate there scores below 0.5) but ranks reasonably under :sqlfunc:`probability_evaluate`, because the weak deer candidates in several boxes plus the weak fox candidates combine into a non-trivial probability that *some* deer and *some* fox are truly there. Conversely, a photo whose top candidates only barely cross 0.5 passes the threshold but ends up low in the probability ranking. Step 8: Absence Constraint with ``EXCEPT`` -------------------------------------------- Find photos that contain a Red Deer but no Domestic Dog (``species_id`` 13). ``EXCEPT`` is implemented in ProvSQL via the ⊖ (monus) operator on the provenance circuit: .. code-block:: postgresql SELECT p.id, p.station, p.date, ROUND(probability_evaluate(provenance())::numeric, 4) AS prob FROM ( SELECT photo_id FROM detection WHERE species_id = 1 EXCEPT SELECT photo_id FROM detection WHERE species_id = 13 ) t JOIN photo p ON p.id = t.photo_id GROUP BY p.id, p.station, p.date ORDER BY prob DESC, p.id; Photos that contain a dog still appear in the output – ``EXCEPT`` is *not* a hard filter. Photo 9, with a high-confidence dog detection, ranks lower because the monus discounts strongly. Photo 14, where the dog detection has very low confidence, ranks higher: it is *probably* in the result, but not certainly. ProvSQL preserves both possibilities in the circuit and lets :sqlfunc:`probability_evaluate` weigh them. Step 9: Multi-Condition Query via a CTE ----------------------------------------- Combine Steps 7 and 8: photos with both Red Deer and Red Fox, with no Domestic Dog, ranked by probability. The query has three logical layers (co-occurrence, absence, ranking) and reads naturally as a CTE: .. code-block:: postgresql WITH deer_and_fox AS ( SELECT d1.photo_id FROM detection d1 JOIN detection d2 ON d1.photo_id = d2.photo_id WHERE d1.species_id = 1 AND d2.species_id = 3 GROUP BY d1.photo_id ), no_dogs AS ( SELECT photo_id FROM deer_and_fox EXCEPT SELECT photo_id FROM detection WHERE species_id = 13 ) SELECT p.id, p.station, p.date, ROUND(probability_evaluate(provenance())::numeric, 4) AS prob FROM no_dogs t JOIN photo p ON p.id = t.photo_id ORDER BY prob DESC, p.id; ProvSQL's planner hook fires on the expanded query: CTEs are inlined and provenance propagates through them transparently. The same answer can be written with nested subqueries; the CTE form is purely a readability choice. Step 10: Expected Species Counts with :sqlfunc:`expected` ----------------------------------------------------------- How many distinct detections do we *expect* to be true positives in each photo? .. code-block:: postgresql SELECT p.id, p.station, ROUND(expected(COUNT(*))::numeric, 4) AS exp_detections FROM detection d JOIN photo p ON p.id = d.photo_id GROUP BY p.id, p.station ORDER BY exp_detections DESC, p.id; By linearity of expectation, ``expected(COUNT(*))`` over a group is :math:`\sum_i P(\text{detection}_i \text{ is true})`. The same linearity applies to ``SUM`` aggregates: the expected total confidence mass per photo: .. code-block:: postgresql SELECT p.id, p.station, ROUND(expected(SUM(d.confidence))::numeric, 4) AS exp_total_conf FROM detection d JOIN photo p ON p.id = d.photo_id GROUP BY p.id, p.station ORDER BY exp_total_conf DESC, p.id; Both queries use ProvSQL's :sqlfunc:`expected` operator, which computes the expected value of a SQL aggregate over the probabilistic database defined by the per-row probabilities set in Step 6. Photos with many high-confidence detections rank highest on both metrics.