{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs5", "generated_from": "doc/source/user/casestudy5.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: Wildlife Photo Archive\n", "\n", "This case study, loosely inspired by a ProvSQL demonstration at EDBT 2025, applies ProvSQL to a database of wildlife photographs annotated by a species-detection model. It demonstrates the `VALUES` clause, `repair_key` and the `mulinput` gate, ranking by probability versus thresholding, `EXCEPT`, common table expressions, and `expected` aggregates.\n", "\n", "## The Scenario\n", "\n", "A naturalist organisation maintains a database of wildlife photographs taken at four field stations in the Scottish Highlands. Each photo has been processed by a species-detection model that draws one or more *bounding boxes* around things it thinks are animals, and for each box reports a list of candidate species with a confidence score. A box can therefore appear with several species candidates (e.g. a partly-occluded shape might score 0.40 as red deer and 0.30 as roe deer); a photo can contain several boxes of the same species (e.g. three deer in a meadow shot).\n", "\n", "Your tasks:\n", "\n", "- find photos that contain specific combinations of species,\n", "- rank results by the probability that the combination is truly present,\n", "- compare probabilistic ranking against naive confidence thresholding,\n", "- exclude photos that contain unwanted species,\n", "- compute expected species counts per photo.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS photo CASCADE;\n", "CREATE TABLE photo (\n", " id integer PRIMARY KEY,\n", " station text NOT NULL,\n", " date date NOT NULL,\n", " filename text NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS species CASCADE;\n", "CREATE TABLE species (\n", " id integer PRIMARY KEY,\n", " name text NOT NULL,\n", " category text NOT NULL -- 'mammal', 'bird', 'reptile'\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS detection CASCADE;\n", "CREATE TABLE detection (\n", " photo_id integer NOT NULL,\n", " bbox_id integer NOT NULL,\n", " species_id integer NOT NULL,\n", " confidence double precision NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE photo;\n", "COPY photo (id, station, date, filename) FROM stdin;\n", "1\tLoch Torridon\t2024-06-15\tLT-20240615-0001.jpg\n", "2\tLoch Torridon\t2024-06-15\tLT-20240615-0002.jpg\n", "3\tLoch Torridon\t2024-07-02\tLT-20240702-0001.jpg\n", "4\tLoch Torridon\t2024-07-15\tLT-20240715-0001.jpg\n", "5\tLoch Torridon\t2024-08-03\tLT-20240803-0001.jpg\n", "6\tLoch Torridon\t2024-08-12\tLT-20240812-0001.jpg\n", "7\tLoch Torridon\t2024-08-20\tLT-20240820-0001.jpg\n", "8\tLoch Torridon\t2024-09-01\tLT-20240901-0001.jpg\n", "9\tGlen Affric\t2024-06-20\tGA-20240620-0001.jpg\n", "10\tGlen Affric\t2024-07-04\tGA-20240704-0001.jpg\n", "11\tGlen Affric\t2024-07-15\tGA-20240715-0001.jpg\n", "12\tGlen Affric\t2024-08-08\tGA-20240808-0001.jpg\n", "13\tGlen Affric\t2024-08-25\tGA-20240825-0001.jpg\n", "14\tGlen Affric\t2024-09-10\tGA-20240910-0001.jpg\n", "15\tGlen Affric\t2024-09-22\tGA-20240922-0001.jpg\n", "16\tRannoch Moor\t2024-06-10\tRM-20240610-0001.jpg\n", "17\tRannoch Moor\t2024-06-25\tRM-20240625-0001.jpg\n", "18\tRannoch Moor\t2024-07-12\tRM-20240712-0001.jpg\n", "19\tRannoch Moor\t2024-07-30\tRM-20240730-0001.jpg\n", "20\tRannoch Moor\t2024-08-15\tRM-20240815-0001.jpg\n", "21\tRannoch Moor\t2024-09-02\tRM-20240902-0001.jpg\n", "22\tRannoch Moor\t2024-09-15\tRM-20240915-0001.jpg\n", "23\tCairngorms\t2024-06-18\tCG-20240618-0001.jpg\n", "24\tCairngorms\t2024-07-01\tCG-20240701-0001.jpg\n", "25\tCairngorms\t2024-07-20\tCG-20240720-0001.jpg\n", "26\tCairngorms\t2024-08-05\tCG-20240805-0001.jpg\n", "27\tCairngorms\t2024-08-22\tCG-20240822-0001.jpg\n", "28\tCairngorms\t2024-09-08\tCG-20240908-0001.jpg\n", "29\tCairngorms\t2024-09-20\tCG-20240920-0001.jpg\n", "30\tCairngorms\t2024-10-01\tCG-20241001-0001.jpg\n", "\\." ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE species;\n", "COPY species (id, name, category) FROM stdin;\n", "1\tRed Deer\tmammal\n", "2\tRoe Deer\tmammal\n", "3\tRed Fox\tmammal\n", "4\tMountain Hare\tmammal\n", "5\tPine Marten\tmammal\n", "6\tRed Squirrel\tmammal\n", "7\tGolden Eagle\tbird\n", "8\tOsprey\tbird\n", "9\tCapercaillie\tbird\n", "10\tAdder\treptile\n", "11\tCommon Lizard\treptile\n", "12\tWildcat\tmammal\n", "13\tDomestic Dog\tmammal\n", "\\." ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Detections. Each row is one (bounding-box, species) candidate produced\n", "-- by the classifier; multiple rows for the same (photo_id, bbox_id) pair\n", "-- represent alternative species hypotheses for that single bounding box.\n", "-- Photo 5: low confidences across multiple bboxes, with one ambiguous box\n", "-- (deer or roe deer); motivates probabilistic ranking in Step 7.\n", "-- Photo 9: deer + high-confidence dog (excluded from \"deer, no dogs\").\n", "-- Photo 14: deer + low-confidence near-miss dog (interesting EXCEPT case).\n", "-- Photo 22: ambiguous box plus several clear ones (mixed herd at Rannoch).\n", "TRUNCATE detection;\n", "COPY detection (photo_id, bbox_id, species_id, confidence) FROM stdin;\n", "1\t1\t1\t0.78\n", "1\t2\t1\t0.65\n", "1\t3\t1\t0.71\n", "2\t1\t1\t0.85\n", "2\t2\t3\t0.72\n", "3\t1\t3\t0.68\n", "3\t2\t6\t0.55\n", "4\t1\t5\t0.81\n", "5\t1\t1\t0.40\n", "5\t1\t2\t0.30\n", "5\t2\t1\t0.42\n", "5\t3\t1\t0.38\n", "5\t4\t3\t0.43\n", "5\t5\t3\t0.48\n", "6\t1\t7\t0.91\n", "7\t1\t1\t0.76\n", "7\t2\t1\t0.62\n", "7\t3\t1\t0.59\n", "7\t4\t2\t0.58\n", "8\t1\t9\t0.83\n", "9\t1\t1\t0.79\n", "9\t2\t13\t0.92\n", "10\t1\t1\t0.73\n", "10\t2\t3\t0.66\n", "11\t1\t5\t0.74\n", "11\t2\t6\t0.62\n", "12\t1\t3\t0.81\n", "12\t2\t4\t0.55\n", "13\t1\t12\t0.71\n", "14\t1\t1\t0.84\n", "14\t2\t13\t0.18\n", "15\t1\t7\t0.86\n", "15\t2\t8\t0.79\n", "16\t1\t10\t0.62\n", "16\t2\t11\t0.58\n", "17\t1\t3\t0.74\n", "17\t2\t4\t0.66\n", "18\t1\t1\t0.61\n", "18\t2\t3\t0.55\n", "19\t1\t9\t0.78\n", "20\t1\t7\t0.89\n", "21\t1\t11\t0.53\n", "21\t2\t10\t0.49\n", "22\t1\t1\t0.82\n", "22\t2\t1\t0.71\n", "22\t3\t3\t0.77\n", "22\t3\t1\t0.20\n", "22\t4\t3\t0.64\n", "22\t5\t2\t0.69\n", "22\t6\t2\t0.55\n", "23\t1\t5\t0.71\n", "23\t2\t6\t0.68\n", "24\t1\t1\t0.74\n", "24\t2\t4\t0.62\n", "25\t1\t7\t0.88\n", "25\t2\t9\t0.74\n", "26\t1\t12\t0.65\n", "27\t1\t8\t0.83\n", "28\t1\t1\t0.76\n", "28\t2\t3\t0.69\n", "29\t1\t5\t0.79\n", "30\t1\t6\t0.71\n", "30\t2\t4\t0.59\n", "\\." ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates three tables:\n", "\n", "- `photo` – 30 wildlife photographs, each tagged with a station name (Loch Torridon, Glen Affric, Rannoch Moor, or Cairngorms) and a date\n", "- `species` – 13 species across mammals, birds, and reptiles\n", "- `detection` – about 60 model-produced (bounding-box, species) candidate rows, each linking a photo and a bounding-box index to a candidate species with a confidence score; multiple rows for the same (`photo_id`, `bbox_id`) pair represent the classifier's alternative species hypotheses for that single bounding box\n", "\n", "## Step 1: Explore the Database\n", "\n", "Inspect the tables. Note that `detection` is *not* keyed on (`photo_id`, `bbox_id`): a single bounding box can appear in several rows, one per candidate species the classifier considered." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT * FROM photo ORDER BY id LIMIT 5;\n", "SELECT * FROM species ORDER BY id;\n", "SELECT photo_id, bbox_id, species_id, confidence\n", "FROM detection\n", "WHERE photo_id IN (5, 9, 14, 22)\n", "ORDER BY photo_id, bbox_id, confidence DESC;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Enable Provenance and Create a Name Mapping\n", "\n", "Enable provenance tracking on `detection`. Each row receives a UUID circuit token that propagates through any downstream query." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('detection');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get readable formulas, we want to associate each detection's provenance token with its species name. A *provenance mapping* in ProvSQL is nothing more than a regular table with two columns named `value` and `provenance` (plus, for performance, an index on `provenance`). The convenience function `create_provenance_mapping` builds such a table from one column of a provenance-enabled relation, but nothing prevents us from constructing the table by hand:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS species_mapping;\n", "CREATE TABLE species_mapping AS\n", " SELECT s.name AS value, d.provsql AS provenance\n", " FROM detection d JOIN species s ON s.id = d.species_id;\n", "\n", "SELECT remove_provenance('species_mapping');\n", "CREATE INDEX ON species_mapping(provenance);" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `CREATE TABLE AS` query inherits a `provsql` column from `detection` via ProvSQL's planner hook; `remove_provenance` strips that extra column so only the `(value, provenance)` pair remains. Because the schema is fully under our control, we can populate the table from any expression – combine columns, filter rows, derive computed values – and any semiring-evaluation function (`sr_formula`, `sr_why`…) will happily consume the result.\n", "\n", "## Step 3: Inline Lookup with `VALUES`\n", "\n", "A `VALUES` clause defines an inline relation directly inside a query. ProvSQL's planner hook treats it like any other source: rows it produces have no provenance of their own, but they carry through joins so that the result inherits provenance from the joined provenance-enabled rows.\n", "\n", "Here we use `VALUES` to define a small ad-hoc watchlist: a couple of species we want to look up by hand together with a human-readable label, instead of pulling them from the `species` table. Suppose we are interested in Red Deer (`species_id` 1, the dominant grazer whose density we want to track) and Red Fox (`species_id` 3, a generalist predator), and we want to tag the rows accordingly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT v.label, p.id, p.station, p.date,\n", " sr_formula(provenance(), 'species_mapping') AS formula\n", "FROM (VALUES (1, 'mammal of interest'),\n", " (3, 'predator of interest')) AS v(species_id, label),\n", " detection d, photo p\n", "WHERE d.species_id = v.species_id AND d.photo_id = p.id\n", "ORDER BY p.id, v.label;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each output row carries the provenance of the underlying `detection` row alone: the formula is a single species token, even though the row also references `photo` and `VALUES`. Tables without provenance – including the `VALUES` rows – contribute no tokens.\n", "\n", "## Step 4: Conjunctive Query (Naive)\n", "\n", "Find photos that contain both Red Deer (`species_id` 1) and Red Fox (`species_id` 3) using a self-join on `detection`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.station, p.date,\n", " sr_formula(provenance(), 'species_mapping') AS formula\n", "FROM detection d1\n", "JOIN detection d2 ON d1.photo_id = d2.photo_id\n", "JOIN photo p ON p.id = d1.photo_id\n", "WHERE d1.species_id = 1 AND d2.species_id = 3\n", "GROUP BY p.id, p.station, p.date\n", "ORDER BY p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at photo 5: the classifier produced three Red Deer candidate rows (in three different bounding boxes) and two Red Fox candidate rows (in two more boxes). Its formula is the ⊕-sum of all six (deer, fox) pair products – every candidate row is an independent input gate. This matches the structure of the underlying `detection` table but mis-models the data: each bounding box can correspond to *at most one real animal*, so candidate rows that share a `(photo_id, bbox_id)` pair should be mutually exclusive rather than independent. Nothing in the schema enforces that today, and the formula reflects the mismatch.\n", "\n", "## Step 5: Mutually Exclusive Candidates with `repair_key`\n", "\n", "`repair_key` rewrites the provenance so that rows sharing a key become alternatives under a single `mulinput` (multivalued input) gate – i.e. *exactly one* of them is true. Applied with the key `(photo_id, bbox_id)`, every bounding box becomes one mulinput variable whose values are the candidate species the classifier considered for that box.\n", "\n", "`repair_key` takes a single key attribute, so add a synthetic key combining photo and bounding-box index first. `repair_key` reinstalls the `provsql` column itself, so also drop the old mapping (whose tokens are about to become stale):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE species_mapping;\n", "SELECT remove_provenance('detection');\n", "\n", "ALTER TABLE detection ADD COLUMN IF NOT EXISTS photo_bbox text;\n", "UPDATE detection SET photo_bbox = photo_id || '/' || bbox_id;\n", "\n", "SELECT repair_key('detection', 'photo_bbox');\n", "\n", "DROP TABLE IF EXISTS species_mapping;\n", "CREATE TABLE species_mapping AS\n", " SELECT s.name AS value, d.provsql AS provenance\n", " FROM detection d JOIN species s ON s.id = d.species_id;\n", "SELECT remove_provenance('species_mapping');\n", "CREATE INDEX ON species_mapping(provenance);" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Re-running the conjunctive query from Step 4 with `sr_formula` would not be illuminating: mutually exclusive events have no meaningful representation in the symbolic-formula semiring (each `mulinput` just collapses to `𝟙`). To visualize them we use `sr_boolexpr` instead, which renders the underlying Boolean formula with internal variable names and exposes each `mulinput` explicitly:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, sr_boolexpr(provenance()) AS bexpr\n", "FROM detection d1\n", "JOIN detection d2 ON d1.photo_id = d2.photo_id\n", "JOIN photo p ON p.id = d1.photo_id\n", "WHERE d1.species_id = 1 AND d2.species_id = 3 AND p.id IN (2, 5)\n", "GROUP BY p.id\n", "ORDER BY p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each input now appears as a `mulinput` (the `{i=v}[p]` notation denotes \"variable `i` takes value `v` with probability `p`\"). In this query every variable happens to have a single value, so the mutually-exclusive structure is not visible yet. We will see a genuine multi-valued `mulinput` in the next step, where the deer/roe-deer candidates of one bounding box compose under the mutex constraint.\n", "\n", "**Note:**\n", "\n", "> `view_circuit` cannot render `mulinput` gates either and refuses to evaluate. Use `sr_boolexpr` to inspect circuits that contain `mulinput` gates, and `probability_evaluate` to score them.\n", "\n", "## Step 6: Assign Probabilities and Verify Mutual Exclusion\n", "\n", "Each candidate row's confidence becomes the probability that that classifier candidate is the true species for its bounding box:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DO $$ BEGIN\n", " PERFORM set_prob(provenance(), confidence) FROM detection;\n", "END $$;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see that `repair_key` made a numerical difference, ask: *what is the probability that bounding box 1 of photo 5 corresponds to a deer-like animal* (`species_id` 1 = Red Deer or 2 = Roe Deer)? In the data, that bounding box has both candidate species recorded with confidences 0.40 and 0.30:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT photo_id, bbox_id,\n", " sr_boolexpr(provenance()) AS bexpr,\n", " ROUND(probability_evaluate(provenance(), 'tree-decomposition')::numeric, 4) AS p\n", "FROM detection\n", "WHERE photo_id = 5 AND bbox_id = 1 AND species_id IN (1, 2)\n", "GROUP BY photo_id, bbox_id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Boolean expression is now `{0=1}[0.400000] ∨ {0=2}[0.300000]`: a single mulinput variable (variable 0) with two mutually exclusive values, value 1 (Red Deer) with probability 0.40 and value 2 (Roe Deer) with probability 0.30. Probability evaluation gives `0.7000`, the sum of the two confidences, since combining mutually exclusive events with ⊕ is just addition. Had we kept the original `add_provenance` setup with each row as an independent input gate, the same query would have given `1 - (1 - 0.40) × (1 - 0.30) = 0.58` instead. The 0.12 gap is the practical effect of telling the engine \"these candidates cannot both be true at once\".\n", "\n", "## Step 7: Probabilistic Ranking vs. Threshold Filtering\n", "\n", "Run the conjunctive query under two ranking strategies. First, by probability that *both* species are truly present:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.station, p.date,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM detection d1\n", "JOIN detection d2 ON d1.photo_id = d2.photo_id\n", "JOIN photo p ON p.id = d1.photo_id\n", "WHERE d1.species_id = 1 AND d2.species_id = 3\n", "GROUP BY p.id, p.station, p.date\n", "ORDER BY prob DESC, p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Second, by raw confidence threshold (every contributing detection must score at least 0.5):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT p.id, p.station, p.date\n", "FROM detection d1\n", "JOIN detection d2 ON d1.photo_id = d2.photo_id\n", "JOIN photo p ON p.id = d1.photo_id\n", "WHERE d1.species_id = 1 AND d2.species_id = 3\n", " AND d1.confidence >= 0.5 AND d2.confidence >= 0.5\n", "ORDER BY p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Photo 5 is missed by thresholding (every individual candidate there scores below 0.5) but ranks reasonably under `probability_evaluate`, because the weak deer candidates in several boxes plus the weak fox candidates combine into a non-trivial probability that *some* deer and *some* fox are truly there. Conversely, a photo whose top candidates only barely cross 0.5 passes the threshold but ends up low in the probability ranking.\n", "\n", "## Step 8: Absence Constraint with `EXCEPT`\n", "\n", "Find photos that contain a Red Deer but no Domestic Dog (`species_id` 13). `EXCEPT` is implemented in ProvSQL via the ⊖ (monus) operator on the provenance circuit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.station, p.date,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM (\n", " SELECT photo_id FROM detection WHERE species_id = 1\n", " EXCEPT\n", " SELECT photo_id FROM detection WHERE species_id = 13\n", ") t\n", "JOIN photo p ON p.id = t.photo_id\n", "GROUP BY p.id, p.station, p.date\n", "ORDER BY prob DESC, p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Photos that contain a dog still appear in the output – `EXCEPT` is *not* a hard filter. Photo 9, with a high-confidence dog detection, ranks lower because the monus discounts strongly. Photo 14, where the dog detection has very low confidence, ranks higher: it is *probably* in the result, but not certainly. ProvSQL preserves both possibilities in the circuit and lets `probability_evaluate` weigh them.\n", "\n", "## Step 9: Multi-Condition Query via a CTE\n", "\n", "Combine Steps 7 and 8: photos with both Red Deer and Red Fox, with no Domestic Dog, ranked by probability. The query has three logical layers (co-occurrence, absence, ranking) and reads naturally as a CTE:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "WITH deer_and_fox AS (\n", " SELECT d1.photo_id\n", " FROM detection d1\n", " JOIN detection d2 ON d1.photo_id = d2.photo_id\n", " WHERE d1.species_id = 1 AND d2.species_id = 3\n", " GROUP BY d1.photo_id\n", "),\n", "no_dogs AS (\n", " SELECT photo_id FROM deer_and_fox\n", " EXCEPT\n", " SELECT photo_id FROM detection WHERE species_id = 13\n", ")\n", "SELECT p.id, p.station, p.date,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM no_dogs t\n", "JOIN photo p ON p.id = t.photo_id\n", "ORDER BY prob DESC, p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "ProvSQL's planner hook fires on the expanded query: CTEs are inlined and provenance propagates through them transparently. The same answer can be written with nested subqueries; the CTE form is purely a readability choice.\n", "\n", "## Step 10: Expected Species Counts with `expected`\n", "\n", "How many distinct detections do we *expect* to be true positives in each photo?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.station,\n", " ROUND(expected(COUNT(*))::numeric, 4) AS exp_detections\n", "FROM detection d\n", "JOIN photo p ON p.id = d.photo_id\n", "GROUP BY p.id, p.station\n", "ORDER BY exp_detections DESC, p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By linearity of expectation, `expected(COUNT(*))` over a group is $`\\sum_i P(\\text{detection}_i \\text{ is true})`$. The same linearity applies to `SUM` aggregates: the expected total confidence mass per photo:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.station,\n", " ROUND(expected(SUM(d.confidence))::numeric, 4) AS exp_total_conf\n", "FROM detection d\n", "JOIN photo p ON p.id = d.photo_id\n", "GROUP BY p.id, p.station\n", "ORDER BY exp_total_conf DESC, p.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both queries use ProvSQL's `expected` operator, which computes the expected value of a SQL aggregate over the probabilistic database defined by the per-row probabilities set in Step 6. Photos with many high-confidence detections rank highest on both metrics." ] } ] }