{
 "nbformat": 4,
 "nbformat_minor": 5,
 "metadata": {
  "kernelspec": {
   "name": "provsql-studio",
   "display_name": "ProvSQL (SQL)",
   "language": "sql"
  },
  "language_info": {
   "name": "sql"
  },
  "provsql": {
   "scheme": "semiring",
   "database": "cs5",
   "generated_from": "doc/source/user/casestudy5.rst"
  }
 },
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case Study: Wildlife Photo Archive\n",
    "\n",
    "This case study, loosely inspired by a ProvSQL demonstration at EDBT 2025, applies ProvSQL to a database of wildlife photographs annotated by a species-detection model. It demonstrates the `VALUES` clause, `repair_key` and the `mulinput` gate, ranking by probability versus thresholding, `EXCEPT`, common table expressions, and `expected` aggregates.\n",
    "\n",
    "## The Scenario\n",
    "\n",
    "A naturalist organisation maintains a database of wildlife photographs taken at four field stations in the Scottish Highlands. Each photo has been processed by a species-detection model that draws one or more *bounding boxes* around things it thinks are animals, and for each box reports a list of candidate species with a confidence score. A box can therefore appear with several species candidates (e.g. a partly-occluded shape might score 0.40 as red deer and 0.30 as roe deer); a photo can contain several boxes of the same species (e.g. three deer in a meadow shot).\n",
    "\n",
    "Your tasks:\n",
    "\n",
    "- find photos that contain specific combinations of species,\n",
    "- rank results by the probability that the combination is truly present,\n",
    "- compare probabilistic ranking against naive confidence thresholding,\n",
    "- exclude photos that contain unwanted species,\n",
    "- compute expected species counts per photo.\n",
    "\n",
    "## Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE IF EXISTS photo CASCADE;\n",
    "CREATE TABLE photo (\n",
    "    id integer PRIMARY KEY,\n",
    "    station text NOT NULL,\n",
    "    date date NOT NULL,\n",
    "    filename text NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE IF EXISTS species CASCADE;\n",
    "CREATE TABLE species (\n",
    "    id integer PRIMARY KEY,\n",
    "    name text NOT NULL,\n",
    "    category text NOT NULL  -- 'mammal', 'bird', 'reptile'\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE IF EXISTS detection CASCADE;\n",
    "CREATE TABLE detection (\n",
    "    photo_id   integer NOT NULL,\n",
    "    bbox_id    integer NOT NULL,\n",
    "    species_id integer NOT NULL,\n",
    "    confidence double precision NOT NULL\n",
    ");"
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE photo;\n",
    "COPY photo (id, station, date, filename) FROM stdin;\n",
    "1\tLoch Torridon\t2024-06-15\tLT-20240615-0001.jpg\n",
    "2\tLoch Torridon\t2024-06-15\tLT-20240615-0002.jpg\n",
    "3\tLoch Torridon\t2024-07-02\tLT-20240702-0001.jpg\n",
    "4\tLoch Torridon\t2024-07-15\tLT-20240715-0001.jpg\n",
    "5\tLoch Torridon\t2024-08-03\tLT-20240803-0001.jpg\n",
    "6\tLoch Torridon\t2024-08-12\tLT-20240812-0001.jpg\n",
    "7\tLoch Torridon\t2024-08-20\tLT-20240820-0001.jpg\n",
    "8\tLoch Torridon\t2024-09-01\tLT-20240901-0001.jpg\n",
    "9\tGlen Affric\t2024-06-20\tGA-20240620-0001.jpg\n",
    "10\tGlen Affric\t2024-07-04\tGA-20240704-0001.jpg\n",
    "11\tGlen Affric\t2024-07-15\tGA-20240715-0001.jpg\n",
    "12\tGlen Affric\t2024-08-08\tGA-20240808-0001.jpg\n",
    "13\tGlen Affric\t2024-08-25\tGA-20240825-0001.jpg\n",
    "14\tGlen Affric\t2024-09-10\tGA-20240910-0001.jpg\n",
    "15\tGlen Affric\t2024-09-22\tGA-20240922-0001.jpg\n",
    "16\tRannoch Moor\t2024-06-10\tRM-20240610-0001.jpg\n",
    "17\tRannoch Moor\t2024-06-25\tRM-20240625-0001.jpg\n",
    "18\tRannoch Moor\t2024-07-12\tRM-20240712-0001.jpg\n",
    "19\tRannoch Moor\t2024-07-30\tRM-20240730-0001.jpg\n",
    "20\tRannoch Moor\t2024-08-15\tRM-20240815-0001.jpg\n",
    "21\tRannoch Moor\t2024-09-02\tRM-20240902-0001.jpg\n",
    "22\tRannoch Moor\t2024-09-15\tRM-20240915-0001.jpg\n",
    "23\tCairngorms\t2024-06-18\tCG-20240618-0001.jpg\n",
    "24\tCairngorms\t2024-07-01\tCG-20240701-0001.jpg\n",
    "25\tCairngorms\t2024-07-20\tCG-20240720-0001.jpg\n",
    "26\tCairngorms\t2024-08-05\tCG-20240805-0001.jpg\n",
    "27\tCairngorms\t2024-08-22\tCG-20240822-0001.jpg\n",
    "28\tCairngorms\t2024-09-08\tCG-20240908-0001.jpg\n",
    "29\tCairngorms\t2024-09-20\tCG-20240920-0001.jpg\n",
    "30\tCairngorms\t2024-10-01\tCG-20241001-0001.jpg\n",
    "\\."
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "TRUNCATE species;\n",
    "COPY species (id, name, category) FROM stdin;\n",
    "1\tRed Deer\tmammal\n",
    "2\tRoe Deer\tmammal\n",
    "3\tRed Fox\tmammal\n",
    "4\tMountain Hare\tmammal\n",
    "5\tPine Marten\tmammal\n",
    "6\tRed Squirrel\tmammal\n",
    "7\tGolden Eagle\tbird\n",
    "8\tOsprey\tbird\n",
    "9\tCapercaillie\tbird\n",
    "10\tAdder\treptile\n",
    "11\tCommon Lizard\treptile\n",
    "12\tWildcat\tmammal\n",
    "13\tDomestic Dog\tmammal\n",
    "\\."
   ],
   "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "-- Detections.  Each row is one (bounding-box, species) candidate produced\n",
    "-- by the classifier; multiple rows for the same (photo_id, bbox_id) pair\n",
    "-- represent alternative species hypotheses for that single bounding box.\n",
    "-- Photo 5: low confidences across multiple bboxes, with one ambiguous box\n",
    "--          (deer or roe deer); motivates probabilistic ranking in Step 7.\n",
    "-- Photo 9: deer + high-confidence dog (excluded from \"deer, no dogs\").\n",
    "-- Photo 14: deer + low-confidence near-miss dog (interesting EXCEPT case).\n",
    "-- Photo 22: ambiguous box plus several clear ones (mixed herd at Rannoch).\n",
    "TRUNCATE detection;\n",
    "COPY detection (photo_id, bbox_id, species_id, confidence) FROM stdin;\n",
    "1\t1\t1\t0.78\n",
    "1\t2\t1\t0.65\n",
    "1\t3\t1\t0.71\n",
    "2\t1\t1\t0.85\n",
    "2\t2\t3\t0.72\n",
    "3\t1\t3\t0.68\n",
    "3\t2\t6\t0.55\n",
    "4\t1\t5\t0.81\n",
    "5\t1\t1\t0.40\n",
    "5\t1\t2\t0.30\n",
    "5\t2\t1\t0.42\n",
    "5\t3\t1\t0.38\n",
    "5\t4\t3\t0.43\n",
    "5\t5\t3\t0.48\n",
    "6\t1\t7\t0.91\n",
    "7\t1\t1\t0.76\n",
    "7\t2\t1\t0.62\n",
    "7\t3\t1\t0.59\n",
    "7\t4\t2\t0.58\n",
    "8\t1\t9\t0.83\n",
    "9\t1\t1\t0.79\n",
    "9\t2\t13\t0.92\n",
    "10\t1\t1\t0.73\n",
    "10\t2\t3\t0.66\n",
    "11\t1\t5\t0.74\n",
    "11\t2\t6\t0.62\n",
    "12\t1\t3\t0.81\n",
    "12\t2\t4\t0.55\n",
    "13\t1\t12\t0.71\n",
    "14\t1\t1\t0.84\n",
    "14\t2\t13\t0.18\n",
    "15\t1\t7\t0.86\n",
    "15\t2\t8\t0.79\n",
    "16\t1\t10\t0.62\n",
    "16\t2\t11\t0.58\n",
    "17\t1\t3\t0.74\n",
    "17\t2\t4\t0.66\n",
    "18\t1\t1\t0.61\n",
    "18\t2\t3\t0.55\n",
    "19\t1\t9\t0.78\n",
    "20\t1\t7\t0.89\n",
    "21\t1\t11\t0.53\n",
    "21\t2\t10\t0.49\n",
    "22\t1\t1\t0.82\n",
    "22\t2\t1\t0.71\n",
    "22\t3\t3\t0.77\n",
    "22\t3\t1\t0.20\n",
    "22\t4\t3\t0.64\n",
    "22\t5\t2\t0.69\n",
    "22\t6\t2\t0.55\n",
    "23\t1\t5\t0.71\n",
    "23\t2\t6\t0.68\n",
    "24\t1\t1\t0.74\n",
    "24\t2\t4\t0.62\n",
    "25\t1\t7\t0.88\n",
    "25\t2\t9\t0.74\n",
    "26\t1\t12\t0.65\n",
    "27\t1\t8\t0.83\n",
    "28\t1\t1\t0.76\n",
    "28\t2\t3\t0.69\n",
    "29\t1\t5\t0.79\n",
    "30\t1\t6\t0.71\n",
    "30\t2\t4\t0.59\n",
    "\\."
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This creates three tables:\n",
    "\n",
    "- `photo` – 30 wildlife photographs, each tagged with a station name (Loch Torridon, Glen Affric, Rannoch Moor, or Cairngorms) and a date\n",
    "- `species` – 13 species across mammals, birds, and reptiles\n",
    "- `detection` – about 60 model-produced (bounding-box, species) candidate rows, each linking a photo and a bounding-box index to a candidate species with a confidence score; multiple rows for the same (`photo_id`, `bbox_id`) pair represent the classifier's alternative species hypotheses for that single bounding box\n",
    "\n",
    "## Step 1: Explore the Database\n",
    "\n",
    "Inspect the tables. Note that `detection` is *not* keyed on (`photo_id`, `bbox_id`): a single bounding box can appear in several rows, one per candidate species the classifier considered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT * FROM photo  ORDER BY id LIMIT 5;\n",
    "SELECT * FROM species ORDER BY id;\n",
    "SELECT photo_id, bbox_id, species_id, confidence\n",
    "FROM detection\n",
    "WHERE photo_id IN (5, 9, 14, 22)\n",
    "ORDER BY photo_id, bbox_id, confidence DESC;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Enable Provenance and Create a Name Mapping\n",
    "\n",
    "Enable provenance tracking on `detection`. Each row receives a UUID circuit token that propagates through any downstream query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT add_provenance('detection');"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get readable formulas, we want to associate each detection's provenance token with its species name. A *provenance mapping* in ProvSQL is nothing more than a regular table with two columns named `value` and `provenance` (plus, for performance, an index on `provenance`). The convenience function `create_provenance_mapping` builds such a table from one column of a provenance-enabled relation, but nothing prevents us from constructing the table by hand:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE IF EXISTS species_mapping;\n",
    "CREATE TABLE species_mapping AS\n",
    "  SELECT s.name AS value, d.provsql AS provenance\n",
    "  FROM detection d JOIN species s ON s.id = d.species_id;\n",
    "\n",
    "SELECT remove_provenance('species_mapping');\n",
    "CREATE INDEX ON species_mapping(provenance);"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `CREATE TABLE AS` query inherits a `provsql` column from `detection` via ProvSQL's planner hook; `remove_provenance` strips that extra column so only the `(value, provenance)` pair remains. Because the schema is fully under our control, we can populate the table from any expression – combine columns, filter rows, derive computed values – and any semiring-evaluation function (`sr_formula`, `sr_why`…) will happily consume the result.\n",
    "\n",
    "## Step 3: Inline Lookup with `VALUES`\n",
    "\n",
    "A `VALUES` clause defines an inline relation directly inside a query. ProvSQL's planner hook treats it like any other source: rows it produces have no provenance of their own, but they carry through joins so that the result inherits provenance from the joined provenance-enabled rows.\n",
    "\n",
    "Here we use `VALUES` to define a small ad-hoc watchlist: a couple of species we want to look up by hand together with a human-readable label, instead of pulling them from the `species` table. Suppose we are interested in Red Deer (`species_id` 1, the dominant grazer whose density we want to track) and Red Fox (`species_id` 3, a generalist predator), and we want to tag the rows accordingly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT v.label, p.id, p.station, p.date,\n",
    "       sr_formula(provenance(), 'species_mapping') AS formula\n",
    "FROM (VALUES (1, 'mammal of interest'),\n",
    "             (3, 'predator of interest')) AS v(species_id, label),\n",
    "     detection d, photo p\n",
    "WHERE d.species_id = v.species_id AND d.photo_id = p.id\n",
    "ORDER BY p.id, v.label;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each output row carries the provenance of the underlying `detection` row alone: the formula is a single species token, even though the row also references `photo` and `VALUES`. Tables without provenance – including the `VALUES` rows – contribute no tokens.\n",
    "\n",
    "## Step 4: Conjunctive Query (Naive)\n",
    "\n",
    "Find photos that contain both Red Deer (`species_id` 1) and Red Fox (`species_id` 3) using a self-join on `detection`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, p.station, p.date,\n",
    "       sr_formula(provenance(), 'species_mapping') AS formula\n",
    "FROM detection d1\n",
    "JOIN detection d2 ON d1.photo_id = d2.photo_id\n",
    "JOIN photo p ON p.id = d1.photo_id\n",
    "WHERE d1.species_id = 1 AND d2.species_id = 3\n",
    "GROUP BY p.id, p.station, p.date\n",
    "ORDER BY p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Look at photo 5: the classifier produced three Red Deer candidate rows (in three different bounding boxes) and two Red Fox candidate rows (in two more boxes). Its formula is the ⊕-sum of all six (deer, fox) pair products – every candidate row is an independent input gate. This matches the structure of the underlying `detection` table but mis-models the data: each bounding box can correspond to *at most one real animal*, so candidate rows that share a `(photo_id, bbox_id)` pair should be mutually exclusive rather than independent. Nothing in the schema enforces that today, and the formula reflects the mismatch.\n",
    "\n",
    "## Step 5: Mutually Exclusive Candidates with `repair_key`\n",
    "\n",
    "`repair_key` rewrites the provenance so that rows sharing a key become alternatives under a single `mulinput` (multivalued input) gate – i.e. *exactly one* of them is true. Applied with the key `(photo_id, bbox_id)`, every bounding box becomes one mulinput variable whose values are the candidate species the classifier considered for that box.\n",
    "\n",
    "`repair_key` takes a single key attribute, so add a synthetic key combining photo and bounding-box index first. `repair_key` reinstalls the `provsql` column itself, so also drop the old mapping (whose tokens are about to become stale):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DROP TABLE species_mapping;\n",
    "SELECT remove_provenance('detection');\n",
    "\n",
    "ALTER TABLE detection ADD COLUMN IF NOT EXISTS photo_bbox text;\n",
    "UPDATE detection SET photo_bbox = photo_id || '/' || bbox_id;\n",
    "\n",
    "SELECT repair_key('detection', 'photo_bbox');\n",
    "\n",
    "DROP TABLE IF EXISTS species_mapping;\n",
    "CREATE TABLE species_mapping AS\n",
    "  SELECT s.name AS value, d.provsql AS provenance\n",
    "  FROM detection d JOIN species s ON s.id = d.species_id;\n",
    "SELECT remove_provenance('species_mapping');\n",
    "CREATE INDEX ON species_mapping(provenance);"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Re-running the conjunctive query from Step 4 with `sr_formula` would not be illuminating: mutually exclusive events have no meaningful representation in the symbolic-formula semiring (each `mulinput` just collapses to `𝟙`). To visualize them we use `sr_boolexpr` instead, which renders the underlying Boolean formula with internal variable names and exposes each `mulinput` explicitly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, sr_boolexpr(provenance()) AS bexpr\n",
    "FROM detection d1\n",
    "JOIN detection d2 ON d1.photo_id = d2.photo_id\n",
    "JOIN photo p ON p.id = d1.photo_id\n",
    "WHERE d1.species_id = 1 AND d2.species_id = 3 AND p.id IN (2, 5)\n",
    "GROUP BY p.id\n",
    "ORDER BY p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each input now appears as a `mulinput` (the `{i=v}[p]` notation denotes \"variable `i` takes value `v` with probability `p`\"). In this query every variable happens to have a single value, so the mutually-exclusive structure is not visible yet. We will see a genuine multi-valued `mulinput` in the next step, where the deer/roe-deer candidates of one bounding box compose under the mutex constraint.\n",
    "\n",
    "**Note:**\n",
    "\n",
    "> `view_circuit` cannot render `mulinput` gates either and refuses to evaluate. Use `sr_boolexpr` to inspect circuits that contain `mulinput` gates, and `probability_evaluate` to score them.\n",
    "\n",
    "## Step 6: Assign Probabilities and Verify Mutual Exclusion\n",
    "\n",
    "Each candidate row's confidence becomes the probability that that classifier candidate is the true species for its bounding box:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "DO $$ BEGIN\n",
    "  PERFORM set_prob(provenance(), confidence) FROM detection;\n",
    "END $$;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see that `repair_key` made a numerical difference, ask: *what is the probability that bounding box 1 of photo 5 corresponds to a deer-like animal* (`species_id` 1 = Red Deer or 2 = Roe Deer)? In the data, that bounding box has both candidate species recorded with confidences 0.40 and 0.30:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT photo_id, bbox_id,\n",
    "    sr_boolexpr(provenance()) AS bexpr,\n",
    "    ROUND(probability_evaluate(provenance(), 'tree-decomposition')::numeric, 4) AS p\n",
    "FROM detection\n",
    "WHERE photo_id = 5 AND bbox_id = 1 AND species_id IN (1, 2)\n",
    "GROUP BY photo_id, bbox_id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Boolean expression is now `{0=1}[0.400000] ∨ {0=2}[0.300000]`: a single mulinput variable (variable 0) with two mutually exclusive values, value 1 (Red Deer) with probability 0.40 and value 2 (Roe Deer) with probability 0.30. Probability evaluation gives `0.7000`, the sum of the two confidences, since combining mutually exclusive events with ⊕ is just addition. Had we kept the original `add_provenance` setup with each row as an independent input gate, the same query would have given `1 - (1 - 0.40) × (1 - 0.30) = 0.58` instead. The 0.12 gap is the practical effect of telling the engine \"these candidates cannot both be true at once\".\n",
    "\n",
    "## Step 7: Probabilistic Ranking vs. Threshold Filtering\n",
    "\n",
    "Run the conjunctive query under two ranking strategies. First, by probability that *both* species are truly present:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, p.station, p.date,\n",
    "       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n",
    "FROM detection d1\n",
    "JOIN detection d2 ON d1.photo_id = d2.photo_id\n",
    "JOIN photo p ON p.id = d1.photo_id\n",
    "WHERE d1.species_id = 1 AND d2.species_id = 3\n",
    "GROUP BY p.id, p.station, p.date\n",
    "ORDER BY prob DESC, p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Second, by raw confidence threshold (every contributing detection must score at least 0.5):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT DISTINCT p.id, p.station, p.date\n",
    "FROM detection d1\n",
    "JOIN detection d2 ON d1.photo_id = d2.photo_id\n",
    "JOIN photo p ON p.id = d1.photo_id\n",
    "WHERE d1.species_id = 1 AND d2.species_id = 3\n",
    "  AND d1.confidence >= 0.5 AND d2.confidence >= 0.5\n",
    "ORDER BY p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Photo 5 is missed by thresholding (every individual candidate there scores below 0.5) but ranks reasonably under `probability_evaluate`, because the weak deer candidates in several boxes plus the weak fox candidates combine into a non-trivial probability that *some* deer and *some* fox are truly there. Conversely, a photo whose top candidates only barely cross 0.5 passes the threshold but ends up low in the probability ranking.\n",
    "\n",
    "## Step 8: Absence Constraint with `EXCEPT`\n",
    "\n",
    "Find photos that contain a Red Deer but no Domestic Dog (`species_id` 13). `EXCEPT` is implemented in ProvSQL via the ⊖ (monus) operator on the provenance circuit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, p.station, p.date,\n",
    "       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n",
    "FROM (\n",
    "    SELECT photo_id FROM detection WHERE species_id = 1\n",
    "  EXCEPT\n",
    "    SELECT photo_id FROM detection WHERE species_id = 13\n",
    ") t\n",
    "JOIN photo p ON p.id = t.photo_id\n",
    "GROUP BY p.id, p.station, p.date\n",
    "ORDER BY prob DESC, p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Photos that contain a dog still appear in the output – `EXCEPT` is *not* a hard filter. Photo 9, with a high-confidence dog detection, ranks lower because the monus discounts strongly. Photo 14, where the dog detection has very low confidence, ranks higher: it is *probably* in the result, but not certainly. ProvSQL preserves both possibilities in the circuit and lets `probability_evaluate` weigh them.\n",
    "\n",
    "## Step 9: Multi-Condition Query via a CTE\n",
    "\n",
    "Combine Steps 7 and 8: photos with both Red Deer and Red Fox, with no Domestic Dog, ranked by probability. The query has three logical layers (co-occurrence, absence, ranking) and reads naturally as a CTE:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "WITH deer_and_fox AS (\n",
    "  SELECT d1.photo_id\n",
    "  FROM detection d1\n",
    "  JOIN detection d2 ON d1.photo_id = d2.photo_id\n",
    "  WHERE d1.species_id = 1 AND d2.species_id = 3\n",
    "  GROUP BY d1.photo_id\n",
    "),\n",
    "no_dogs AS (\n",
    "  SELECT photo_id FROM deer_and_fox\n",
    "  EXCEPT\n",
    "  SELECT photo_id FROM detection WHERE species_id = 13\n",
    ")\n",
    "SELECT p.id, p.station, p.date,\n",
    "       ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n",
    "FROM no_dogs t\n",
    "JOIN photo p ON p.id = t.photo_id\n",
    "ORDER BY prob DESC, p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ProvSQL's planner hook fires on the expanded query: CTEs are inlined and provenance propagates through them transparently. The same answer can be written with nested subqueries; the CTE form is purely a readability choice.\n",
    "\n",
    "## Step 10: Expected Species Counts with `expected`\n",
    "\n",
    "How many distinct detections do we *expect* to be true positives in each photo?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, p.station,\n",
    "       ROUND(expected(COUNT(*))::numeric, 4) AS exp_detections\n",
    "FROM detection d\n",
    "JOIN photo p ON p.id = d.photo_id\n",
    "GROUP BY p.id, p.station\n",
    "ORDER BY exp_detections DESC, p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By linearity of expectation, `expected(COUNT(*))` over a group is $`\\sum_i P(\\text{detection}_i \\text{ is true})`$. The same linearity applies to `SUM` aggregates: the expected total confidence mass per photo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "provsql": {}
   },
   "source": [
    "SELECT p.id, p.station,\n",
    "       ROUND(expected(SUM(d.confidence))::numeric, 4) AS exp_total_conf\n",
    "FROM detection d\n",
    "JOIN photo p ON p.id = d.photo_id\n",
    "GROUP BY p.id, p.station\n",
    "ORDER BY exp_total_conf DESC, p.id;"
   ],
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both queries use ProvSQL's `expected` operator, which computes the expected value of a SQL aggregate over the probabilistic database defined by the per-row probabilities set in Step 6. Photos with many high-confidence detections rank highest on both metrics."
   ]
  }
 ]
}