{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs2", "generated_from": "doc/source/user/casestudy2.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: Open Science Database\n", "\n", "This case study introduces a broader set of ProvSQL features through a realistic scientific literature analysis scenario.\n", "\n", "## The Scenario\n", "\n", "**Warning:**\n", "\n", "> All studies, findings, and reliability scores in this case study are **entirely fictional** and created solely to illustrate ProvSQL features. They do not correspond to real published research and convey no medical or scientific knowledge.\n", "\n", "You are building an evidence-synthesis tool for biomedical research. You have a small database of published studies, each with a study type (case report, observational, RCT, or meta-analysis) and a reliability score. Each study reports one or more *findings*: an exposure (e.g. Coffee, Exercise), an outcome (e.g. Cardiovascular Disease), and an observed effect (beneficial, harmful, or neutral).\n", "\n", "Your tasks:\n", "\n", "- identify single-source vs. replicated claims,\n", "- detect and handle contradictory findings,\n", "- rank findings by the strength of evidence behind them,\n", "- compute the probability that a claim is supported by the available studies.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TYPE IF EXISTS study_quality CASCADE;\n", "CREATE TYPE study_quality AS ENUM ('no_evidence', 'case_report', 'observational', 'rct', 'meta_analysis', 'perfect_evidence');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS study CASCADE;\n", "CREATE TABLE study (\n", " id integer PRIMARY KEY,\n", " title text NOT NULL,\n", " year integer,\n", " study_type study_quality NOT NULL,\n", " reliability double precision NOT NULL\n", ");\n", "DROP TABLE IF EXISTS exposure CASCADE;\n", "CREATE TABLE exposure (id integer PRIMARY KEY, name text NOT NULL);\n", "DROP TABLE IF EXISTS outcome CASCADE;\n", "CREATE TABLE outcome (id integer PRIMARY KEY, name text NOT NULL);\n", "DROP TABLE IF EXISTS finding CASCADE;\n", "CREATE TABLE finding (\n", " id integer PRIMARY KEY,\n", " study_id integer NOT NULL,\n", " exposure_id integer NOT NULL,\n", " outcome_id integer NOT NULL,\n", " effect text NOT NULL,\n", " effect_size double precision\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE study;\n", "COPY study (id, title, year, study_type, reliability) FROM stdin;\n", "1\tSmith2018\t2018\trct\t0.92\n", "2\tJohnson2020\t2020\tmeta_analysis\t0.98\n", "3\tChen2019\t2019\tobservational\t0.76\n", "4\tWilliams2021\t2021\trct\t0.88\n", "5\tGarcia2017\t2017\tobservational\t0.65\n", "6\tBrown2022\t2022\tmeta_analysis\t0.95\n", "7\tMartinez2020\t2020\tcase_report\t0.45\n", "8\tPark2021\t2021\trct\t0.85\n", "\\." ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE exposure;\n", "COPY exposure (id, name) FROM stdin;\n", "1\tCoffee\n", "2\tExercise\n", "3\tRed Meat\n", "4\tAspirin\n", "5\tOmega-3\n", "6\tAlcohol\n", "7\tProcessed Food\n", "\\." ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE outcome;\n", "COPY outcome (id, name) FROM stdin;\n", "1\tCardiovascular Disease\n", "2\tType 2 Diabetes\n", "3\tInflammation\n", "4\tColorectal Cancer\n", "5\tCognitive Decline\n", "\\." ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- 25 findings; contradictions on Coffee→CVD and Alcohol→CVD;\n", "-- single-study claims on Aspirin→Cognitive Decline, Omega-3→CVD, Exercise→Inflammation.\n", "TRUNCATE finding;\n", "COPY finding (id, study_id, exposure_id, outcome_id, effect, effect_size) FROM stdin;\n", "1\t5\t1\t1\tharmful\t1.3\n", "2\t6\t1\t1\tbeneficial\t0.85\n", "3\t3\t1\t1\tneutral\t1.0\n", "4\t1\t2\t1\tbeneficial\t0.60\n", "5\t2\t2\t1\tbeneficial\t0.55\n", "6\t4\t2\t1\tbeneficial\t0.65\n", "7\t2\t3\t4\tharmful\t1.50\n", "8\t3\t3\t4\tharmful\t1.40\n", "9\t1\t4\t1\tbeneficial\t0.75\n", "10\t4\t4\t1\tbeneficial\t0.78\n", "11\t6\t5\t3\tbeneficial\t0.70\n", "12\t8\t5\t3\tbeneficial\t0.72\n", "13\t5\t6\t1\tbeneficial\t0.90\n", "14\t7\t6\t1\tharmful\t1.20\n", "15\t2\t7\t2\tharmful\t1.80\n", "16\t3\t7\t2\tharmful\t1.60\n", "17\t6\t1\t5\tbeneficial\t0.80\n", "18\t8\t1\t5\tbeneficial\t0.75\n", "19\t3\t3\t3\tharmful\t1.30\n", "20\t7\t3\t3\tharmful\t1.25\n", "21\t4\t6\t2\tharmful\t1.40\n", "22\t5\t6\t2\tharmful\t1.35\n", "23\t8\t4\t5\tbeneficial\t0.60\n", "24\t6\t5\t1\tbeneficial\t0.80\n", "25\t1\t2\t3\tbeneficial\t0.70\n", "\\." ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates four tables:\n", "\n", "- `study` – 8 published studies with type and reliability score\n", "- `exposure` – 7 exposures (Coffee, Exercise, Red Meat…)\n", "- `outcome` – 5 health outcomes (Cardiovascular Disease…)\n", "- `finding` – 25 study findings linking exposures to outcomes\n", "\n", "## Step 1: Explore the Database\n", "\n", "Familiarise yourself with the data. The `study_type` column uses a PostgreSQL `ENUM` ordered by evidence quality: `no_evidence < case_report < observational < rct < meta_analysis < perfect_evidence`, where `no_evidence` is the semiring 𝟘 (no derivation possible) and `perfect_evidence` is the semiring 𝟙 (neutral for ⊗=MIN: does not degrade quality chains).\n", "\n", "## Step 2: Enable Provenance and Join with Lookup Tables\n", "\n", "Enable provenance tracking on `finding`, the base fact table. Each row in `finding` receives a unique UUID circuit token that will be carried through any downstream query." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('finding');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now build a view `f` by joining `finding` with the three lookup tables. ProvSQL transparently propagates each `finding` row's token through the join, so every row in `f` carries the provenance token of the `finding` row it came from. Define `f` using the following columns: `study` (the study title), `study_type`, `reliability`, `exposure` (the exposure name), `outcome` (the outcome name), and `effect`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "CREATE OR REPLACE VIEW f AS\n", " SELECT study.title AS study,\n", " study.study_type,\n", " study.reliability,\n", " exposure.name AS exposure,\n", " outcome.name AS outcome,\n", " finding.effect\n", " FROM finding\n", " JOIN study ON finding.study_id = study.id\n", " JOIN exposure ON finding.exposure_id = exposure.id\n", " JOIN outcome ON finding.outcome_id = outcome.id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "\n", "> Querying a view that references a provenance-enabled table automatically exposes the provenance column: ProvSQL's planner hook fires on the expanded query and propagates the `finding` token through the join. Any query on `f` therefore carries full provenance, even though provenance was never explicitly added to `f` itself.\n", "\n", "## Step 3: Create a Provenance Mapping\n", "\n", "Create a mapping from provenance tokens (which trace back to `finding` rows) to study titles, using the `study` column of `f`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS study_mapping;\n", "SELECT create_provenance_mapping('study_mapping', 'f', 'study');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4: Identify Single-Source Claims\n", "\n", "Some claims rest on a single study. Use `sr_formula` with `study_mapping` to display the symbolic provenance formula for a few findings of interest: the (Coffee, Cardiovascular Disease, harmful), (Alcohol, Cardiovascular Disease, beneficial), and (Exercise, Inflammation, beneficial) triples.\n", "\n", "**Note:**\n", "\n", "> Use `GROUP BY` (not `SELECT DISTINCT`) when applying a semiring evaluation function. With `SELECT DISTINCT`, the computed formula becomes part of the distinct key, so rows with the same (exposure, outcome, effect) but different single-study formulas are never collapsed – each keeps its own singleton formula. With `GROUP BY`, ProvSQL ⊕-combines all provenance tokens in the group first, and then applies `sr_formula` once on the combined token." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " sr_formula(provenance(), 'study_mapping') AS formula\n", "FROM f\n", "WHERE (exposure = 'Coffee' AND outcome = 'Cardiovascular Disease' AND effect = 'harmful')\n", " OR (exposure = 'Alcohol' AND outcome = 'Cardiovascular Disease' AND effect = 'beneficial')\n", " OR (exposure = 'Exercise' AND outcome = 'Inflammation' AND effect = 'beneficial')\n", "GROUP BY exposure, outcome, effect\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that each formula is just a single study name: these findings have only one source.\n", "\n", "## Step 5: Why-Provenance for Replicated Findings\n", "\n", "For replicated findings, the *why-provenance* returns a set of *witness sets*. Each witness set is a minimal collection of studies that together (⊗) suffice to derive the finding; the outer set collects all such independent alternatives (⊕). Use `sr_why` on `f` with `GROUP BY` for the (Exercise, Cardiovascular Disease, beneficial) and (Aspirin, Cardiovascular Disease, beneficial) pairs." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " sr_why(provenance(), 'study_mapping') AS witnesses\n", "FROM f\n", "WHERE (exposure = 'Exercise' AND outcome = 'Cardiovascular Disease')\n", " OR (exposure = 'Aspirin' AND outcome = 'Cardiovascular Disease')\n", "GROUP BY exposure, outcome, effect\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each inner set in `witnesses` is a minimal group of studies that together (⊗) derive the finding; the multiple inner sets are independent alternatives (⊕) – any one of them alone suffices.\n", "\n", "## Step 6: Evidence Grade Semiring\n", "\n", "Define a custom *evidence grade* semiring over `study_quality` (see the section on custom semirings for a full description of the mechanism):\n", "\n", "- **⊕ = MAX** (best quality among alternative derivations)\n", "- **⊗ = MIN** (weakest quality in a chain of derivations)\n", "\n", "This answers: *\"What is the best study type supporting this finding?\"*\n", "\n", "Implement this semiring using PostgreSQL aggregate functions and `provenance_evaluate`, create a `quality_mapping` from `f` over the `study_type` column, and compute the evidence grade for every (exposure, outcome, effect) triple." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "CREATE OR REPLACE FUNCTION quality_plus_state(state study_quality, q study_quality)\n", " RETURNS study_quality AS $$\n", " SELECT GREATEST(state, q)\n", "$$ LANGUAGE SQL IMMUTABLE;\n", "\n", "CREATE OR REPLACE FUNCTION quality_times_state(state study_quality, q study_quality)\n", " RETURNS study_quality AS $$\n", " SELECT LEAST(state, q)\n", "$$ LANGUAGE SQL IMMUTABLE;\n", "\n", "DROP AGGREGATE IF EXISTS quality_plus(study_quality);\n", "CREATE AGGREGATE quality_plus(study_quality) (\n", " sfunc = quality_plus_state, stype = study_quality, initcond = 'no_evidence'\n", ");\n", "DROP AGGREGATE IF EXISTS quality_times(study_quality);\n", "CREATE AGGREGATE quality_times(study_quality) (\n", " sfunc = quality_times_state, stype = study_quality, initcond = 'perfect_evidence'\n", ");\n", "\n", "CREATE OR REPLACE FUNCTION evidence_grade(token UUID, token2value regclass)\n", " RETURNS study_quality AS $$\n", "BEGIN\n", " RETURN provenance_evaluate(\n", " token, token2value,\n", " 'perfect_evidence'::study_quality,\n", " 'quality_plus', 'quality_times'\n", " );\n", "END\n", "$$ LANGUAGE plpgsql;\n", "\n", "DROP TABLE IF EXISTS quality_mapping;\n", "SELECT create_provenance_mapping('quality_mapping', 'f', 'study_type');\n", "\n", "SELECT exposure, outcome, effect,\n", " evidence_grade(provenance(), 'quality_mapping') AS grade\n", "FROM f\n", "GROUP BY exposure, outcome, effect\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "\n", "> Use `GROUP BY` (not `SELECT DISTINCT`) when combining an aggregate function with provenance evaluation over a grouped result. `GROUP BY` collapses each group into a single provenance token (via ⊕), whereas `SELECT DISTINCT` would include the computed value in the distinct scope and produce spurious duplicates.\n", "\n", "## Step 7: Where-Provenance\n", "\n", "Where-provenance tracks which *column* of which *table* each value in a result came from. Enable it and query `f` for the Smith2018/Exercise/CVD finding:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SET provsql.where_provenance = on;\n", "\n", "SELECT study, study_type, exposure, outcome, effect,\n", " where_provenance(provenance()) AS source\n", "FROM f\n", "WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND study = 'Smith2018';\n", "\n", "SET provsql.where_provenance = off;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each entry in `source` takes the form `[table:token:column]`, where `token` is the provenance UUID of the source row and `column` is its position in the table. Only `effect` is tracked, appearing as `[finding:〈token〉:5]`; the remaining columns (`study`, `study_type`, `exposure`, `outcome`) appear as empty `[]` – they originate from `study`, `exposure`, and `outcome` tables that have no provenance enabled.\n", "\n", "## Step 8: Where-Provenance on the Base Table\n", "\n", "To see full column-level tracking, query `finding` directly. Selecting `finding`-column references ensures every output value originates from the provenance-enabled table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SET provsql.where_provenance = on;\n", "\n", "SELECT finding.study_id, finding.exposure_id, finding.outcome_id, finding.effect,\n", " where_provenance(provenance()) AS source\n", "FROM finding\n", "JOIN study ON finding.study_id = study.id AND study.title = 'Smith2018'\n", "JOIN exposure ON finding.exposure_id = exposure.id AND exposure.name = 'Exercise'\n", "JOIN outcome ON finding.outcome_id = outcome.id AND outcome.name = 'Cardiovascular Disease'\n", "WHERE finding.effect = 'beneficial';\n", "\n", "SET provsql.where_provenance = off;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now every output column traces back to its source column in `finding`: `study_id` → `[finding:〈token〉:2]`, `exposure_id` → `[finding:〈token〉:3]`, `outcome_id` → `[finding:〈token〉:4]`, `effect` → `[finding:〈token〉:5]`. The trailing `[]` is the untracked `source` column itself.\n", "\n", "## Step 9: Assign Probabilities\n", "\n", "Assign each row of `f` its study's reliability score as a probability:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DO $$ BEGIN\n", " PERFORM set_prob(provenance(), reliability) FROM f;\n", "END $$;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now compute the probability that at least one study supports each of the following three findings: (Exercise, Cardiovascular Disease, beneficial), (Coffee, Cardiovascular Disease, harmful), and (Aspirin, Cognitive Decline, beneficial)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM f\n", "WHERE (exposure = 'Exercise' AND outcome = 'Cardiovascular Disease' AND effect = 'beneficial')\n", " OR (exposure = 'Coffee' AND outcome = 'Cardiovascular Disease' AND effect = 'harmful')\n", " OR (exposure = 'Aspirin' AND outcome = 'Cognitive Decline' AND effect = 'beneficial')\n", "GROUP BY exposure, outcome, effect\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exercise→CVD→beneficial achieves 0.9998 (three independent studies with high reliability). Aspirin→Cognitive Decline→beneficial scores only 0.8500 (one study with reliability 0.85).\n", "\n", "## Step 10: Build the Replication View\n", "\n", "A finding is considered *replicated* if at least two independent studies report it. Define the `f_replicated` view, which groups findings by (exposure, outcome, effect) and applies the replication threshold via `HAVING COUNT(*) >= 2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "CREATE OR REPLACE VIEW f_replicated AS\n", "SELECT exposure, outcome, effect FROM f\n", "GROUP BY exposure, outcome, effect\n", "HAVING COUNT(*) >= 2;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "\n", "> With ProvSQL, `HAVING` does not silently drop groups that fail the threshold. Instead, those groups keep a provenance token that evaluates to the semiring zero `𝟘` in any semiring evaluation, so they remain in the output but are correctly handled by any subsequent semiring evaluation or probability computation.\n", "\n", "## Step 11: Inspect Replication with `sr_counting`\n", "\n", "To inspect the provenance semantics of `f_replicated`, first add an integer column `cnt` to `finding` (all values `1`) and create a `count_mapping`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "ALTER TABLE finding ADD COLUMN IF NOT EXISTS cnt int DEFAULT 1;\n", "DROP TABLE IF EXISTS count_mapping;\n", "SELECT create_provenance_mapping('count_mapping', 'finding', 'cnt');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now query `f_replicated` using `sr_counting` to display, for each (exposure, outcome, effect) triple, whether its provenance token is zero or non-zero.\n", "\n", "**Note:**\n", "\n", "> `sr_counting` is a *provenance* semiring evaluation, independent of the SQL `COUNT(*)` aggregate. `COUNT(*)` is standard SQL that drives the `HAVING` threshold; `sr_counting` evaluates the resulting provenance circuit under the counting semiring, assigning each base finding the value from `count_mapping` (`1` per row). The two happen to share the word \"count\" but serve completely different roles." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " sr_counting(provenance(), 'count_mapping') AS replicated\n", "FROM f_replicated\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that single-study findings (Aspirin→Cognitive Decline, etc.) receive a `replicated` value of `0`, while findings supported by two or more studies receive `1`.\n", "\n", "## Step 12: Probability of Replication\n", "\n", "Now use `f_replicated` to compute, for each (exposure, outcome, effect) triple, the probability that the finding is *replicated* – i.e. supported by at least two independent studies." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM f_replicated\n", "ORDER BY exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Single-study findings (Aspirin→Cognitive Decline, etc.) now return 0.0000: the `HAVING COUNT(*) >= 2` gate produces `𝟘` for groups with only one row, which `probability_evaluate` correctly maps to probability 0. Exercise→CVD→beneficial drops from 0.9998 to 0.9868, reflecting that now *at least two* of the three studies must agree.\n", "\n", "## Step 13: Shapley Values\n", "\n", "Shapley values measure each study's marginal contribution to the *replication* probability of a finding. Because probabilities are set on the input tuples, `shapley` computes *expected* Shapley values in the probabilistic sense (see the chapter on Shapley values). A key property of expected Shapley values is that they sum to the probability of the query result – here, the replication probability computed in Step 12.\n", "\n", "Compute expected Shapley values for Exercise→CVD→beneficial, using `f_replicated` as the target and individual `f` rows as the variables.\n", "\n", "**Hint:**\n", "\n", "> `shapley` takes two provenance tokens: the combined token of the query result (the \"target\", from `f_replicated`) and the individual token of each input row (the \"variable\", from `f`). Capture both in subqueries and cross-join them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT fin.study,\n", " ROUND(shapley(target.prov, fin.prov)::numeric, 4) AS sv\n", "FROM (\n", " SELECT provenance() AS prov\n", " FROM f_replicated\n", " WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND effect = 'beneficial'\n", ") target,\n", "(\n", " SELECT study, provenance() AS prov\n", " FROM f\n", " WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND effect = 'beneficial'\n", ") fin\n", "ORDER BY sv DESC, study;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Johnson2020 (meta-analysis, reliability 0.98) has the highest Shapley value (0.3531), reflecting its dominant contribution to the replication probability. Smith2018 (RCT, 0.92) scores 0.3267 and Williams2021 (RCT, 0.88) scores 0.3071. The three values sum to 0.9869, equal to the replication probability obtained in Step 12, as guaranteed by the efficiency axiom for expected Shapley values.\n", "\n", "## Step 14: Banzhaf Values\n", "\n", "*Expected Banzhaf values* offer an alternative game-theoretic measure of each study's contribution. Unlike Shapley values, which average marginal contributions over all orderings of the players, Banzhaf values average over all *subsets* of the other players. This makes the computation simpler (no ordering weights) and often faster in practice. Like Shapley values, they are computed in the probabilistic sense (see the chapter on Shapley values), but they do not satisfy the efficiency axiom: they are not constrained to sum to the probability of the query result.\n", "\n", "**Hint:**\n", "\n", "> `banzhaf` takes the same two arguments as `shapley`: the combined provenance token of the query result (the \"target\", from `f_replicated`) and the individual provenance token of each input row (the \"variable\", from `f`). Reuse the same cross-join structure from Step 13.\n", "\n", "Compute Banzhaf values for Exercise→CVD→beneficial and compare them with the Shapley values from Step 13." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT fin.study,\n", " ROUND(banzhaf(target.prov, fin.prov)::numeric, 4) AS bv\n", "FROM (\n", " SELECT provenance() AS prov\n", " FROM f_replicated\n", " WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND effect = 'beneficial'\n", ") target,\n", "(\n", " SELECT study, provenance() AS prov\n", " FROM f\n", " WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND effect = 'beneficial'\n", ") fin\n", "ORDER BY bv DESC, study;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The relative ranking of Johnson2020, Smith2018, and Williams2021 is the same as with Shapley values, confirming that both measures agree on who contributes most to the replication probability. The absolute magnitudes are larger (around 1.7 per study) and their sum does not equal the replication probability.\n", "\n", "## Step 15: Bulk Shapley/Banzhaf with `shapley_all_vars` and `banzhaf_all_vars`\n", "\n", "The cross-join pattern from Step 13 is convenient when you only want a value for a few specific variables, but is inefficient when many input variables are involved. `shapley_all_vars` and `banzhaf_all_vars` take a single target token and return one row per input variable." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "CREATE TEMP TABLE target_token AS\n", "SELECT provenance() AS prov\n", "FROM f_replicated\n", "WHERE exposure = 'Exercise' AND outcome = 'Cardiovascular Disease'\n", " AND effect = 'beneficial';\n", "\n", "SELECT remove_provenance('target_token');\n", "\n", "SELECT sm.value AS study,\n", " ROUND(sav.value::numeric, 4) AS sv\n", "FROM target_token, shapley_all_vars(prov) sav\n", "JOIN study_mapping sm ON sm.provenance = sav.variable\n", "ORDER BY sv DESC, study;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output is identical to Step 13: Johnson2020, Smith2018, and Williams2021 with the same Shapley values. The difference is in the calling convention – no enumeration of variables is needed – and in efficiency, especially when there are many input variables. Replace `shapley_all_vars` with `banzhaf_all_vars` to reproduce Step 14 in the same way.\n", "\n", "**Note:**\n", "\n", "> `shapley_all_vars` returns one row per *input variable* of the circuit, keyed by its UUID token. The `study_mapping` join maps these tokens back to study titles for display.\n", "\n", "## Step 16: Arithmetic on Aggregate Results\n", "\n", "ProvSQL tracks provenance through SQL aggregates: `COUNT`, `SUM`, and similar produce *aggregate tokens* (`agg_token`) that record the underlying contributions -- and arithmetic (`*`, `+`…) over those aggregates stays inside provenance: the result is itself an aggregate token whose circuit combines the operand aggregates.\n", "\n", "Compute a composite *evidence weight* per (exposure, outcome, effect) triple combining how many studies report it (`COUNT(*)`) with the highest reliability among them (`MAX(reliability)`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT exposure, outcome, effect,\n", " COUNT(*) AS n_studies,\n", " MAX(reliability) AS top_reliability,\n", " COUNT(*) * MAX(reliability) AS evidence_weight,\n", " (COUNT(*) * MAX(reliability))::NUMERIC AS evidence_weight_value\n", "FROM f\n", "GROUP BY exposure, outcome, effect\n", "ORDER BY evidence_weight_value, exposure, outcome, effect;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "\n", "> `evidence_weight` is an aggregate token: it can be inspected and evaluated like any other provenance circuit, but it is not a plain number. To get its *value*, cast it to `NUMERIC` as in `evidence_weight_value` -- the cast extracts the computed number at the price of losing the provenance, making it ordinary SQL data again (usable in `ORDER BY`, comparisons, further computation). The group provenance -- the token associated with each (exposure, outcome, effect) triple -- is unaffected and remains available for `probability_evaluate`, `shapley`, etc. on the group itself." ] } ] }