{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs1", "generated_from": "doc/source/user/casestudy1.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: Intelligence Agency\n", "\n", "This case study, largely extending the scenario introduced when ProvSQL was first presented, demonstrates ProvSQL's custom semiring capability, where-provenance, probability computation with multiple algorithms, and circuit export through a security-classification scenario.\n", "\n", "## The Scenario\n", "\n", "An intelligence agency maintains a database of seven employees spread across three cities. Every employee holds a security clearance ranging from *unclassified* to *top secret*. Your tasks:\n", "\n", "- identify which cities are served by more than one agent,\n", "- determine the minimum clearance level needed to infer each result,\n", "- find cities with exactly one agent (sensitive: if the city leaks, the sole agent is exposed),\n", "- track where in the database each output value originated,\n", "- compute the probability that a city remains a single-agent post after accounting for possible-world uncertainty.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TYPE IF EXISTS classification_level CASCADE;\n", "CREATE TYPE classification_level AS ENUM (\n", " 'unclassified', 'restricted', 'confidential', 'secret', 'top_secret',\n", " 'unavailable'\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS personnel CASCADE;\n", "CREATE TABLE personnel (\n", " id SERIAL PRIMARY KEY,\n", " name TEXT NOT NULL,\n", " position TEXT NOT NULL,\n", " city TEXT NOT NULL,\n", " classification classification_level NOT NULL\n", ");" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "TRUNCATE personnel;\n", "INSERT INTO personnel (name, position, city, classification) VALUES\n", " ('Juma', 'Director', 'Nairobi', 'unclassified'),\n", " ('Paul', 'Janitor', 'Nairobi', 'restricted'),\n", " ('David', 'Analyst', 'Paris', 'confidential'),\n", " ('Ellen', 'Field agent', 'Beijing', 'secret'),\n", " ('Aaheli', 'Double agent', 'Paris', 'top_secret'),\n", " ('Nancy', 'HR', 'Paris', 'restricted'),\n", " ('Jing', 'Analyst', 'Beijing', 'secret');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This creates:\n", "\n", "- `classification_level` – an ordered ENUM (`unclassified` \\< `restricted` \\< `confidential` \\< `secret` \\< `top_secret` \\< `unavailable`) where `unavailable` is a sentinel representing the semiring 𝟘 (no derivation possible)\n", "- `personnel` – 7 agents with name, position, city, and clearance level\n", "\n", "## Step 1: Explore the Database\n", "\n", "Inspect the `personnel` table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT * FROM personnel ORDER BY id;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should see seven rows: Juma (Director, Nairobi), Paul (Janitor, Nairobi), David (Analyst, Paris), Ellen (Field agent, Beijing), Aaheli (Double agent, Paris), Nancy (HR, Paris), and Jing (Analyst, Beijing).\n", "\n", "## Step 2: Enable Provenance and Create a Name Mapping\n", "\n", "Enable provenance tracking on `personnel` and create a mapping so that provenance tokens can be labelled with agent names:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT add_provenance('personnel');\n", "DROP TABLE IF EXISTS personnel_name;\n", "SELECT create_provenance_mapping('personnel_name', 'personnel', 'name');" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After `add_provenance`, every row of `personnel` has a unique UUID token in its hidden `provsql` column. The mapping `personnel_name` associates each token with the corresponding agent's name.\n", "\n", "## Step 3: Cities Shared by Multiple Agents\n", "\n", "Which cities have at least two agents? Use a self-join with an `id` inequality to generate all unordered pairs, then apply `sr_formula` to see which agents contribute to each city:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p1.city,\n", " sr_formula(provenance(), 'personnel_name') AS formula\n", "FROM personnel p1\n", "JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", "GROUP BY p1.city\n", "ORDER BY p1.city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The formula for Nairobi is `Juma βŠ— Paul`: both agents must be present for Nairobi to appear. Beijing similarly shows `Ellen βŠ— Jing`. Paris, with three agents, shows all three pairwise products joined by `βŠ•`.\n", "\n", "## Step 4: Minimum Security Clearance (sr_minmax)\n", "\n", "For each shared city, what is the *minimum clearance level* required to have inferred that the city has multiple agents? An analyst who knows the city only needs to see the lowest-cleared agent there.\n", "\n", "This is the security shape of the min-max m-semiring, computed by the compiled function `sr_minmax` over the `classification_level` enum:\n", "\n", "- `βŠ•` (OR combination) = `min`: to infer *either* agent was involved, you only need clearance for the less-classified one (one witness suffices to establish the disjunction).\n", "- `βŠ—` (AND/join) = `max`: to confirm *both* agents are present, you need clearance for the more-classified one (you must be able to access both records to establish the join).\n", "\n", "The third argument is a sample value of the carrier enum (used only for type inference); its value is ignored:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS personnel_level;\n", "SELECT create_provenance_mapping('personnel_level',\n", " 'personnel', 'classification');\n", "\n", "SELECT p1.city,\n", " sr_minmax(provenance(), 'personnel_level',\n", " 'unclassified'::classification_level) AS min_clearance\n", "FROM personnel p1\n", "JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", "GROUP BY p1.city\n", "ORDER BY p1.city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Results: Nairobi requires `restricted` (Paul is the more-classified of the two agents, and both must be accessed to confirm the pair). Beijing requires `secret` (both Ellen and Jing hold the same level). Paris requires `confidential`: the pair David–Nancy has MAX clearance `confidential`, which is the lowest maximum among all Paris pairs, so `confidential` clearance suffices to confirm at least one pair.\n", "\n", "## Step 5: Cities with Exactly One Agent (EXCEPT / Monus)\n", "\n", "A city with a single agent is sensitive: knowing the city immediately identifies the agent. Find cities where *all* agents are alone using `EXCEPT`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city,\n", " sr_formula(provenance(), 'personnel_name') AS formula\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "ORDER BY city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note:**\n", "\n", "> ProvSQL's `EXCEPT` uses the *monus* operator `βŠ–` of the provenance semiring rather than plain set difference. Every city appears in the result with a provenance formula; the formula evaluates to `𝟘` for cities that are definitely shared and to a non-trivial expression for cities that *could* be single-agent in some possible world. Nairobi's formula `(Juma βŠ• Paul) βŠ– (Juma βŠ— Paul)` reads: \"at least one of Juma or Paul is present, minus the event where both are.\"\n", "\n", "## Step 6: Where-Provenance\n", "\n", "Where-provenance tracks *which column of which input row* produced each output value. Enable it and re-run the shared-city query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SET provsql.where_provenance = on;\n", "\n", "SELECT p1.city,\n", " where_provenance(provenance()) AS source\n", "FROM personnel p1\n", "JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", "GROUP BY p1.city\n", "ORDER BY p1.city;\n", "\n", "SET provsql.where_provenance = off;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each city output value is traced back to the `city` column (column 4) of the `personnel` table for every agent in that city. The notation `[personnel:γ€ˆtoken〉:4;...]` shows the token and column index of each contributing row; the trailing `[]` is the untracked `source` column itself.\n", "\n", "## Step 7: Assign Probabilities\n", "\n", "Suppose the existence of each agent in the database is uncertain. Assign each agent a probability equal to `id / 10.0`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "ALTER TABLE personnel ADD COLUMN IF NOT EXISTS probability DOUBLE PRECISION;\n", "UPDATE personnel SET probability = id / 10.0;\n", "\n", "DO $$ BEGIN\n", " PERFORM set_prob(provenance(), probability) FROM personnel;\n", "END $$;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now Juma has probability 0.1, Paul 0.2, …, Jing 0.7.\n", "\n", "## Step 8: Probability – Exact\n", "\n", "Compute the exact probability that each city is a single-agent city:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city,\n", " ROUND(probability_evaluate(provenance())::numeric, 4) AS prob\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "ORDER BY city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nairobi (agents with probabilities 0.1 and 0.2) has probability `0.1 Γ— 0.8 + 0.9 Γ— 0.2 = 0.26` of having exactly one agent. Beijing (0.4 and 0.7) scores `0.54`. Paris (0.3, 0.5, 0.6) gives `0.41`.\n", "\n", "## Step 9: Probability – Monte Carlo\n", "\n", "For larger circuits, exact evaluation can be expensive. Monte Carlo sampling gives an approximate answer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city,\n", " probability_evaluate(\n", " provenance(), 'monte-carlo', '10000') AS prob\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "ORDER BY city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With 10 000 samples the result is accurate to roughly Β±0.01 (see [Margin of error](https://en.wikipedia.org/wiki/Margin_of_error)). Results will vary slightly between runs due to sampling. Compare the runtime against the exact method (`\\timing` in psql, or the per-query timing Studio displays).\n", "\n", "## Step 10: Probability – Knowledge Compiler\n", "\n", "**Note:**\n", "\n", "> This step requires an external knowledge compiler such as `d4` or `dsharp` to be installed and on your `PATH` (or in a directory listed in the `provsql.tool_search_path` GUC, see the configuration chapter). Skip it if neither is available.\n", "\n", "A knowledge compiler converts the provenance circuit to a *d-DNNF* representation, which enables efficient exact probability evaluation on large circuits of specific forms:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city,\n", " ROUND(probability_evaluate(\n", " provenance(), 'compilation', 'd4')::numeric, 4) AS prob\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "ORDER BY city;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the runtime (`\\timing` in psql, or Studio's per-query timing) against the possible-worlds and Monte Carlo methods. On this small example, the external knowledge compiler will be slower than the other methods: invoking an external process and compiling the circuit carries significant overhead that only pays off on much larger circuits.\n", "\n", "## Step 11: Visualise a Provenance Circuit\n", "\n", "`view_circuit` renders the provenance circuit as an ASCII box-art diagram using [graph-easy](https://metacpan.org/dist/Graph-Easy) (must be installed and on your `PATH`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city, view_circuit(provenance(), 'personnel_name') AS circuit\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "WHERE city = 'Nairobi';" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result shows the monus gate at the top, with `Juma` and `Paul` as leaf inputs, rendered in box-art notation in the terminal.\n", "\n", "## Step 12: Export to XML\n", "\n", "`to_provxml` serialises the provenance circuit to [PROV-XML](https://www.w3.org/TR/prov-xml/), the W3C standard for provenance interchange, which can be processed by any standard XML tool:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city,\n", " to_provxml(provenance(), 'personnel_name') AS provxml\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "WHERE city = 'Nairobi';" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 13: Large Circuit Benchmark\n", "\n", "To compare the three probability algorithms at scale, create a synthetic 100 Γ— 100 random-probability matrix and enable provenance tracking on it:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS matrix CASCADE;\n", "CREATE TABLE matrix AS\n", "SELECT ones.n + 10 * tens.n AS x,\n", " other.n + 10 * tens2.n AS y,\n", " random() AS prob\n", "FROM (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) ones(n),\n", " (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) tens(n),\n", " (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) other(n),\n", " (VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) tens2(n);\n", "\n", "SELECT add_provenance('matrix');\n", "DO $$ BEGIN\n", " PERFORM set_prob(provenance(), prob) FROM matrix;\n", "END $$;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now run the same path query with each method in turn, timing each (`\\timing` in psql, or Studio's per-query timing):" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Default method (independent evaluation, tree decomposition, or d4)\n", "SELECT m1.x, m2.y,\n", " probability_evaluate(provenance()) AS prob\n", "FROM matrix m1, matrix m2\n", "WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90\n", "GROUP BY m1.x, m2.y\n", "ORDER BY m1.x, m2.y;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Exact enumeration over all possible worlds\n", "SELECT m1.x, m2.y,\n", " probability_evaluate(provenance(), 'possible-worlds') AS prob\n", "FROM matrix m1, matrix m2\n", "WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90\n", "GROUP BY m1.x, m2.y\n", "ORDER BY m1.x, m2.y;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Monte Carlo sampling (9604 samples β‰ˆ 1 % error at 95 % confidence)\n", "SELECT m1.x, m2.y,\n", " probability_evaluate(provenance(), 'monte-carlo', '9604') AS prob\n", "FROM matrix m1, matrix m2\n", "WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90\n", "GROUP BY m1.x, m2.y\n", "ORDER BY m1.x, m2.y;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Tree-decomposition-based exact compilation (built-in, no external tool)\n", "SELECT m1.x, m2.y,\n", " probability_evaluate(provenance(), 'tree-decomposition') AS prob\n", "FROM matrix m1, matrix m2\n", "WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90\n", "GROUP BY m1.x, m2.y\n", "ORDER BY m1.x, m2.y;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Knowledge compilation via external tool d4\n", "SELECT m1.x, m2.y,\n", " probability_evaluate(provenance(), 'compilation', 'd4') AS prob\n", "FROM matrix m1, matrix m2\n", "WHERE m2.x = m1.y AND m1.x > 90 AND m2.x > 90 AND m2.y > 90\n", "GROUP BY m1.x, m2.y\n", "ORDER BY m1.x, m2.y;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Monte Carlo query uses `9604` samples, which gives roughly 1 % additive error with 95 % confidence (by the formula $`n = z^2 / (4\\varepsilon^2)`$ with $`z = 1.96`$, $`\\varepsilon = 0.01`$; see [Margin of error](https://en.wikipedia.org/wiki/Margin_of_error)).\n", "\n", "The `'tree-decomposition'` method is exact and built into ProvSQL (no external binary required). It is often the fastest exact method on simple queries, but it fails on circuits with high treewidth – when that happens, fall back to `'compilation'` or one of the other methods.\n", "\n", "## Step 14: The Boolean Expression Behind a Token\n", "\n", "`sr_boolexpr` returns the abstract Boolean formula of a provenance circuit. Without a mapping it uses internal variable names `x0`, `x1`…; with an optional second argument naming a provenance mapping table the leaves are labelled by the mapping's `value` column instead. This is the same expression ProvSQL hands to its d-DNNF compilers internally to compute probabilities." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT city, sr_boolexpr(provenance()) AS boolexpr\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "WHERE city = 'Nairobi';" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Nairobi, the result is the circuit `(Juma βŠ• Paul) βŠ– (Juma βŠ— Paul)` from Step 5, interpreted in the Boolean function semiring – every provenance gate is mapped to its Boolean counterpart (`βŠ•` to `∨`, `βŠ—` to `∧`, `βŠ–` to `∧¬`) – and the resulting Boolean function rendered as a formula over anonymous variables. Unlike `sr_formula`, the provenance mapping is optional: the expression captures the circuit's logical structure independently of any naming, and a mapping can be supplied later if you want the leaves labelled.\n", "\n", "## Step 15: Programmatic Circuit Inspection\n", "\n", "What `view_circuit` renders, you can also walk programmatically with the low-level circuit API. Capture Nairobi's monus token first:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "CREATE TEMP TABLE nairobi_token AS\n", "SELECT provenance() AS prov\n", "FROM (\n", " SELECT DISTINCT city FROM personnel\n", " EXCEPT\n", " SELECT p1.city\n", " FROM personnel p1\n", " JOIN personnel p2 ON p1.city = p2.city AND p1.id < p2.id\n", " GROUP BY p1.city\n", ") t\n", "WHERE city = 'Nairobi';" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_nb_gates` reports how many gates have been materialized in the current database's circuit:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT get_nb_gates();" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`get_gate_type` and `get_children` give a single-step view of the gate structure: they return the operator and direct children of a gate." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT get_gate_type(prov) AS root_type,\n", " get_children(prov) AS root_children\n", "FROM nairobi_token;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For Nairobi, the root is a `monus` gate with two children: the βŠ• sub-circuit (`Juma βŠ• Paul`) and the βŠ— sub-circuit (`Juma βŠ— Paul`). Recurse to inspect the children:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT (get_children(prov))[1] AS plus_token,\n", " get_gate_type((get_children(prov))[1]) AS plus_type,\n", " get_children((get_children(prov))[1]) AS plus_children\n", "FROM nairobi_token;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The leaves of the circuit are input gates that originate from the `personnel` table. `identify_token` performs a reverse lookup, returning the table and column count for an input token:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT identify_token(child) AS source\n", "FROM nairobi_token, unnest(get_children((get_children(prov))[1])) AS child;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Both leaves resolve to `(personnel, 6)` – the `personnel` table with its six non-provenance columns (`id`, `name`, `position`, `city`, `classification`, and the `probability` column added in Step 7). This is exactly the traversal `view_circuit` performs to render the box-art diagram." ] } ] }