{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs7", "generated_from": "doc/source/user/casestudy7.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: Peer-Review Assignment and Knowledge Compilation\n", "\n", "This case study looks at ProvSQL from the angle of **knowledge compilation** (see the knowledge-compilation chapter): how the *shape* of a SQL query, **together with the keys the schema declares**, determines the shape of the Boolean provenance circuit it produces, and how that, in turn, decides which probability method is cheap and which needs a full compiler. It is driven through ProvSQL Studio, where the circuit, its CNF, the compiled d-DNNF, and the method comparison all sit side by side.\n", "\n", "The thread running through the case study is a single coverage question asked three ways over the **same data**:\n", "\n", "- asked so that it is **safe by shape** (hierarchical) -- read-once, probability is trivial;\n", "- asked about **one paper**, where it is genuinely entangled yet still **read-once thanks to a key** (the primary key on `expertise`);\n", "- asked about the **whole program**, where it is genuinely [\\#P-hard](https://en.wikipedia.org/wiki/%E2%99%AFP) and a real compiler earns its keep.\n", "\n", "These three are the first rungs of a ladder, none recursive. The case study then climbs the rest, all over the same instance: a `HAVING count(*)` query that a probability-side pre-pass resolves before any compiler runs; a self-join that is **inversion-free** -- hard to every circuit-level method, yet linear-time from a query-derived variable order; and a `repair_key` table whose block correlation stays tractable too. A closing section then turns to **recursive** reachability, where provenance becomes network reliability.\n", "\n", "## The Scenario\n", "\n", "We consider a conference-reviewing setup. Eight relations carry uncertainty -- seven tuple-independent, plus the block-correlated `assignment` -- alongside the deterministic dimension tables `reviewers` / `papers` / `topics`. Three tuple-independent relations make up the core instance and drive the coverage queries; the rest are introduced where they are first used -- two graphs queried recursively in Step 9, `recommend` / `champion` in Step 7, and `assignment` in Step 8:\n", "\n", "- `bid(reviewer, paper)` -- a reviewer offered to review a paper; the confidence is how firm the bid is (availability, willingness).\n", "- `expertise(reviewer, topic)` -- the reviewer's area of competence.\n", "- `topic_of(paper, topic)` -- the paper is about a topic.\n", "\n", "The instance has 14 reviewers, 4 topics and 7 papers. A key modelling choice drives the whole study: `expertise` is declared with a **primary key on** `reviewer` -- each reviewer has exactly one area. That single [functional dependency](https://en.wikipedia.org/wiki/Functional_dependency), `reviewer` $`\\to`$ `topic`, is what makes the per-paper coverage query safe, as we will see. Several reviewers deliberately **share** each area (five are database experts, including Alice, Bob and Judy), so a paper's coverage lineage is genuinely entangled: the same `topic_of` tuple is shared by all the co-experts who bid on the paper.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS bid CASCADE;\n", "DROP TABLE IF EXISTS expertise CASCADE;\n", "DROP TABLE IF EXISTS topic_of CASCADE;\n", "DROP TABLE IF EXISTS extends CASCADE;\n", "DROP TABLE IF EXISTS coreview CASCADE;\n", "DROP TABLE IF EXISTS assignment CASCADE;\n", "DROP TABLE IF EXISTS reviewers CASCADE;\n", "DROP TABLE IF EXISTS papers CASCADE;\n", "DROP TABLE IF EXISTS topics CASCADE;\n", "DROP TABLE IF EXISTS bid_label, expertise_label, topic_of_label,\n", " extends_label, coreview_label;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Dimension tables (deterministic, no provenance).\n", "DROP TABLE IF EXISTS reviewers CASCADE;\n", "CREATE TABLE reviewers (id TEXT PRIMARY KEY, name TEXT NOT NULL);\n", "INSERT INTO reviewers VALUES\n", " ('r1','Alice'), ('r2','Bob'), ('r3','Carol'), ('r4','Dave'),\n", " ('r5','Eve'), ('r6','Frank'), ('r7','Grace'), ('r8','Heidi'),\n", " ('r9','Ivan'), ('r10','Judy'), ('r11','Karl'), ('r12','Lara'),\n", " ('r13','Mona'), ('r14','Nick');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS papers CASCADE;\n", "CREATE TABLE papers (id TEXT PRIMARY KEY, title TEXT NOT NULL);\n", "INSERT INTO papers VALUES\n", " ('p1', 'A Provenance Circuit Calculus'),\n", " ('p2', 'Treewidth Bounds for Safe Queries'),\n", " ('p3', 'Sampling Beats Compilation, Sometimes'),\n", " ('p4', 'Functional Dependencies and Read-Once Lineage'),\n", " ('p5', 'A Dichotomy for Conjunctive Coverage'),\n", " ('p6', 'Knowledge Compilation in Practice'),\n", " ('p7', 'Repair-Key Semantics for Assignments');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS topics CASCADE;\n", "CREATE TABLE topics (id TEXT PRIMARY KEY, name TEXT NOT NULL);\n", "INSERT INTO topics VALUES\n", " ('t1','databases'), ('t2','logic'), ('t3','systems'), ('t4','theory');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- bid(reviewer, paper): the reviewer offered to review the paper; conf\n", "-- is how firm the bid is. Several databases-experts bid on p1, which is\n", "-- what makes p1's coverage interesting.\n", "DROP TABLE IF EXISTS bid CASCADE;\n", "CREATE TABLE bid (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper) -- a reviewer bids on a paper once\n", ");\n", "INSERT INTO bid(reviewer, paper, conf) VALUES\n", " ('r1','p1',0.5), ('r1','p4',0.35),\n", " ('r2','p1',0.45),('r2','p2',0.4),\n", " ('r5','p2',0.4), ('r5','p4',0.3),\n", " ('r10','p1',0.4),\n", " ('r13','p2',0.35),('r13','p4',0.45),\n", " ('r3','p1',0.5), ('r3','p3',0.35),('r3','p5',0.4),\n", " ('r6','p3',0.4), ('r6','p5',0.35),\n", " ('r9','p7',0.45),('r9','p3',0.3),\n", " ('r4','p5',0.4),\n", " ('r7','p5',0.35),\n", " ('r11','p6',0.45),('r11','p7',0.35),\n", " ('r14','p6',0.4),\n", " ('r8','p4',0.35);\n", "UPDATE bid SET lbl = format('bid(%s,%s)', reviewer, paper);\n", "SELECT add_provenance('bid');\n", "SELECT set_prob(provenance(), conf) FROM bid;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- expertise(reviewer, topic): the reviewer's area of competence.\n", "-- KEY ON reviewer (not (reviewer, topic)): one area per reviewer. This\n", "-- functional dependency reviewer -> topic is exactly what makes a single\n", "-- paper's coverage query read-once (the safe plan groups on the topic and\n", "-- projects the reviewer out); widen the key to (reviewer, topic) and that\n", "-- query stops being safe. Several reviewers share each area on purpose\n", "-- (five in databases), so a paper's coverage lineage is genuinely\n", "-- entangled (the same topic_of leaf is shared across co-experts).\n", "DROP TABLE IF EXISTS expertise CASCADE;\n", "CREATE TABLE expertise (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " topic TEXT NOT NULL REFERENCES topics(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer)\n", ");\n", "INSERT INTO expertise(reviewer, topic, conf) VALUES\n", " ('r1','t1',0.55),('r2','t1',0.5), ('r5','t1',0.5), ('r10','t1',0.45),('r13','t1',0.45),\n", " ('r3','t2',0.55),('r6','t2',0.5), ('r9','t2',0.45),\n", " ('r4','t3',0.5), ('r7','t3',0.5), ('r11','t3',0.45),('r14','t3',0.45),\n", " ('r8','t4',0.55),('r12','t4',0.5);\n", "UPDATE expertise SET lbl = format('exp(%s,%s)', reviewer, topic);\n", "SELECT add_provenance('expertise');\n", "SELECT set_prob(provenance(), conf) FROM expertise;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- topic_of(paper, topic): the paper is about the topic. Papers overlap\n", "-- on topics (p1, p2, p4, p6 all touch databases), which is what couples\n", "-- their coverage when the paper variable is left free.\n", "DROP TABLE IF EXISTS topic_of CASCADE;\n", "CREATE TABLE topic_of (\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " topic TEXT NOT NULL REFERENCES topics(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (paper, topic) -- a paper lists a topic once\n", ");\n", "INSERT INTO topic_of(paper, topic, conf) VALUES\n", " ('p1','t1',0.6), ('p1','t2',0.55),\n", " ('p2','t1',0.55),('p2','t3',0.5),\n", " ('p3','t2',0.55),('p3','t4',0.45),\n", " ('p4','t1',0.5), ('p4','t4',0.5),\n", " ('p5','t3',0.55),('p5','t2',0.45),\n", " ('p6','t1',0.55),('p6','t3',0.5), ('p6','t4',0.4),\n", " ('p7','t2',0.5), ('p7','t3',0.45);\n", "UPDATE topic_of SET lbl = format('topof(%s,%s)', paper, topic);\n", "SELECT add_provenance('topic_of');\n", "SELECT set_prob(provenance(), conf) FROM topic_of;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- extends(citing, cited): paper `citing` builds on the earlier paper\n", "-- `cited`. ACYCLIC by construction (a paper only extends older ones), so\n", "-- the recursive \"what does p transitively build on?\" query is read-once\n", "-- per ancestor and works for any semiring (used in the recursive section\n", "-- with sr_formula and probability). Requires PostgreSQL 15+ to query.\n", "DROP TABLE IF EXISTS extends CASCADE;\n", "CREATE TABLE extends (\n", " citing TEXT NOT NULL REFERENCES papers(id),\n", " cited TEXT NOT NULL REFERENCES papers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (citing, cited)\n", ");\n", "INSERT INTO extends(citing, cited, conf) VALUES\n", " ('p4','p1',0.8), ('p5','p2',0.7), ('p5','p3',0.6),\n", " ('p6','p4',0.9), ('p6','p5',0.7), ('p7','p3',0.6);\n", "UPDATE extends SET lbl = format('ext(%s,%s)', citing, cited);\n", "SELECT add_provenance('extends');\n", "SELECT set_prob(provenance(), conf) FROM extends;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- coreview(a, b): reviewers a and b have served on a committee together.\n", "-- The relation is SYMMETRIC (both directions are stored), so the\n", "-- collaboration graph is CYCLIC -- the recursive \"who is reviewer r\n", "-- connected to?\" query only terminates under provsql.boolean_provenance,\n", "-- where it computes connection *reliability* (a network-reliability /\n", "-- #P-hard flavour). Requires PostgreSQL 15+ to query.\n", "DROP TABLE IF EXISTS coreview CASCADE;\n", "CREATE TABLE coreview (\n", " a TEXT NOT NULL REFERENCES reviewers(id),\n", " b TEXT NOT NULL REFERENCES reviewers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (a, b)\n", ");\n", "INSERT INTO coreview(a, b, conf) VALUES\n", " ('r1','r2',0.8),('r2','r1',0.8),\n", " ('r2','r3',0.7),('r3','r2',0.7),\n", " ('r1','r3',0.5),('r3','r1',0.5),\n", " ('r3','r4',0.6),('r4','r3',0.6),\n", " ('r4','r5',0.7),('r5','r4',0.7),\n", " ('r2','r5',0.4),('r5','r2',0.4);\n", "UPDATE coreview SET lbl = format('co(%s,%s)', a, b);\n", "SELECT add_provenance('coreview');\n", "SELECT set_prob(provenance(), conf) FROM coreview;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- assignment(reviewer, paper): the candidate papers a reviewer could be\n", "-- assigned to. repair_key on `reviewer` makes the rows for one reviewer\n", "-- mutually exclusive (each reviewer ends up on exactly one paper), so a\n", "-- query over this table carries correlated provenance.\n", "DROP TABLE IF EXISTS assignment CASCADE;\n", "CREATE TABLE assignment (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id)\n", ");\n", "INSERT INTO assignment(reviewer, paper) VALUES\n", " ('r1','p1'), ('r1','p4'),\n", " ('r2','p1'), ('r2','p2'),\n", " ('r3','p1'), ('r3','p3'),\n", " ('r4','p2'), ('r4','p5');\n", "SELECT repair_key('assignment', 'reviewer');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Inversion-free demo fixture (Step 7). A self-join witness that is NOT\n", "-- read-once yet is inversion-free, so a single global variable order\n", "-- compiles it in linear time while a generic compiler / tree decomposition\n", "-- chokes. Olga (r15) is a prolific bidder who skimmed a 24-paper\n", "-- submission batch (q01..q24); `recommend` and `champion` are two\n", "-- post-review signals (she recommended a paper for acceptance, and would\n", "-- champion it at the PC meeting). The witness query asks for a reviewer\n", "-- whose bids overlap both a recommendation and a championing; grouping on\n", "-- the reviewer shares the bid(r15,*) leaves between the two sides, so the\n", "-- lineage is not read-once but stays inversion-free (root = reviewer, a\n", "-- consistent-unification self-join on `bid`).\n", "--\n", "-- Kept isolated from the core instance so Steps 1-6 and 8 are unchanged:\n", "-- Olga has no `expertise` row and the submission papers carry no\n", "-- `topic_of`, so none of this data reaches the coverage / recursive\n", "-- queries (which all join through expertise / topic_of / the graphs).\n", "DELETE FROM bid WHERE reviewer = 'r15';\n", "DELETE FROM papers WHERE id LIKE 'q%';\n", "DELETE FROM reviewers WHERE id = 'r15';\n", "INSERT INTO reviewers VALUES ('r15','Olga');\n", "INSERT INTO papers SELECT 'q'||to_char(g,'FM00'), format('Submission %s', g)\n", " FROM generate_series(1,24) g;\n", "INSERT INTO bid(reviewer, paper, conf, lbl)\n", " SELECT 'r15', p.id, 0.5, format('bid(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT set_prob(provenance(), 0.5) FROM bid WHERE reviewer = 'r15';" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS recommend CASCADE;\n", "CREATE TABLE recommend (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper)\n", ");\n", "INSERT INTO recommend(reviewer, paper, lbl)\n", " SELECT 'r15', p.id, format('rec(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT add_provenance('recommend');\n", "SELECT set_prob(provenance(), 0.4) FROM recommend;\n", "SELECT create_provenance_mapping('recommend_label', 'recommend', 'lbl');\n", "ALTER TABLE recommend DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS champion CASCADE;\n", "CREATE TABLE champion (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper)\n", ");\n", "INSERT INTO champion(reviewer, paper, lbl)\n", " SELECT 'r15', p.id, format('champ(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT add_provenance('champion');\n", "SELECT set_prob(provenance(), 0.3) FROM champion;\n", "SELECT create_provenance_mapping('champion_label', 'champion', 'lbl');\n", "ALTER TABLE champion DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Label mappings so the Studio eval-strip's sr_formula / sr_why / sr_how\n", "-- and PROV-XML export name the leaves instead of showing raw UUIDs. The\n", "-- `lbl` columns exist only to feed create_provenance_mapping, which copies\n", "-- their values into standalone (value, provenance) tables.\n", "SELECT create_provenance_mapping('bid_label', 'bid', 'lbl');\n", "SELECT create_provenance_mapping('expertise_label','expertise','lbl');\n", "SELECT create_provenance_mapping('topic_of_label', 'topic_of', 'lbl');\n", "SELECT create_provenance_mapping('extends_label', 'extends', 'lbl');\n", "SELECT create_provenance_mapping('coreview_label', 'coreview', 'lbl');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- A single `label` mapping: the union of every per-tuple label above, so a\n", "-- text-based semiring (sr_formula / sr_why / sr_how) can name the leaves of\n", "-- a query spanning several relations through one mapping argument, rather\n", "-- than picking the relation-specific mapping each time.\n", "DROP TABLE IF EXISTS label CASCADE;\n", "CREATE TABLE label AS\n", " SELECT value, provenance FROM bid_label\n", " UNION ALL SELECT value, provenance FROM expertise_label\n", " UNION ALL SELECT value, provenance FROM topic_of_label\n", " UNION ALL SELECT value, provenance FROM extends_label\n", " UNION ALL SELECT value, provenance FROM coreview_label\n", " UNION ALL SELECT value, provenance FROM recommend_label\n", " UNION ALL SELECT value, provenance FROM champion_label;\n", "CREATE INDEX ON label(provenance);" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- `conf` only fed set_prob and `lbl` only fed create_provenance_mapping;\n", "-- both jobs are now done, so drop the artifact columns from the base tables.\n", "ALTER TABLE bid DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE expertise DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE topic_of DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE extends DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE coreview DROP COLUMN conf, DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confidences are seeded by the script (a per-row `conf` column fed to `set_prob`).\n", "\n", "The schema panel tags every relation added with `add_provenance` with a `prov-tid` pill -- their tuples are independent -- `assignment` (set up with `repair_key`) with a `prov-bid` pill marking it block-correlated, and leaves the dimension tables (`reviewers`, `papers`, `topics`) plain. Key columns are underlined: solid for a primary key, dotted for a `repair_key` grouping key (so `assignment`'s `reviewer` shows the dotted underline). The `*_label` mappings let the eval strip's `sr_formula` / `sr_why` / `sr_how` name the leaves.\n", "\n", "Most of the probability queries below are run twice, with `provsql.boolean_provenance` off and on. With the GUC **off** (the default) ProvSQL builds the provenance circuit literally from the join plan. With it **on**, ProvSQL's safe-query rewriter recognises hierarchical conjunctive queries -- including those made hierarchical by a key -- and rewrites them into a read-once form, so the linear-time `independent` method applies. In Studio the GUC is controlled by the `Boolean` position of the three-way `Semiring` / `Boolean` / `Where` provenance toggle (see `studio-query-toggles`); switching to `Semiring` turns it back off.\n", "\n", "## Step 1: Safe by Shape\n", "\n", "\"Which papers have a reviewer who bid on them and has some area of expertise?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM bid b, expertise e, papers p\n", "WHERE b.reviewer = e.reviewer AND b.paper = p.id\n", "GROUP BY p.id, p.title\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This query is **hierarchical** -- among its existential variables, the atoms mentioning `topic` (just `expertise`) are a subset of those mentioning `reviewer` (`bid` and `expertise`) -- so it is safe. Hierarchy is a property of the query alone, independent of any key: `topic` is a dangling existential, and a paper's lineage is `OR` over reviewers of `bid ∧ (the reviewer has some expertise)`, with nothing shared.\n", "\n", "Click into `p1`'s `provsql` cell. In Circuit mode the circuit is a shallow `⊕` of `⊗` pairs, nothing shared -- the shape ProvSQL's default evaluation order happens to build for this query. That read-once shape is not guaranteed in general: the literal circuit a plan materialises can reuse leaves and stop being read-once -- as the very next step shows -- which is what motivates the rewriter and the compilers of the steps that follow. In the eval strip, pick `Marginal probability` with the `independent` method: it returns `≈ 0.666` **exactly and instantly**, because `independent` is applicable precisely to read-once circuits. The same structure also compiles for free: pick *Compiled d-D circuit* with the *interpret as d-D* route and ProvSQL reads the Boolean circuit straight into a d-D --no [tree decomposition](https://en.wikipedia.org/wiki/Tree_decomposition), no external compiler -- a handful of gates, because a read-once circuit can be read off as a d-D in a single structural pass, with no search. *Tree decomposition* reaches the same place from the other side ([treewidth](https://en.wikipedia.org/wiki/Treewidth) **1**, a similarly small d-DNNF), as does any external compiler. This works the same with `boolean_provenance` off or on.\n", "\n", "## Step 2: Safe by a Key\n", "\n", "Now a sharper, genuinely entangled question -- \"is paper `p1` **competently covered**: did some reviewer bid on it **and** is that reviewer an expert in one of `p1`'s own topics?\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer\n", " AND e.topic = t.topic\n", " AND b.paper = 'p1' AND t.paper = 'p1'" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the fixed paper `p1`, this is the opposite of Step 1's shape: `bid` mentions only `reviewer`, `topic_of` only `topic`, and `expertise` mentions both, so neither variable's atoms nest inside the other's -- the query is **non-hierarchical**, the hard case. On our data the entanglement is concrete: Alice, Bob and Judy are all database experts who bid on `p1`, and `p1` is about `databases` (topic `t1`), so the single tuple `topic_of(p1, t1)` is shared across *three* of `p1`'s coverage disjuncts. (A fourth disjunct, Carol via `logic` -- `p1`'s other topic `t2` -- reuses `topic_of(p1, t2)` the same way.)\n", "\n", "The fixture ships a single `label` mapping -- the union of every relation's per-tuple labels -- so a text-based [semiring](https://en.wikipedia.org/wiki/Semiring) can name the leaves of a query spanning several relations through one mapping argument. In the eval strip, `sr_formula` with the `label` mapping prints `p1`'s coverage as a disjunction of four `bid ⊗ exp ⊗ topof` products, with `topof(p1,t1)` written out in three of them -- the shared leaf, in plain sight; `sr_why` and `sr_how` read the same mapping.\n", "\n", "Run it with `boolean_provenance` **off** and `independent`: it **errors** -- *\"Not an independent circuit\"*. The literal circuit (treewidth 2) reuses the `topic_of(p1, t1)` leaf, so it is not read-once.\n", "\n", "Now turn `boolean_provenance` **on** and run `independent` again: it succeeds, returning `≈ 0.4259`, matching *Tree decomposition* exactly. The key is what makes the difference. Because `expertise` has a primary key on `reviewer` (`reviewer` $`\\to`$ `topic`), each reviewer contributes a single topic, so the safe plan can **group the bidding experts by topic and factor the shared** `topic_of` **leaf out**:\n", "\n", "``` math\n", "\\bigvee_{t} \\; \\mathit{topic\\_of}(p_1, t) \\;\\wedge\\;\n", " \\Bigl(\\bigvee_{r:\\,\\mathit{exp}(r,t)} \\mathit{bid}(r, p_1) \\wedge\n", " \\mathit{exp}(r, t)\\Bigr).\n", "```\n", "\n", "Each leaf now appears once: the lineage *is* read-once, and the rewriter emits exactly this factorisation. This is the case study's first lesson: **safety is a property of the query and the integrity constraints together**, not of the query alone.\n", "\n", "To watch the key do the work, drop it and re-run the coverage query's `independent` evaluation under `boolean_provenance` **on**:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "ALTER TABLE expertise DROP CONSTRAINT expertise_pkey;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the functional dependency `reviewer` $`\\to`$ `topic` gone, a reviewer may span several areas, the question becomes genuinely non-hierarchical, and `independent` **errors again** -- even with `boolean_provenance` on, the rewriter can no longer factor the shared `topic_of` leaf out. Add the key back to restore safety:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "ALTER TABLE expertise DROP CONSTRAINT IF EXISTS expertise_pkey;\n", "ALTER TABLE expertise ADD PRIMARY KEY (reviewer);" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Genuinely Hard\n", "\n", "The key rescued *one paper*. Ask the same thing about the **whole program** and it does not:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer\n", " AND e.topic = t.topic\n", " AND t.paper = b.paper" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\"Is **any** paper competently covered?\" Here `paper` is no longer fixed: it is a third existential variable, and `bid` and `topic_of` both mention it. Now the three variables `reviewer`, `paper` and `topic` form a cycle with no nesting anywhere, whose probability is $`\\#P`$-hard in general. The primary key on `expertise` does **not** save it: under the FD the topic is still shared across papers, so after the FD reduction the query remains non-hierarchical (`reviewer` and `paper` overlap on `bid` without nesting). Turn `boolean_provenance` on and the provenance circuit is unchanged -- the safe-query rewriter finds no safe plan and leaves the literal circuit in place -- so `independent` errors just as it does with the GUC off. (Raising `provsql.verbose_level` would log the rewriter falling through, but by default you simply observe that the circuit, and the error, are the same.)\n", "\n", "The result is a single row, so Studio displays its circuit automatically: it is visibly bushy. Run `Marginal probability` with `tree-decomposition` (or any external compiler): it succeeds, returning `≈ 0.8818`. The treewidth is **4** (a property of the circuit, not the compiler), against **1** for the safe query; compiling it (the *Compiled d-D circuit* option with, say, `d4`) yields a d-DNNF of order a thousand nodes -- against the safe query's few dozen -- though the precise representation, and its size, depend on the compiler. The hardness is in the shape, and a real compiler is what turns it into a number.\n", "\n", "## Step 4: From Circuit to CNF, and Back\n", "\n", "Pin the hard query's provenance and pick *Tseytin CNF* from the knowledge-compilation group of the eval strip. The panel shows the DIMACS [CNF](https://en.wikipedia.org/wiki/Conjunctive_normal_form) ProvSQL streams to an external compiler, with one `c input` comment line per variable recording which provenance input it stands for; Studio annotates each with the source tuple it resolves to. This is what lets you take a satisfying assignment or a [weighted count](https://en.wikipedia.org/wiki/Sharp-SAT) from an external tool and read it back against the reviewing data. The same mapping is available as a table through `tseytin_cnf_mapping`.\n", "\n", "## Step 5: Compile, Measure, Compare\n", "\n", "Pick *Compiled d-D circuit* and a compiler (`d4`, `c2d`…): Studio renders the d-DNNF on the canvas with a gates / edges / depth summary. `ddnnf_stats` exposes the same numbers as `jsonb` (node and edge counts, the AND/OR/NOT/input split, smoothness, depth, treewidth, compile time), so the same circuit compiled by different tools can be compared quantitatively rather than by eye.\n", "\n", "For the whole picture at once, pick *Probability benchmark*: it times every probability method on the hard circuit, one row each, and its `d-DNNF (n/e)` column shows the compiled size next to the run time. The exact methods that finish (`tree-decomposition`, the `compilation` tools, the model counters) agree to full precision; `monte-carlo` lands in its confidence band; `independent` shows its error; and backends that do not scale to this circuit -- such as `possible-worlds` enumeration -- hit the `statement_timeout`. To export the compiled circuit itself, pick *Compiled d-D (NNF text)*: the copy button yields a c2d/d4 `.nnf` file whose variable numbering matches the CNF from Step 4.\n", "\n", "Which compilers and counters appear in those pickers depends on what is installed. The **Tools** panel (the wrench icon in the top nav) lists the external-tool registry -- every `compile` and `wmc` tool ProvSQL knows, each flagged whether it is *available* on your backend (its binary resolves on the server's `PATH`) and *enabled* -- read live from `provsql.tools` (see `/user/tool-registry`). The compiler dropdown and the benchmark offer only the available, enabled ones, so the rows you see reflect your machine.\n", "\n", "## Step 6: A Shortcut Before Compilation\n", "\n", "Some queries never reach the compiler at all, because a **probability-side pre-pass** resolves part of the circuit first. Ask for papers with at least two bidding experts:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM bid b, expertise e, papers p\n", "WHERE b.reviewer = e.reviewer AND b.paper = p.id\n", "GROUP BY p.id, p.title\n", "HAVING count(*) >= 2\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `HAVING count(*) >= 2` clause becomes a comparison gate over the group's contributing rows. When computing the probability of query results, before any d-DNNF compiler runs, `probability_evaluate` folds the provenance with a closed-form [Poisson-binomial](https://en.wikipedia.org/wiki/Poisson_binomial_distribution) evaluation, replacing the entire provenance with a Bernoulli gate. Compute the probability for one of the papers (say, `p1`): ProvSQL emits a NOTICE reporting that the `gate_cmp` was shortcut by the pre-pass and how many gates the circuit shrank by. All probability methods (even `independent`) compute the probability easily.\n", "\n", "## Step 7: Hard-Looking, Secretly Linear\n", "\n", "Steps 3-5 sent a $`\\#P`$-hard query to a compiler; Step 6 skipped the compiler with a count-threshold shortcut. Here is a third escape: a query whose lineage is **not** read-once -- so `independent` rejects it, and a generic compiler treats it as hard -- yet whose *shape* admits a linear-size [OBDD](https://en.wikipedia.org/wiki/Binary_decision_diagram). ProvSQL recognises this **inversion-free** class (the `inversion-free` method described in the chapter on probabilities) at planning time and evaluates it with a structured d-DNNF built over a query-derived variable order, instead of reaching for a compiler.\n", "\n", "For this step the fixture adds one prolific bidder, **Olga** (`r15`), who skimmed a 24-paper submission batch (`q01` to `q24`), and two post-review signals: `recommend(reviewer, paper)` -- she recommended the paper for acceptance -- and `champion(reviewer, paper)` -- she would champion it at the PC meeting. Ask for a reviewer whose bids overlap **both** a recommendation **and** a championing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT r.id, r.name\n", "FROM bid b1, recommend a, bid b2, champion c, reviewers r\n", "WHERE b1.reviewer = a.reviewer AND b1.paper = a.paper\n", " AND b1.reviewer = b2.reviewer\n", " AND b2.reviewer = c.reviewer AND b2.paper = c.paper\n", " AND b1.reviewer = r.id\n", "GROUP BY r.id, r.name" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Grouping on the reviewer `OR`s the two evidence sides over all of Olga's papers, and both sides **share** the `bid(r15, *)` leaves -- so the lineage is not read-once and `independent` rejects it. But it is a *consistent-unification self-join* on `bid` with a single root variable (`reviewer`), which is exactly the inversion-free condition. In Circuit mode the per-row root carries a teal `IF` badge (the inversion-free certificate); pin it to read the certificate header and the variable-block order.\n", "\n", "Now compare methods on that one row. `tree-decomposition` gives up (*\"Treewidth greater than 10\"*); `possible-worlds` refuses (the witness has more than 64 inputs); and a real compiler does not finish -- with a `statement_timeout` set, `d4` is cut off on the 24-paper instance, exactly as the hard query of Step 3 was. Yet `Marginal\n", "probability` with the `inversion-free` method -- and the **default** method, which takes the inversion-free rung automatically once the certificate is present -- returns `0.975314` in milliseconds. The query *looks* as hard as Step 3 to every circuit-level method, but its tractability is visible in the query, not in the materialised circuit: only the planner's query-derived order recovers it. To see that order made concrete, pick *Compiled d-D circuit* with the *inversion-free* route: the d-DNNF it builds is strikingly **deep** -- a long chain of decisions following the query-derived variable order, the shape of an OBDD rather than the bushy d-DNNF a treewidth-based compiler produces.\n", "\n", "## Step 8: Correlation via `repair_key`\n", "\n", "Finally, the `assignment` table lists, per reviewer, the papers they *could* be assigned to, made mutually exclusive by `repair_key` on `reviewer` (each reviewer ends up on exactly one paper; the schema panel marks `reviewer` with the dotted grouping-key underline). Ask which papers get an assigned reviewer:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM assignment a JOIN papers p ON a.paper = p.id\n", "GROUP BY p.id, p.title\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click a result row's `provsql` cell: the circuit carries the mutual-exclusion structure (a reviewer's candidate papers cannot both be true), encoded as `mulinput` gates that share a block key. ProvSQL captures this correlation in the circuit, so the probability accounts for it -- `p1` (covered if Alice, Bob **or** Carol is assigned to it) gets `0.875`, `p2` `0.75` -- and **every** method agrees, including `independent`: its evaluator gives mutually exclusive `mulinput` siblings special treatment (summing their probabilities within a block rather than multiplying), so this kind of block correlation, unlike the non-hierarchical cycle of Step 3, stays tractable without a compiler.\n", "\n", "## Step 9: Recursive Lineage -- Reachability and Reliability\n", "\n", "**Note:**\n", "\n", "> Recursive queries (`WITH RECURSIVE`) require **PostgreSQL ≥ 15**.\n", "\n", "Every query so far has been a single conjunctive block. ProvSQL also tracks provenance through `WITH RECURSIVE` (set semantics): the recursive CTE is transparently evaluated to a fixpoint, and the result carries provenance like any other query -- the provenance of a reachability answer is the disjunction over the paths that reach it. Two more tuple-independent relations exercise the two recursive regimes: `extends(citing, cited)` -- a paper builds on an earlier one -- is a directed **acyclic** citation graph, while `coreview(a, b)` -- two reviewers served on a committee together -- is **symmetric**, hence cyclic.\n", "\n", "**Acyclic data works for any semiring.** Ask what paper `p6` transitively builds on:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "WITH RECURSIVE anc(paper) AS (\n", " SELECT 'p6'\n", " UNION\n", " SELECT e.cited FROM extends e JOIN anc a ON e.citing = a.paper\n", ")\n", "SELECT p.id, p.title\n", "FROM anc JOIN papers p ON anc.paper = p.id\n", "WHERE anc.paper <> 'p6' ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the eval strip, `sr_formula` (with the `extends_label` mapping) shows each ancestor's lineage as the conjunction along the path -- `p1` is reached as `ext(p4,p1) ⊗ ext(p6,p4)` -- and `Marginal probability` returns its value (`p1`: `0.72`, `p5`: `0.70`). Because the graph is acyclic the lineage is read-once per ancestor, and *every* semiring evaluation is available, not just probability.\n", "\n", "**Cyclic data needs Boolean provenance.** `coreview` is symmetric, so \"who is reviewer `r1` connected to?\" walks a cyclic graph:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": { "scheme": "boolean" } }, "source": [ "WITH RECURSIVE conn(node) AS (\n", " SELECT 'r1'\n", " UNION\n", " SELECT e.b FROM coreview e JOIN conn c ON e.a = c.node\n", ")\n", "SELECT r.id, r.name\n", "FROM conn JOIN reviewers r ON conn.node = r.id\n", "WHERE conn.node <> 'r1' ORDER BY r.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the probability with `boolean_provenance` **off** and the fixpoint never stabilises -- ProvSQL stops with *\"no fixpoint after N rounds (cyclic data?)\"*: under a general semiring a cycle keeps contributing new derivations forever. Turn `boolean_provenance` **on** and the value converges (Boolean provenance is *absorptive*: a longer, cycle-revisiting derivation is absorbed by the shorter one it contains), so the connection probability is well defined. It is exactly **network reliability** -- the probability that `r1` stays connected to the target when each collaboration edge is present independently -- which is $`\\#P`$-hard in general. Here `r1` reaches `r5` with reliability `0.5496`; the circuit is a genuine reliability circuit, so `independent` cannot evaluate it, while `tree-decomposition` and the compilers can.\n", "\n", "This is the same knowledge-compilation story as Step 3, reached through a fixpoint instead of a fixed join pattern: reachability over a probabilistic graph is where the provenance circuit, and a real compiler, are indispensable. Note that cyclic recursion under `boolean_provenance` is sound for absorptive evaluation (probability, Boolean) but not for multiplicity-counting semirings.\n", "\n", "**See also:**\n", "\n", "> - The knowledge-compilation chapter for the full pipeline and every function used here.\n", "> - The chapter on probabilities for the probability methods.\n", "> - The *Probabilistic Databases* synthesis lecture for the theory of safe queries and the dichotomy." ] } ] }