{ "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "provsql-studio", "display_name": "ProvSQL (SQL)", "language": "sql" }, "language_info": { "name": "sql" }, "provsql": { "scheme": "semiring", "database": "cs7", "generated_from": "doc/source/user/casestudy7.rst" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Case Study: Peer-Review Assignment and Knowledge Compilation\n", "\n", "A program chair has a database of uncertain facts about a conference – who bid on what, who is expert in what, which papers are assigned – and keeps asking *probability* questions of it: how likely is it that every paper is competently covered? that this reviewer is conflicted? Each question is a SQL query whose answer carries a probability, and the interesting thing is that the same kind of question can be easy or $\\\\#P$-hard **depending on its exact shape, the schema’s keys, and the data** – and ProvSQL routes each to a different mechanism.\n", "\n", "This case study walks that landscape over one reviewing dataset, driven through [ProvSQL Studio](https://provsql.org/docs/user/studio.html), organised by *where the tractability comes from*:\n", "\n", "- **Part A – the query is safe.** Some questions are tractable whatever the data, because the query (with the schema’s keys) is *safe*. Four steps cover the four ways that happens.\n", "- **Part B – the query is hard.** When no query-side route applies the answer is genuinely $\\\\#P$-hard, and ProvSQL hands it to a knowledge compiler.\n", "- **Part C – the data is well-structured.** A hard *query* can still be easy on *this* data, because ProvSQL compiles along the structure of the data itself.\n", "\n", "## The data\n", "\n", "Several relations carry per-tuple probabilities. Three drive the coverage questions:\n", "\n", "- `bid(reviewer, paper)` – a reviewer offered to review a paper; the probability is how firm the bid is.\n", "- `expertise(reviewer, topic)` – the reviewer’s area of competence.\n", "- `topic_of(paper, topic)` – the paper is about a topic.\n", "\n", "The instance has 14 reviewers, 4 topics and 7 papers. **One modelling choice matters throughout:** `expertise` has a **primary key on** `reviewer` – each reviewer has exactly one area (the functional dependency `reviewer` $\\to$ `topic`). Several reviewers share each area on purpose (five do databases), so a paper’s coverage is genuinely entangled: the same `topic_of` tuple is shared by every co-expert who bid on the paper. The remaining relations – `recommend` / `champion`, an external-review pool, `assignment`, and two citation / collaboration graphs – are introduced where first used.\n", "\n", "## Setup" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*The following cells set up the database with all the content this notebook requires; run them first, ideally on a fresh database.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS bid CASCADE;\n", "DROP TABLE IF EXISTS expertise CASCADE;\n", "DROP TABLE IF EXISTS topic_of CASCADE;\n", "DROP TABLE IF EXISTS extends CASCADE;\n", "DROP TABLE IF EXISTS coreview CASCADE;\n", "DROP TABLE IF EXISTS assignment CASCADE;\n", "DROP TABLE IF EXISTS reviewers CASCADE;\n", "DROP TABLE IF EXISTS papers CASCADE;\n", "DROP TABLE IF EXISTS topics CASCADE;\n", "DROP TABLE IF EXISTS lead_chair, urgent_sub, prescreen, score_pass, flag_pass CASCADE;\n", "DROP TABLE IF EXISTS bid_label, expertise_label, topic_of_label,\n", " extends_label, coreview_label,\n", " lead_chair_label, urgent_sub_label, prescreen_label,\n", " score_pass_label, flag_pass_label;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Dimension tables (deterministic, no provenance).\n", "DROP TABLE IF EXISTS reviewers CASCADE;\n", "CREATE TABLE reviewers (id TEXT PRIMARY KEY, name TEXT NOT NULL);\n", "INSERT INTO reviewers VALUES\n", " ('r1','Alice'), ('r2','Bob'), ('r3','Carol'), ('r4','Dave'),\n", " ('r5','Eve'), ('r6','Frank'), ('r7','Grace'), ('r8','Heidi'),\n", " ('r9','Ivan'), ('r10','Judy'), ('r11','Karl'), ('r12','Lara'),\n", " ('r13','Mona'), ('r14','Nick');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS papers CASCADE;\n", "CREATE TABLE papers (id TEXT PRIMARY KEY, title TEXT NOT NULL);\n", "INSERT INTO papers VALUES\n", " ('p1', 'A Provenance Circuit Calculus'),\n", " ('p2', 'Treewidth Bounds for Safe Queries'),\n", " ('p3', 'Sampling Beats Compilation, Sometimes'),\n", " ('p4', 'Functional Dependencies and Read-Once Lineage'),\n", " ('p5', 'A Dichotomy for Conjunctive Coverage'),\n", " ('p6', 'Knowledge Compilation in Practice'),\n", " ('p7', 'Repair-Key Semantics for Assignments');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS topics CASCADE;\n", "CREATE TABLE topics (id TEXT PRIMARY KEY, name TEXT NOT NULL);\n", "INSERT INTO topics VALUES\n", " ('t1','databases'), ('t2','logic'), ('t3','systems'), ('t4','theory');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- bid(reviewer, paper): the reviewer offered to review the paper; conf\n", "-- is how firm the bid is. Several databases-experts bid on p1, which is\n", "-- what makes p1's coverage interesting.\n", "DROP TABLE IF EXISTS bid CASCADE;\n", "CREATE TABLE bid (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper) -- a reviewer bids on a paper once\n", ");\n", "INSERT INTO bid(reviewer, paper, conf) VALUES\n", " ('r1','p1',0.5), ('r1','p4',0.35),\n", " ('r2','p1',0.45),('r2','p2',0.4),\n", " ('r5','p2',0.4), ('r5','p4',0.3),\n", " ('r10','p1',0.4),\n", " ('r13','p2',0.35),('r13','p4',0.45),\n", " ('r3','p1',0.5), ('r3','p3',0.35),('r3','p5',0.4),\n", " ('r6','p3',0.4), ('r6','p5',0.35),\n", " ('r9','p7',0.45),('r9','p3',0.3),\n", " ('r4','p5',0.4),\n", " ('r7','p5',0.35),\n", " ('r11','p6',0.45),('r11','p7',0.35),\n", " ('r14','p6',0.4),\n", " ('r8','p4',0.35);\n", "UPDATE bid SET lbl = format('bid(%s,%s)', reviewer, paper);\n", "SELECT add_provenance('bid');\n", "SELECT set_prob(provenance(), conf) FROM bid;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- expertise(reviewer, topic): the reviewer's area of competence.\n", "-- KEY ON reviewer (not (reviewer, topic)): one area per reviewer. This\n", "-- functional dependency reviewer -> topic is exactly what makes a single\n", "-- paper's coverage query read-once (the safe plan groups on the topic and\n", "-- projects the reviewer out); widen the key to (reviewer, topic) and that\n", "-- query stops being safe. Several reviewers share each area on purpose\n", "-- (five in databases), so a paper's coverage lineage is genuinely\n", "-- entangled (the same topic_of leaf is shared across co-experts).\n", "DROP TABLE IF EXISTS expertise CASCADE;\n", "CREATE TABLE expertise (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " topic TEXT NOT NULL REFERENCES topics(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer)\n", ");\n", "INSERT INTO expertise(reviewer, topic, conf) VALUES\n", " ('r1','t1',0.55),('r2','t1',0.5), ('r5','t1',0.5), ('r10','t1',0.45),('r13','t1',0.45),\n", " ('r3','t2',0.55),('r6','t2',0.5), ('r9','t2',0.45),\n", " ('r4','t3',0.5), ('r7','t3',0.5), ('r11','t3',0.45),('r14','t3',0.45),\n", " ('r8','t4',0.55),('r12','t4',0.5);\n", "UPDATE expertise SET lbl = format('exp(%s,%s)', reviewer, topic);\n", "SELECT add_provenance('expertise');\n", "SELECT set_prob(provenance(), conf) FROM expertise;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- topic_of(paper, topic): the paper is about the topic. Papers overlap\n", "-- on topics (p1, p2, p4, p6 all touch databases), which is what couples\n", "-- their coverage when the paper variable is left free.\n", "DROP TABLE IF EXISTS topic_of CASCADE;\n", "CREATE TABLE topic_of (\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " topic TEXT NOT NULL REFERENCES topics(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (paper, topic) -- a paper lists a topic once\n", ");\n", "INSERT INTO topic_of(paper, topic, conf) VALUES\n", " ('p1','t1',0.6), ('p1','t2',0.55),\n", " ('p2','t1',0.55),('p2','t3',0.5),\n", " ('p3','t2',0.55),('p3','t4',0.45),\n", " ('p4','t1',0.5), ('p4','t4',0.5),\n", " ('p5','t3',0.55),('p5','t2',0.45),\n", " ('p6','t1',0.55),('p6','t3',0.5), ('p6','t4',0.4),\n", " ('p7','t2',0.5), ('p7','t3',0.45);\n", "UPDATE topic_of SET lbl = format('topof(%s,%s)', paper, topic);\n", "SELECT add_provenance('topic_of');\n", "SELECT set_prob(provenance(), conf) FROM topic_of;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- extends(citing, cited): paper `citing` builds on the earlier paper\n", "-- `cited`. ACYCLIC by construction (a paper only extends older ones), so\n", "-- the recursive \"what does p transitively build on?\" query is read-once\n", "-- per ancestor and works for any semiring (used in the recursive section\n", "-- with sr_formula and probability). Requires PostgreSQL 15+ to query.\n", "DROP TABLE IF EXISTS extends CASCADE;\n", "CREATE TABLE extends (\n", " citing TEXT NOT NULL REFERENCES papers(id),\n", " cited TEXT NOT NULL REFERENCES papers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (citing, cited)\n", ");\n", "INSERT INTO extends(citing, cited, conf) VALUES\n", " ('p4','p1',0.8), ('p5','p2',0.7), ('p5','p3',0.6),\n", " ('p6','p4',0.9), ('p6','p5',0.7), ('p7','p3',0.6);\n", "UPDATE extends SET lbl = format('ext(%s,%s)', citing, cited);\n", "SELECT add_provenance('extends');\n", "SELECT set_prob(provenance(), conf) FROM extends;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- coreview(a, b): reviewers a and b have served on a committee together.\n", "-- The relation is SYMMETRIC (both directions are stored), so the\n", "-- collaboration graph is CYCLIC -- the recursive \"who is reviewer r\n", "-- connected to?\" query only terminates under provsql.boolean_provenance,\n", "-- where it computes connection *reliability* (a network-reliability /\n", "-- #P-hard flavour). Requires PostgreSQL 15+ to query.\n", "DROP TABLE IF EXISTS coreview CASCADE;\n", "CREATE TABLE coreview (\n", " a TEXT NOT NULL REFERENCES reviewers(id),\n", " b TEXT NOT NULL REFERENCES reviewers(id),\n", " conf DOUBLE PRECISION NOT NULL,\n", " lbl TEXT,\n", " PRIMARY KEY (a, b)\n", ");\n", "INSERT INTO coreview(a, b, conf) VALUES\n", " ('r1','r2',0.8),('r2','r1',0.8),\n", " ('r2','r3',0.7),('r3','r2',0.7),\n", " ('r1','r3',0.5),('r3','r1',0.5),\n", " ('r3','r4',0.6),('r4','r3',0.6),\n", " ('r4','r5',0.7),('r5','r4',0.7),\n", " ('r2','r5',0.4),('r5','r2',0.4);\n", "UPDATE coreview SET lbl = format('co(%s,%s)', a, b);\n", "SELECT add_provenance('coreview');\n", "SELECT set_prob(provenance(), conf) FROM coreview;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- assignment(reviewer, paper): the candidate papers a reviewer could be\n", "-- assigned to. repair_key on `reviewer` makes the rows for one reviewer\n", "-- mutually exclusive (each reviewer ends up on exactly one paper), so a\n", "-- query over this table carries correlated provenance.\n", "DROP TABLE IF EXISTS assignment CASCADE;\n", "CREATE TABLE assignment (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id)\n", ");\n", "INSERT INTO assignment(reviewer, paper) VALUES\n", " ('r1','p1'), ('r1','p4'),\n", " ('r2','p1'), ('r2','p2'),\n", " ('r3','p1'), ('r3','p3'),\n", " ('r4','p2'), ('r4','p5');\n", "SELECT repair_key('assignment', 'reviewer');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Inversion-free demo fixture (Step 7). A self-join witness that is NOT\n", "-- read-once yet is inversion-free, so a single global variable order\n", "-- compiles it in linear time while a generic compiler / tree decomposition\n", "-- chokes. Olga (r15) is a prolific bidder who skimmed a 24-paper\n", "-- submission batch (q01..q24); `recommend` and `champion` are two\n", "-- post-review signals (she recommended a paper for acceptance, and would\n", "-- champion it at the PC meeting). The witness query asks for a reviewer\n", "-- whose bids overlap both a recommendation and a championing; grouping on\n", "-- the reviewer shares the bid(r15,*) leaves between the two sides, so the\n", "-- lineage is not read-once but stays inversion-free (root = reviewer, a\n", "-- consistent-unification self-join on `bid`).\n", "--\n", "-- Kept isolated from the core instance so Steps 1-6 and 8 are unchanged:\n", "-- Olga has no `expertise` row and the submission papers carry no\n", "-- `topic_of`, so none of this data reaches the coverage / recursive\n", "-- queries (which all join through expertise / topic_of / the graphs).\n", "DELETE FROM bid WHERE reviewer = 'r15';\n", "DELETE FROM papers WHERE id LIKE 'q%';\n", "DELETE FROM reviewers WHERE id = 'r15';\n", "INSERT INTO reviewers VALUES ('r15','Olga');\n", "INSERT INTO papers SELECT 'q'||to_char(g,'FM00'), format('Submission %s', g)\n", " FROM generate_series(1,24) g;\n", "INSERT INTO bid(reviewer, paper, conf, lbl)\n", " SELECT 'r15', p.id, 0.5, format('bid(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT set_prob(provenance(), 0.5) FROM bid WHERE reviewer = 'r15';" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS recommend CASCADE;\n", "CREATE TABLE recommend (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper)\n", ");\n", "INSERT INTO recommend(reviewer, paper, lbl)\n", " SELECT 'r15', p.id, format('rec(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT add_provenance('recommend');\n", "SELECT set_prob(provenance(), 0.4) FROM recommend;\n", "SELECT create_provenance_mapping('recommend_label', 'recommend', 'lbl');\n", "ALTER TABLE recommend DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "DROP TABLE IF EXISTS champion CASCADE;\n", "CREATE TABLE champion (\n", " reviewer TEXT NOT NULL REFERENCES reviewers(id),\n", " paper TEXT NOT NULL REFERENCES papers(id),\n", " lbl TEXT,\n", " PRIMARY KEY (reviewer, paper)\n", ");\n", "INSERT INTO champion(reviewer, paper, lbl)\n", " SELECT 'r15', p.id, format('champ(r15,%s)', p.id)\n", " FROM papers p WHERE p.id LIKE 'q%';\n", "SELECT add_provenance('champion');\n", "SELECT set_prob(provenance(), 0.3) FROM champion;\n", "SELECT create_provenance_mapping('champion_label', 'champion', 'lbl');\n", "ALTER TABLE champion DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Möbius-cancellation demo fixture (Part A, \"safe by cancellation\"). The\n", "-- canonical Dalvi-Suciu witness q9 / QW: a UNION of conjunctive queries\n", "-- that is SAFE -- PTIME data complexity -- yet only because the #P-hard\n", "-- term of its inclusion-exclusion expansion carries a zero Möbius\n", "-- coefficient and cancels. No circuit-level method keeps up (q9 has no\n", "-- polynomial-size OBDD / FBDD / d-DNNF), and on dense data its *joint*\n", "-- treewidth is unbounded too, so the joint-width route of Part C declines\n", "-- and the Möbius route -- which reads the structure off the QUERY, not the\n", "-- data -- is the only exact one. That is why this fixture must be DENSE:\n", "-- on a sparse instance the joint width is bounded and Part C's compiler\n", "-- would handle it, hiding the route this step exists to show.\n", "--\n", "-- Dressed as a second-tier external review pool, isolated from the core\n", "-- instance (its own c*/e* domain, no foreign keys, so it reaches none of\n", "-- the coverage / recursive queries): four area chairs (c1..c4) each run\n", "-- three independent assessment passes -- a prescreen, a score and a flag --\n", "-- over four embargoed submissions (e1..e4), COMPLETE bipartite so the joint\n", "-- width genuinely grows; `lead_chair` marks the senior chairs and\n", "-- `urgent_sub` the time-critical submissions. Mapped onto q9's\n", "-- R(x), S1(x,y), S2(x,y), S3(x,y), T(y) as\n", "-- lead_chair / prescreen / score_pass / flag_pass / urgent_sub. The union\n", "-- query has no tidy English gloss -- its safety is purely structural -- and\n", "-- that is exactly the point. All tuples carry probability 0.1, low enough\n", "-- that the exact answer stays away from saturation.\n", "DROP TABLE IF EXISTS lead_chair, urgent_sub, prescreen, score_pass, flag_pass CASCADE;\n", "CREATE TABLE lead_chair (chair TEXT NOT NULL, conf DOUBLE PRECISION, lbl TEXT, PRIMARY KEY(chair));\n", "CREATE TABLE urgent_sub (sub TEXT NOT NULL, conf DOUBLE PRECISION, lbl TEXT, PRIMARY KEY(sub));\n", "CREATE TABLE prescreen (chair TEXT, sub TEXT, conf DOUBLE PRECISION, lbl TEXT, PRIMARY KEY(chair, sub));\n", "CREATE TABLE score_pass (chair TEXT, sub TEXT, conf DOUBLE PRECISION, lbl TEXT, PRIMARY KEY(chair, sub));\n", "CREATE TABLE flag_pass (chair TEXT, sub TEXT, conf DOUBLE PRECISION, lbl TEXT, PRIMARY KEY(chair, sub));\n", "INSERT INTO lead_chair(chair, conf) SELECT 'c'||g, 0.1 FROM generate_series(1,4) g;\n", "INSERT INTO urgent_sub(sub, conf) SELECT 'e'||g, 0.1 FROM generate_series(1,4) g;\n", "INSERT INTO prescreen(chair, sub, conf)\n", " SELECT 'c'||i, 'e'||j, 0.1 FROM generate_series(1,4) i, generate_series(1,4) j;\n", "INSERT INTO score_pass(chair, sub, conf)\n", " SELECT 'c'||i, 'e'||j, 0.1 FROM generate_series(1,4) i, generate_series(1,4) j;\n", "INSERT INTO flag_pass(chair, sub, conf)\n", " SELECT 'c'||i, 'e'||j, 0.1 FROM generate_series(1,4) i, generate_series(1,4) j;\n", "UPDATE lead_chair SET lbl = format('lead(%s)', chair);\n", "UPDATE urgent_sub SET lbl = format('urgent(%s)', sub);\n", "UPDATE prescreen SET lbl = format('pre(%s,%s)', chair, sub);\n", "UPDATE score_pass SET lbl = format('score(%s,%s)', chair, sub);\n", "UPDATE flag_pass SET lbl = format('flag(%s,%s)', chair, sub);\n", "SELECT add_provenance('lead_chair'); SELECT add_provenance('urgent_sub');\n", "SELECT add_provenance('prescreen'); SELECT add_provenance('score_pass');\n", "SELECT add_provenance('flag_pass');\n", "SELECT set_prob(provenance(), conf) FROM lead_chair;\n", "SELECT set_prob(provenance(), conf) FROM urgent_sub;\n", "SELECT set_prob(provenance(), conf) FROM prescreen;\n", "SELECT set_prob(provenance(), conf) FROM score_pass;\n", "SELECT set_prob(provenance(), conf) FROM flag_pass;\n", "SELECT create_provenance_mapping('lead_chair_label', 'lead_chair', 'lbl');\n", "SELECT create_provenance_mapping('urgent_sub_label', 'urgent_sub', 'lbl');\n", "SELECT create_provenance_mapping('prescreen_label', 'prescreen', 'lbl');\n", "SELECT create_provenance_mapping('score_pass_label', 'score_pass', 'lbl');\n", "SELECT create_provenance_mapping('flag_pass_label', 'flag_pass', 'lbl');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- Label mappings so the Studio eval-strip's sr_formula / sr_why / sr_how\n", "-- and PROV-XML export name the leaves instead of showing raw UUIDs. The\n", "-- `lbl` columns exist only to feed create_provenance_mapping, which copies\n", "-- their values into standalone (value, provenance) tables.\n", "SELECT create_provenance_mapping('bid_label', 'bid', 'lbl');\n", "SELECT create_provenance_mapping('expertise_label','expertise','lbl');\n", "SELECT create_provenance_mapping('topic_of_label', 'topic_of', 'lbl');\n", "SELECT create_provenance_mapping('extends_label', 'extends', 'lbl');\n", "SELECT create_provenance_mapping('coreview_label', 'coreview', 'lbl');" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- A single `label` mapping: the union of every per-tuple label above, so a\n", "-- text-based semiring (sr_formula / sr_why / sr_how) can name the leaves of\n", "-- a query spanning several relations through one mapping argument, rather\n", "-- than picking the relation-specific mapping each time.\n", "DROP TABLE IF EXISTS label CASCADE;\n", "CREATE TABLE label AS\n", " SELECT value, provenance FROM bid_label\n", " UNION ALL SELECT value, provenance FROM expertise_label\n", " UNION ALL SELECT value, provenance FROM topic_of_label\n", " UNION ALL SELECT value, provenance FROM extends_label\n", " UNION ALL SELECT value, provenance FROM coreview_label\n", " UNION ALL SELECT value, provenance FROM recommend_label\n", " UNION ALL SELECT value, provenance FROM champion_label\n", " UNION ALL SELECT value, provenance FROM lead_chair_label\n", " UNION ALL SELECT value, provenance FROM urgent_sub_label\n", " UNION ALL SELECT value, provenance FROM prescreen_label\n", " UNION ALL SELECT value, provenance FROM score_pass_label\n", " UNION ALL SELECT value, provenance FROM flag_pass_label;\n", "CREATE INDEX ON label(provenance);" ], "outputs": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "-- `conf` only fed set_prob and `lbl` only fed create_provenance_mapping;\n", "-- both jobs are now done, so drop the artifact columns from the base tables.\n", "ALTER TABLE bid DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE expertise DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE topic_of DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE extends DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE coreview DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE lead_chair DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE urgent_sub DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE prescreen DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE score_pass DROP COLUMN conf, DROP COLUMN lbl;\n", "ALTER TABLE flag_pass DROP COLUMN conf, DROP COLUMN lbl;" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The script seeds each tuple’s probability and tags every uncertain relation in Studio’s schema panel with a `prov-tid` pill (tuple-independent), or `prov-bid` for the block-correlated `assignment`. The provenance class is the `Boolean` / `Absorptive` / `Semiring` / `Where` toggle (see `studio-query-toggles`): most steps want `Boolean`, where ProvSQL’s safe-query rewriter and data compilers are active; `Semiring` turns them off and shows the literal circuit; `Absorptive` is needed only for the cyclic recursion at the very end.\n", "\n", "## Part A: The Query Is Safe\n", "\n", "A *safe* query is one whose exact probability is PTIME in the data – whatever the data looks like. ProvSQL recognises safety at planning time and answers without any compiler. There are exactly four ways a (self-join-free) query can be safe, and the four steps below are one of each.\n", "\n", "### Safe by shape\n", "\n", "*We need a coverage shortlist: which papers have at least one plausibly qualified reviewer – someone who bid on the paper and has some area of expertise?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM bid b, expertise e, papers p\n", "WHERE b.reviewer = e.reviewer AND b.paper = p.id\n", "GROUP BY p.id, p.title\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Click into `p1`’s `provsql` cell and, in the eval strip, pick `Marginal probability` with the `independent` method. It returns `≈ 0.666` instantly – and *Compiled d-D circuit* with *interpret as d-D*, or *Tree decomposition*, all agree.\n", "\n", "Why it works: the query is **hierarchical** – the atoms mentioning `topic` (just `expertise`) sit inside those mentioning `reviewer` (`bid` and `expertise`) – so a paper’s coverage is an `OR` over reviewers of independent terms, with no tuple shared. The circuit is **read-once**, and `independent` is exact on read-once circuits. This is the easiest corner: safe by the query’s shape alone, no key needed.\n", "\n", "### Safe by a key\n", "\n", "*Now the question that matters for assignment: is paper* `p1` *competently covered – did someone bid on it who is expert in one of* `p1`\\*’s own topics?\\*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer\n", " AND e.topic = t.topic\n", " AND b.paper = 'p1' AND t.paper = 'p1'" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try `independent` with the toggle on `Semiring`: it **errors** – *“Not an independent circuit”*. Switch the toggle to `Boolean` and try again: it succeeds, `≈ 0.4259`, matching *Tree decomposition* exactly.\n", "\n", "The difference is the **key**. This query is non-hierarchical – `bid` mentions only `reviewer`, `topic_of` only `topic`, `expertise` both – so its literal lineage reuses the shared `topic_of(p1, t1)` tuple (Alice, Bob and Judy are all database experts who bid on `p1`) and is *not* read-once. But because `expertise` is keyed on `reviewer`, each reviewer has a single topic, so the safe-query rewriter can group the experts by topic and factor that shared tuple out:\n", "\n", "$$\n", "\\bigvee_{t} \\\\; \\mathit{topic\\\\_of}(p_1, t) \\\\;\\wedge\\\\;\n", " \\Bigl(\\bigvee_{r:\\\\,\\mathit{exp}(r,t)} \\mathit{bid}(r, p_1) \\wedge\n", " \\mathit{exp}(r, t)\\Bigr).\n", "$$\n", "\n", "Each leaf now appears once – read-once again, and `independent` is exact. The lesson: **safety depends on the query and the keys together**. Drop the key (`ALTER TABLE expertise DROP CONSTRAINT expertise_pkey`) and `independent` still returns `0.4259`, but the route changes – the query is now genuinely hard and falls through to Part C’s data compiler. Add it back before continuing:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "ALTER TABLE expertise ADD PRIMARY KEY (reviewer);" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safe by a query-derived order\n", "\n", "For this step the fixture adds **Olga** (`r15`), a prolific reviewer who skimmed a 24-paper batch, with two post-review signals: `recommend` (she recommended a paper) and `champion` (she would champion it at the meeting).\n", "\n", "*Which reviewers have bids that back up both a recommendation and a championing – a sign they engaged deeply?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT r.id, r.name\n", "FROM bid b1, recommend a, bid b2, champion c, reviewers r\n", "WHERE b1.reviewer = a.reviewer AND b1.paper = a.paper\n", " AND b1.reviewer = b2.reviewer\n", " AND b2.reviewer = c.reviewer AND b2.paper = c.paper\n", " AND b1.reviewer = r.id\n", "GROUP BY r.id, r.name" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On Olga’s row, try the heavy methods first: `tree-decomposition` gives up (*“Treewidth greater than 10”*), `possible-worlds` refuses (over 64 inputs), and a real compiler (`d4`) is cut off by `statement_timeout` on the 24-paper instance. Then try `inversion-free` – or just the default method – and it returns `0.975314` in milliseconds.\n", "\n", "Why the gap: grouping on the reviewer makes the two evidence sides share the `bid(r15, *)` tuples, so the lineage is **not** read-once and every circuit-level method treats it as hard. But the query is a *consistent-unification self-join* with a single root variable, which is the **inversion-free** condition: it admits a linear-size OBDD over a variable order *read from the query*. ProvSQL finds that order at planning time (the teal `IF` badge on the root is the certificate) and builds a deep, chain-like d-DNNF – tractability that is invisible in the materialised circuit and visible only in the query.\n", "\n", "### Safe by cancellation (Möbius inversion)\n", "\n", "The fourth corner is the subtlest, and it needs its own little world: an **external-review pool**, isolated from the main instance, where four area chairs (`c1`–`c4`) each run three independent assessment passes – a *prescreen*, a *score* and a *flag* – over four embargoed submissions (`e1`–`e4`), with `lead_chair` marking senior chairs and `urgent_sub` the time-critical submissions.\n", "\n", "*Is the pool in a “well-attended” state?* – a union of four overlapping patterns of who-assessed-what. This is the textbook query $q_9$ / $Q_W$ (Dalvi & Suciu); its four patterns have no tidy English gloss, because its tractability is *purely structural* – which is exactly the point." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT 1 FROM lead_chair r, prescreen a1, flag_pass a3, urgent_sub t3\n", " WHERE r.chair = a1.chair AND a3.sub = t3.sub\n", "UNION\n", "SELECT 1 FROM prescreen b1, score_pass b2, flag_pass b3, urgent_sub tb\n", " WHERE b1.chair = b2.chair AND b1.sub = b2.sub AND b3.sub = tb.sub\n", "UNION\n", "SELECT 1 FROM score_pass c2, flag_pass c3, flag_pass c3b, urgent_sub tc\n", " WHERE c2.chair = c3.chair AND c2.sub = c3.sub AND c3b.sub = tc.sub\n", "UNION\n", "SELECT 1 FROM lead_chair d, prescreen d1, prescreen d1b,\n", " score_pass d2, score_pass d2b, flag_pass d3\n", " WHERE d.chair = d1.chair AND d1b.chair = d2.chair AND d1b.sub = d2.sub\n", " AND d2b.chair = d3.chair AND d2b.sub = d3.sub" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try `Marginal probability` with the toggle on `Boolean`: it returns the exact `0.056923` with no method named (the chooser routes it through the Möbius compiler automatically; once the μ root is rendered you can also pick the `mobius` method explicitly). Try any circuit compiler instead and it blows up: $q_9$ provably has no polynomial OBDD / FBDD / decision-DNNF.\n", "\n", "Why it is nonetheless safe: writing the probability by inclusion-exclusion, the one $\\\\#P$-hard term – the conjunction of all four patterns – gets a **zero Möbius coefficient** and cancels, leaving only easy terms. ProvSQL’s Möbius compiler computes exactly that signed combination. Click the existence row’s `provsql` cell: the circuit is large – the **μ** (Möbius-function) root carries the whole literal lineage as a transparent child, so Studio shows a *Circuit too large* card – choose `Render at depth 1` and the root is that single **μ** gate, each child edge labelled with its integer coefficient, the hard term among them cancelled to zero. (The pool is **dense** on purpose: on sparse data Part C’s compiler would also handle it and hide the point. Like every Part A route, the gate keeps the literal lineage, so `shapley` and `sr_formula` still work on it.)\n", "\n", "### The four routes at a glance\n", "\n", "These four steps are the complete set of exact query-side routes – the Dalvi-Suciu dichotomy made operational:\n", "\n", "| The query is safe because… | ProvSQL route | Witness in this Part |\n", "|----|----|----|\n", "| its **shape** is hierarchical | read-once lineage → `independent` (no rewrite) | coverage, all papers |\n", "| a **key** makes it read-once | safe-query rewrite (FD) | `p1` competent coverage |\n", "| a query-derived **order** | inversion-free certificate | Olga’s bid self-join |\n", "| its $\\\\#P$-hard term **cancels** | Möbius compiler | the review pool ($q_9$) |\n", "\n", "All four are PTIME, need no external tool, and assume tuple-independent inputs. When none applies, the query is genuinely hard (Part B) – unless the data rescues it (Part C). See the full tractability table.\n", "\n", "## Part B: The Query Is Hard\n", "\n", "Ask the competent-coverage question of the **whole program** instead of one paper and every Part A route fails. This Part follows that hard query through knowledge compilation.\n", "\n", "### The hard query, and what a compiler does with it\n", "\n", "*Is* **any** *paper competently covered?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer\n", " AND e.topic = t.topic\n", " AND t.paper = b.paper" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set the toggle to `Semiring` so no Boolean shortcut fires, and the provenance is the literal circuit of the cyclic join – Studio draws it, visibly bushy. Try `independent`: it **errors**. Try `tree-decomposition` or a compiler: `≈ 0.8818`.\n", "\n", "Why it is hard: with `paper` free, `reviewer`, `paper` and `topic` form a cycle with no nesting – non-hierarchical, not inversion-free, and a single conjunct so nothing cancels. All of Part A is exhausted, and the probability is $\\\\#P$-hard. The circuit’s treewidth is **4** (against **1** for a safe query), so a real compiler (*Compiled d-D circuit* with `d4`) turns it into a d-DNNF of order a thousand nodes – and the number `0.8818`.\n", "\n", "That compiled circuit is the object the rest of Part B inspects.\n", "\n", "### Reading the CNF back against the data\n", "\n", "Pick *Tseytin CNF*: the panel shows the DIMACS CNF ProvSQL streams to an external compiler, with one `c input` comment per variable. Studio annotates each with the source tuple it stands for, so a model or weighted count returned by an outside tool reads back against the reviewing data (the same mapping is a table through [`tseytin_cnf_mapping`](https://provsql.org/doxygen-sql/html/group__provenance__output.html#ga22a56449dce072ad0bdabbb526096cd6)).\n", "\n", "### Comparing compilers and methods\n", "\n", "Pick *Probability benchmark*: it times every probability method on the hard circuit, one row each, with the compiled d-DNNF size beside the run time. Observe the spread: the exact methods that finish (`tree-decomposition`, the compilers, the model counters) all agree to full precision; `monte-carlo` lands in its confidence band; `independent` shows its error; and methods that do not scale – `possible-worlds` enumeration – hit the `statement_timeout`. [`ddnnf_stats`](https://provsql.org/doxygen-sql/html/group__provenance__output.html#gaf04888a5ff2ef7668754b5a4a2c1bf3a) exposes the same sizes as `jsonb` for comparing one circuit across compilers.\n", "\n", "Which compilers appear depends on what is installed; the **Tools** panel (the wrench icon) lists the registry and flags each as available / enabled, read live from `provsql.tools` (see [the external-tool registry](https://provsql.org/docs/user/tool-registry.html)).\n", "\n", "### A shortcut that skips the compiler\n", "\n", "Not every hard-looking query reaches a compiler. *Which papers have at least two bidding experts?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM bid b, expertise e, papers p\n", "WHERE b.reviewer = e.reviewer AND b.paper = p.id\n", "GROUP BY p.id, p.title\n", "HAVING count(*) >= 2\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the probability for `p1`: it comes back at once, and ProvSQL emits a NOTICE that the `count(*) >= 2` comparison gate was *shortcut*. The `HAVING` threshold over independent contributors is a [Poisson-binomial](https://en.wikipedia.org/wiki/Poisson_binomial_distribution) distribution, which [`probability_evaluate`](https://provsql.org/doxygen-sql/html/group__probability.html#gad377e94cb1fff57141b1950cc4169c5e) folds in closed form – replacing the whole provenance with one Bernoulli gate before any compiler runs, so even `independent` answers it.\n", "\n", "## Part C: The Data Is Well-Structured\n", "\n", "Part B’s whole-program coverage is $\\\\#P$-hard in its *shape* – yet on *this* data it is tractable. When no query-side route applies, ProvSQL still avoids a compiler if the **data** is well-structured: it compiles a certified circuit along a tree decomposition of the data itself, exactly and in linear time – over independent data, correlated data, and through recursion.\n", "\n", "### The same hard query, now easy\n", "\n", "Set the toggle back to `Boolean` and re-run Part B’s hard query:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer AND e.topic = t.topic AND t.paper = b.paper" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now `independent` **succeeds**, the same `≈ 0.8818` – and no compiler ran. The query is still $\\\\#P$-hard in shape, but on this data the **joint treewidth** (the data graph together with its correlations) is bounded, so ProvSQL’s **joint-width compiler** Amarilli2016thesis recognises the shape and compiles the provenance along a tree decomposition of the *data* into a certified d-D that `independent` reads in linear time. This is the route the dropped-key query of Part A fell through to.\n", "\n", "Read it per paper – *for each paper, how competently is it covered?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT t.paper\n", "FROM bid b, expertise e, topic_of t\n", "WHERE b.reviewer = e.reviewer AND e.topic = t.topic AND t.paper = b.paper\n", "GROUP BY t.paper ORDER BY t.paper" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Marginal probability` gives each group’s value – `p1` `0.425869`, `p4` `0.300776` – matching a compiler to full precision, from the data’s structure with no external tool.\n", "\n", "### Correlated inputs\n", "\n", "So far every relation was tuple-independent. The `assignment` table lists, per reviewer, the papers they *could* be assigned to, made mutually exclusive by `repair_key` on `reviewer` (each reviewer ends up on one paper). *Which papers get an assigned reviewer?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT p.id, p.title\n", "FROM assignment a JOIN papers p ON a.paper = p.id\n", "GROUP BY p.id, p.title\n", "ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try `independent` – it **agrees** with the exact methods (`p1` `0.875`, `p2` `0.75`). The circuit encodes the mutual exclusion as `mulinput` gates sharing a block key, and `independent`’s evaluator gives mutually exclusive siblings special treatment (it *sums* their probabilities within a block instead of multiplying). So this kind of block correlation, unlike the cycle of Part B, stays tractable with no compiler at all – because the *query* here is safe; only the *inputs* are correlated.\n", "\n", "### Hard *and* correlated\n", "\n", "The joint-width route earns its keep where the two regimes meet. *Is any paper covered by its assigned expert reviewer?* – Part B’s hard cyclic shape, now over the correlated `assignment` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "SELECT DISTINCT 1\n", "FROM assignment a, expertise e, topic_of t\n", "WHERE a.reviewer = e.reviewer AND e.topic = t.topic AND t.paper = a.paper" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Try `independent` with the toggle on `Semiring` (the literal circuit): it **rejects** it (a reviewer’s candidate papers are mutually exclusive, so the lineage is neither independent nor read-once), and Part A’s routes do not apply to the cyclic shape. Switch the toggle back to `Boolean` and the joint-width route compiles it anyway: the joint treewidth – data graph *plus* the `repair_key` exclusion blocks – is bounded, so ProvSQL builds a certified d-D (each block stick-broken into shared independent events) that `independent` evaluates to the exact `0.735868`. This is the one cell of the tractability table nothing else fills: $\\\\#P$-hard *and* correlated, exact and linear in the data.\n", "\n", "### Recursion: reachability and reliability\n", "\n", "**Note:**\n", "\n", "> Recursive queries (`WITH RECURSIVE`) require **PostgreSQL ≥ 15**.\n", "\n", "ProvSQL tracks provenance through `WITH RECURSIVE` too: a reachability answer’s provenance is the disjunction over the paths that reach it. Two more relations exercise the two regimes: `extends(citing, cited)` is an **acyclic** citation graph, `coreview(a, b)` a **symmetric** (hence cyclic) collaboration graph.\n", "\n", "*What does paper* `p6` *transitively build on?*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": {} }, "source": [ "WITH RECURSIVE anc(paper) AS (\n", " SELECT 'p6'\n", " UNION\n", " SELECT e.cited FROM extends e JOIN anc a ON e.citing = a.paper\n", ")\n", "SELECT p.id, p.title\n", "FROM anc JOIN papers p ON anc.paper = p.id\n", "WHERE anc.paper <> 'p6' ORDER BY p.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`sr_formula` shows each ancestor’s lineage as the conjunction along the path (`p1` is `ext(p4,p1) ⊗ ext(p6,p4)`) and `Marginal\n", "probability` gives its value (`p1` `0.72`). Because the graph is acyclic the lineage is read-once per ancestor, so *every* semiring evaluation works, not just probability.\n", "\n", "*Who is reviewer* `r1` *connected to through co-reviewing?* – now a cyclic walk:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "provsql": { "scheme": "absorptive" } }, "source": [ "WITH RECURSIVE conn(node) AS (\n", " SELECT 'r1'\n", " UNION\n", " SELECT e.b FROM coreview e JOIN conn c ON e.a = c.node\n", ")\n", "SELECT r.id, r.name\n", "FROM conn JOIN reviewers r ON conn.node = r.id\n", "WHERE conn.node <> 'r1' ORDER BY r.id" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the toggle on `Semiring` (the default) the fixpoint never stabilises – *“no fixpoint after 1000 rounds (cyclic data?)”* – because a cycle keeps producing new derivations. Switch the toggle to `Absorptive` and it converges: $1 \\oplus a = 1$, so a longer cycle-revisiting path is absorbed by the shorter one inside it, and the fixpoint is the set of minimal paths.\n", "\n", "The probability is then **two-terminal network reliability** – that `r1` stays connected when each edge is present independently – which is $\\\\#P$-hard in general. Yet `Marginal probability` reads `r1` reaches `r5` with reliability `0.5496` straight off, and even `independent` evaluates it. Why: ProvSQL recognises the reachability shape and, by the provenance form of Courcelle’s theorem, compiles along a tree decomposition of the **collaboration graph itself** into a certified d-D, one per reachable vertex, linear in the edges when the graph has bounded treewidth (see `network-reliability-btw`). That is the case study in miniature: Part B’s hardness lived in the *query* and needed a compiler; here – as throughout Part C – it is dissolved by the structure of the *data*, with no external tool. (The `'absorptive'` marker on these tokens makes multiplicity-counting or why-provenance – genuinely infinite on cycles – refuse rather than return an unjustified value.)\n", "\n", "**See also:**\n", "\n", "> - [The knowledge-compilation chapter](https://provsql.org/docs/user/knowledge-compilation.html) for the full pipeline and every function used here.\n", "> - [The chapter on probabilities](https://provsql.org/docs/user/probabilities.html) for the probability methods and the `tractability table\n", "> ` that organises this case study.\n", "> - The *Probabilistic Databases* synthesis lecture for the theory of safe queries and the dichotomy." ] } ] }