Probabilities ============= ProvSQL computes the probability that a query answer holds in a *probabilistic database* :cite:`DBLP:conf/edbtw/GreenT06`: a database in which every input provenance token carries an independent probability of being present, from which ProvSQL derives the marginal probability of each query answer. Beyond independent inputs it also models **correlated** inputs -- block-independent databases through :sqlfunc:`repair_key` -- and a **continuous** tier, where columns of type ``random_variable`` carry distributions rather than scalars. This chapter starts with the everyday workflow -- assigning input probabilities, evaluating a query, aggregates -- and then turns to reference material: when exact evaluation is tractable, the specialised compilers for hard queries, the explicit method catalogue, and the performance optimisations. Setting input probabilities --------------------------- Assign a probability to each input tuple's provenance token using :sqlfunc:`set_prob`: .. code-block:: postgresql SELECT set_prob(provenance(), 0.8) FROM mytable WHERE id = 1; Or in bulk, from a column of the table itself: .. code-block:: postgresql SELECT set_prob(provenance(), reliability) FROM sightings; Probabilities must be in the range ``[0, 1]``. To read back a stored probability with :sqlfunc:`get_prob`: .. code-block:: postgresql SELECT get_prob(provenance()) FROM mytable; Correlated and block-independent inputs --------------------------------------- By default ProvSQL assigns a fresh, independent provenance token to each base tuple, so correlations between tuples are not modelled. To model correlated probabilities, derive them explicitly with queries: the resulting tables carry correlated tokens. A common case is a *block-independent database* (BID), where tuples are grouped into mutually-exclusive blocks (exactly one tuple per block is assumed to be true). :sqlfunc:`repair_key` sets up provenance to enforce this mutual exclusivity: it takes a table and a key attribute, and makes each group of tuples sharing the same key value into mutually-exclusive alternatives, with the blocks independent of one another. Call :sqlfunc:`repair_key` directly on a table without provenance, as in the example below; it adds the ``provsql`` column itself and is used *instead of* :sqlfunc:`add_provenance`, not after it. .. code-block:: postgresql CREATE TABLE weather(context VARCHAR, weather VARCHAR, ground VARCHAR, p FLOAT); INSERT INTO weather VALUES ('day1', 'rain', 'wet', 0.35), ('day1', 'rain', 'dry', 0.05), ('day1', 'no rain', 'wet', 0.10), ('day1', 'no rain', 'dry', 0.50); -- Make tuples with the same context mutually exclusive SELECT repair_key('weather', 'context'); -- Assign probabilities and evaluate SELECT set_prob(provenance(), p) FROM weather; SELECT ground, ROUND(probability_evaluate(provenance())::numeric, 3) AS prob FROM (SELECT ground FROM weather GROUP BY ground) t; Computing query probabilities ----------------------------- Use :sqlfunc:`probability_evaluate` to evaluate the probability that a query result holds, given the assigned input probabilities: .. code-block:: postgresql SELECT person, probability_evaluate(provenance()) AS prob FROM suspects; With no further argument it returns the **exact** probability. An optional second argument names a computation method and a third passes method-specific parameters (a comma-separated ``key=value`` list, the keys depending on the method; each method also keeps a historical shorthand, a bare sample count or a ``delta;epsilon`` pair). You rarely need them: see :ref:`Choosing a guarantee ` just below, and the full catalogue under :ref:`forcing-a-method`. ProvSQL Studio's :ref:`evaluation strip ` exposes :sqlfunc:`probability_evaluate` interactively, with method and argument selectors. .. _probability-guarantees: Choosing a guarantee, not a method ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **In practice you do not pick a method.** Ask for the *guarantee* you want and let ProvSQL choose how to compute it: - **exact** -- the default: ``probability_evaluate(provenance())`` returns the true probability. - **relative** ``(ε, δ)`` -- ``probability_evaluate(provenance(), 'relative', 'epsilon=0.05,delta=0.01')``: the estimate is within a factor ``1 ± ε`` of the true value with probability ``1 − δ``. The right choice for **rare events** (small probabilities), where an absolute error bound would be meaningless. - **additive** ``(ε, δ)`` -- ``probability_evaluate(provenance(), 'additive', 'epsilon=0.05,delta=0.01')``: the estimate is within ``ε`` of the true value (absolute) with probability ``1 − δ``. A cost-based chooser then picks and runs the cheapest method that meets your request, per query. Three things make this safe to rely on: - The tolerances **nest** (exact ⊂ relative ⊂ additive), so a ``relative`` or ``additive`` request still returns the **exact** value whenever an exact method is cheapest ("exact when cheaper") -- you never pay for approximation you did not need. - The cost of a few methods is hard to predict from the circuit alone, so the chooser runs each optimistic pick under a **budget and escalates automatically** if it turns out slow -- a pathological circuit never hangs on the wrong method. - A ``δ = 0`` (no-failure) approximate request is honoured by a *deterministic* method, not a sampler. Naming a method explicitly is therefore an **escape hatch** -- for forcing a specific algorithm, for ``EXPLAIN``-style understanding, or for the rare case where you know your circuits better than the cost model. The full catalogue, with a summary table of where each method shines, is under :ref:`forcing-a-method`; most users can skip it. Quick bounds without exact evaluation ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When only a coarse estimate is needed, :sqlfunc:`probability_bounds` returns cheap lower and upper bounds on the marginal probability of a monotone-DNF token (as ``OUT`` parameters ``lower`` / ``upper``), without the cost of exact compilation: .. code-block:: postgresql SELECT person, (probability_bounds(provenance())).* FROM suspects; Aggregates: expected values and HAVING -------------------------------------- Expected values of aggregates ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For aggregate queries over a probabilistic table, the :sqlfunc:`expected` function computes the expected value of the aggregate result. It supports ``COUNT``, ``SUM``, ``MIN``, and ``MAX``: .. code-block:: postgresql SELECT dept, expected(COUNT(*)) AS expected_count, expected(SUM(salary)) AS expected_salary FROM employees GROUP BY dept; An optional second argument specifies a provenance condition for computing a *conditional* expectation E[aggregate | condition]. For instance, to compute the expected count within each group conditioned on the group existing (i.e., its provenance being true): .. code-block:: postgresql SELECT dept, expected(COUNT(*), provenance()) AS conditional_count FROM employees GROUP BY dept; Without the second argument, the expectation is unconditional. With it, the result is normalized by the probability of the condition. This :sqlfunc:`expected` ``(aggregate, condition)`` form is the aggregate-specific spelling of the conditioning operator ``|``; see :doc:`conditioning` for the uniform ``A | B`` ("``A`` given ``B``") operator across discrete events, random variables, and aggregates. HAVING with probabilities ^^^^^^^^^^^^^^^^^^^^^^^^^^ ``HAVING`` clauses are supported in the probabilistic setting. The following aggregate functions in ``HAVING`` are handled: ``COUNT``, ``SUM``, ``AVG``, ``MIN``, ``MAX``: .. code-block:: postgresql SELECT dept, probability_evaluate(provenance()) FROM employees GROUP BY dept HAVING COUNT(*) > 2; Arithmetic over these aggregates in ``HAVING`` (including comparisons between two aggregates and integer-division thresholds) is also handled; see :doc:`the aggregation chapter ` for the supported forms and their semantics. For the common ``COUNT`` / ``MIN`` / ``MAX`` / ``SUM`` thresholds a closed-form shortcut keeps the exact call fast; see :ref:`having-shortcuts`. A ``SUM`` (or the ``AVG`` that reduces to one) whose possible values span a very large range cannot be evaluated exactly in reasonable time (the problem is *pseudo*-polynomial: exact cost grows with the magnitude of the values; ``MIN`` and ``MAX`` have a magnitude-independent closed form and are not affected). In that case ask for an approximate answer instead -- a ``relative`` or ``additive`` guarantee: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'relative', 'epsilon=0.05,delta=0.01') FROM orders GROUP BY region HAVING sum(amount_cents) > 100000000; Continuous random variables --------------------------- The discrete-Bernoulli setting above can be combined with a continuous tier: columns of type ``random_variable`` carry distributions (Normal, Uniform, Exponential, Erlang, Categorical, Mixture) rather than scalars, and ``WHERE`` predicates on these columns are rewritten into conditioning events on the row's provenance. Evaluation routes through Monte Carlo by default, with a hybrid evaluator falling back to analytical closed forms where applicable (RangeCheck for support-decidable comparators, exact CDFs for single-distribution ``gate_cmp``, family-closure simplification for linear combinations of normals…). See :doc:`continuous-distributions` for the full surface. .. _tractable-cases: When is exact evaluation tractable? ----------------------------------- Computing the exact probability is :math:`\#P`-hard in general :cite:`DBLP:journals/vldb/DalviS07`, but several structural restrictions make it tractable -- and ProvSQL recognises each and routes to a dedicated mechanism rather than a general-purpose counter. Each row below is a *sufficient* condition for tractability, classified by the shape of the **data**, of its probabilistic **annotation** (TID = tuple-independent, BID = block-independent-disjoint, *correlated* = arbitrary, e.g. view-derived), and of the **query**. The planner-time rewrites and the cost-based chooser apply whichever fits. The query conditions are stated over classes of the relational calculus -- `conjunctive queries `__ (CQ) and unions of them (UCQ) -- which ProvSQL recognises from the structure of ordinary SQL queries. All complexities are **data complexity**: the query is fixed, so its size is not counted. :math:`|D|` is the input size (number of tuples), :math:`k` the treewidth relevant to each row (lineage, data, or joint treewidth), and :math:`e` the number of essential query variables. +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | Data | Annotation | Query | Complexity | Source | ProvSQL mechanism | +=============+============+========================+=========================+=========================================+==========================================================+ | any | TID / BID | hierarchical, | :math:`\Theta(|D|)` | :cite:`DBLP:journals/vldb/DalviS07` | :ref:`safe-query rewrite `, then | | | | **self-join-free** CQ | | :cite:`DBLP:journals/jacm/DalviS12` | ``independent`` | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | any | TID | inversion-free UCQ | :math:`O(|D|)` | :cite:`DBLP:conf/icdt/JhaS11` | :ref:`inversion-free certification | | | | (self-joins allowed) | | | `, then ``inversion-free`` | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | any | TID | safe UCQ needing | | :cite:`DBLP:journals/jacm/DalviS12` | :ref:`Möbius compiler `, then | | | | Möbius inversion | :math:`O(|D|^e)` | | the signed Möbius sweep over ``independent`` | | | | (self-join-free) | | | islands | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | any query whose **lineage** over this data and | :math:`2^{O(k)}\,|D|` | :cite:`DBLP:journals/mst/AmarilliCMS20` | in-process :ref:`tree-decomposition | | annotations has treewidth ≤ k | | | ` method | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | treewidth ≤ | TID / BID | recursive reachability | :math:`2^{O(k^2)}\,|D|` | :cite:`DBLP:conf/icalp/AmarilliBS15` | :ref:`reachability compiler `, | | k (treelike)| | | | | then ``independent`` | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ | joint treewidth ≤ k of | any UCQ | :math:`2^{O(k^e)}\,|D|` | :cite:`Amarilli2016thesis` (§4.2) | :ref:`joint-width compiler `, then | | the data and its | | | | ``independent`` | | annotation | | | | | +-------------+------------+------------------------+-------------------------+-----------------------------------------+----------------------------------------------------------+ For the exact guarantee, the :ref:`cost-based chooser ` always tries ``independent`` and ``inversion-free`` when their certificate applies (they are cheap and read-once-friendly), and tries ``tree-decomposition`` when it estimates the lineage treewidth low enough to stand a chance. Outside these sufficient conditions, when the lineage is genuinely :math:`\#P`-hard with no structure to exploit, no exact polynomial guarantee remains, and ProvSQL falls back to knowledge compilation (``compilation`` / ``wmc``) for an exact answer or to an FPRAS (``monte-carlo`` / ``karp-luby``) for an approximate one (see :ref:`forcing-a-method`). Specialized routes for hard queries ----------------------------------- For two query families ProvSQL does not evaluate the provenance circuit built along the relational plan at all: it compiles a certified circuit **along a tree decomposition of the data itself**, turning a :math:`\#P`-hard problem into one linear in the data. Both are exact and need no external tool. .. _network-reliability-btw: Network reliability on bounded-treewidth graphs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The first is *two-terminal network reliability*: the probability that a vertex is reachable from a source in a probabilistic graph, following the provenance refinement of Courcelle's theorem :cite:`DBLP:conf/icalp/AmarilliBS15`. This problem is :math:`\#P`-hard in general, but becomes solvable in time *linear in the number of edges* when the graph has bounded treewidth – a property of many real networks (series-parallel and outerplanar networks, transit and utility networks, workflow graphs…). The interface is an ordinary recursive reachability query. Under ``provsql.provenance = 'absorptive'`` or ``'boolean'`` (the compiled circuit is the exact Boolean function of the lineage but only the *absorptive quotient* of the infinite recursive semiring provenance, so it lives in the same regime that already governs recursion on cyclic data; see :ref:`provsql-provenance-class`), the query rewriter recognises the shape .. code-block:: postgresql SET provsql.provenance = 'absorptive'; WITH RECURSIVE reach(node) AS ( SELECT 1 -- the source vertex UNION SELECT e.dst FROM link e JOIN reach r ON e.src = r.node ) SELECT node, probability_evaluate(provenance()) FROM reach WHERE node = 42; over a provenance-tracked base relation ``link`` whose tuples carry probabilities, and compiles – along a tree decomposition of the edge graph – one provenance circuit per reachable vertex, in linear total size. Cyclic graphs are handled natively and the computation is exact; vertex columns of any type work (values are compared as text). Two variations of the shape are recognised as well. *Undirected connectivity* is the natural symmetric traversal: .. code-block:: postgresql WITH RECURSIVE reach(node) AS ( SELECT 1 UNION SELECT CASE WHEN e.src = r.node THEN e.dst ELSE e.src END FROM link e JOIN reach r ON r.node IN (e.src, e.dst) ) SELECT node, probability_evaluate(provenance()) FROM reach; and *deterministic edge filters* – a ``WHERE`` clause over the edge relation's columns alone – restrict which edges participate: .. code-block:: postgresql ... SELECT e.dst FROM link e JOIN reach r ON e.src = r.node WHERE e.capacity >= 10 ... The base arm may also be a relation, ``SELECT v FROM sources`` – a *source set*. When ``sources`` is itself provenance-tracked, each source participates with its tuple's probability (a probabilistic source set: "reachable from some present source"); an untracked relation gives certain sources. A :sqlfunc:`repair_key` source relation is rejected (its tuples are block-correlated, not an independent source set) and the query falls back. Edge relations prepared with :sqlfunc:`repair_key` work too: a block of mutually exclusive alternative edges (say, an uncertain road whose true endpoint is one of several candidates) compiles as a single (k+1)-way deterministic branching, preserving the block-independent semantics exactly. The recursive arm may even join a *derived* edge relation – a subquery or view over several tracked tables. Each derived edge then participates as a compound event (the conjunction of its base tuples), accepted when the derived edges' supports are pairwise disjoint – e.g. a one-to-one join; edges sharing a base tuple are correlated, and the query falls back to the generic evaluation. *Bounded-hop reachability* is recognised as well: a hop-counting CTE whose counter column is seeded by an integer constant, incremented in the recursive arm, and bounded by a (mandatory) ``WHERE`` qual: .. code-block:: postgresql WITH RECURSIVE reach(node, hops) AS ( SELECT 1, 0 UNION SELECT e.dst, r.hops + 1 FROM link e JOIN reach r ON e.src = r.node WHERE r.hops < 4 ) SELECT node, hops, probability_evaluate(provenance()) FROM reach; Row ``(v, h)`` carries the provenance of "some *walk* of exactly ``h`` edges connects the source to ``v``" – walks, not simple paths, matching the recursive fixpoint's semantics: a cycle on the way pumps the achievable lengths, and the compilation (whose states refine from reachability relations to sets of achievable walk lengths) accounts for that exactly, on cyclic data too. Both ``<`` and ``<=`` bounds, either column order, any integer seed, and the undirected, filtered, multi-source and ``repair_key`` variants compose with the counter. The natural follow-up, "which nodes are *within* k hops", obtained by deduplicating the hop column away: .. code-block:: postgresql ... SELECT node FROM reach GROUP BY node; stays on the fast route: the OR of a vertex's per-length tokens is correlated (lengths share edges), but the compilation pre-creates, at the very gate address this deduplication computes, a certified equivalent built from its native within-bound circuit, so :sqlfunc:`probability_evaluate` still settles on the linear exact method. *Cross-vertex aggregations* of a reachability CTE are recognised as well: grouping the reachable vertices by a column of a joined (untracked) member relation: .. code-block:: postgresql ... SELECT t.region FROM reach r JOIN regions t ON r.node = t.node GROUP BY t.region; collapses each group's per-vertex tokens into an OR of *correlated* events (the vertices share edges). The route compiles, per group, the certified circuit of "some member vertex is reachable" (the set-reachability bit folded through the same decomposition DP) and plants it at the gate address the aggregation computes, so the per-region reliability evaluates through the linear certified route. All the groups share one compilation: the tree decomposition and variable analysis are built once, one cheap sweep runs per group, and the parts of the per-group circuits the group's members do not influence come out as the *same* gates (content-deduplicated emission), materialised once. The ``SELECT DISTINCT`` spelling of the same aggregation (``SELECT DISTINCT t.region FROM ...`` with no ``GROUP BY``) is provenance-identical and recognised too; a deterministic filter on the member relation's own columns (``WHERE t.kind = 'hospital'``) is allowed -- it restricts which members each group counts, exactly as an edge-column filter restricts the edges, and is pushed into the member gathering. A tracked member relation, a filter that touches the recursive side, or any other deviation from the join-and-group-by-one-column shape simply skips the planting (the generic evaluation is always available). *K-terminal conjunctions* close the family: a self-join of the CTE with one constant node binding per reference .. code-block:: postgresql ... SELECT 'all supplied' FROM reach r1, reach r2, reach r3 WHERE r1.node = 5 AND r2.node = 6 AND r3.node = 9; asks "are these vertices *all* reachable", and its row provenance is the product of the correlated per-vertex tokens. The route compiles the certified all-members-reachable circuit (a richer congruence: each forgotten terminal pends on the boundary vertices that reach it, the pending sets folding through the same DP) and plants it at the address the conjunction computes, so the query evaluates to the **k-terminal reliability** through the linear certified route -- with joint-worlds semantics: under nonnegative min-plus (see :doc:`semirings`) the same token prices the cheapest covering subgraph, the **directed Steiner cost**, shared edges paid once where the raw product would pay them once per terminal. The emitted circuits are *deterministic and decomposable by construction* (**d-Ds** -- deterministic and decomposable, but not in negation normal form, so not d-DNNFs), and each ``plus`` / ``times`` gate carries a persisted **certificate** of that property (readable with :sqlfunc:`get_infos`). Downstream, the certificate is what makes the tokens cheap: :sqlfunc:`probability_evaluate`'s cost-based chooser settles on the linear exact ``independent`` method (which trusts certified gates the way it trusts read-once structure), and the d-D artefact surface – ``interpret-as-dd`` compilation, :sqlfunc:`ddnnf_stats`, :sqlfunc:`shapley` and :sqlfunc:`banzhaf` – works on them without external compilers. Shapley values of the edge tuples give a principled *edge criticality* analysis of the network: .. code-block:: postgresql SELECT src, dst, shapley(reach_token, provenance()) AS criticality FROM link; The same certified circuits evaluate exactly in every **absorptive semiring**, not just under probability: the deterministic world enumeration surfaces every minimal derivation support – every path – and absorption (:math:`1 \oplus a = 1`) erases the rest, so the value is the image of the absorptive provenance of the recursive query :cite:`DBLP:conf/icdt/DeutchMRT14`. In the nonnegative min-plus semiring this gives **exact min-cost reachability** – single-source shortest distances, on cyclic data too, in time linear in the circuit: .. code-block:: postgresql SELECT node, sr_tropical(provenance(), 'cost_mapping', nonnegative => true) AS min_cost FROM reach; The bounded-hop variant prices walks under a hop budget (a constrained shortest path that plain Dijkstra does not answer directly), and the cross-vertex aggregation gives per-region minima. The other absorptive semirings read the same tokens: the most-reliable path (:sqlfunc:`sr_viterbi`), the widest path (:sqlfunc:`sr_maxmin` over a capacity enum), fuzzy best paths (:sqlfunc:`sr_lukasiewicz`), and *temporal reachability* – when each edge carries a validity multirange, :sqlfunc:`sr_temporal` returns exactly the instants at which the vertex is reachable (see :doc:`temporal`). To keep the unsound evaluations out, the materialised tokens carry the ``'absorptive'`` assumption marker (:sqlfunc:`get_gate_type` reports the root as ``assumed``): counting and why-provenance – genuinely infinite on cyclic recursion – refuse loudly instead of returning a silently wrong value, while probability and the absorptive semirings (see :doc:`semirings`) pass through. When the route cannot apply – the data treewidth exceeds the supported limit (the same cap as the ``tree-decomposition`` method, here applied to the *data* treewidth, which is exactly the tractability assumption), the edge tuples are not independent base tuples, or the CTE deviates from the recognised shape – the query silently falls back to the generic recursive-fixpoint evaluation, preserving its behaviour exactly; set ``provsql.verbose_level`` to at least 10 to get a notice when the fallback fires, or 20 to confirm the compiled route. On a 2×n ladder network (treewidth 2), the integrated route answers exactly over 1,500 probabilistic edges in under 200 ms end to end, and the columnar form compiles 300,000 edges in seconds – where evaluating the equivalent recursive query's lineage crosses the circuit-treewidth cap at a few dozen edges, and the cyclic/undirected case exceeds minutes already at thirty edges. .. _bounded-joint-width: Bounded joint width: hard UCQs over correlated data ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The query-side dichotomies – safe-query rewriting and the ``inversion-free`` class – make a *self-join-free hierarchical* or *inversion-free* query tractable, but only over **tuple-independent** inputs, and they give up on the genuinely :math:`\#P`-hard queries: the textbook one is :math:`H_0 = R(x), S(x, y), T(y)`, and behind it the whole hard family :math:`H_k`. ProvSQL evaluates these **exactly** when a different parameter is small – the **joint width**: the treewidth of the data graph *together with* its correlation structure, not of either alone :cite:`Amarilli2016thesis` (§4.2). There are instances whose data graph and whose lineage circuit are *both* of small treewidth yet whose joint width – and hardness – is large, so the bound has to be taken on the joint object (thesis Prop. 4.2.11); when it *is* bounded, the probability is linear in the data, even though the query is :math:`\#P`-hard and the inputs are arbitrarily correlated. Like the reachability route above, the compilation is **data-side**: it runs along a tree decomposition of the data – for correlated inputs, of the data together with the slice of the provenance circuit that carries the correlations – emitting a certified **d-D** by construction, with no external compiler and no knowledge-compilation step. The route is part of the Boolean machinery, so it takes the same opt-in as :ref:`safe-query rewriting `: the ``'boolean'`` provenance class, off by default. Within that class it fires **automatically** – when a conjunctive query the safe-query rewriter declined (an unsafe / :math:`\#P`-hard UCQ) has its *existence* formed (a ``SELECT DISTINCT`` or a ``GROUP BY``), the planner recognises the shape and replaces its provenance with the joint-width compiler's certified d-D, so ``probability_evaluate(provenance())`` returns the exact marginal with no method named: .. code-block:: postgresql SET provsql.provenance = 'boolean'; -- H0 = R(x), S(x, y), T(y): #P-hard, evaluated here per group SELECT t.id, probability_evaluate(provenance()) FROM r, s, t WHERE r.x = s.x AND s.y = t.y GROUP BY t.id; (The ``provsql.joint_width`` GUC, on by default, is only a debug switch to turn the recognition off and compare against the literal circuit.) A ``GROUP BY`` is compiled in a **single pass**: the facts are gathered once, the joint graph is decomposed once, and one bottom-up sweep emits one d-D per group, so a query with many answer groups still pays a single gather + decomposition + sweep. Because the bound is on the joint object, the route stays exact where every query-side method is inapplicable: over **correlated** inputs (:sqlfunc:`repair_key` blocks, view-derived lineage), the one cell of the :ref:`tractability table ` that nothing else fills. When the joint width exceeds the supported cap, or the query is not a recognised UCQ-existence shape, the substitution simply does not fire and the query falls back to the literal circuit and the general chooser; set ``provsql.verbose_level`` to confirm which route ran. :doc:`Case Study 7 `, Step 9, walks a worked example over both independent and :sqlfunc:`repair_key`-correlated reviewing data. .. _forcing-a-method: Forcing a specific method ------------------------- Normally you request a guarantee and the cost-based chooser (:ref:`probability-guarantees`) selects among the methods below. You can also name one explicitly as the second argument of :sqlfunc:`probability_evaluate` -- to force an algorithm, to understand a plan, or when you know your circuits better than the cost model. This table summarises where each shines: .. list-table:: Where each method shines (the chooser picks for you) :header-rows: 1 :widths: 22 18 60 * - Method - Guarantee - Best when (query / provenance circuit) * - ``independent`` - exact - Read-once lineage (self-join-free / hierarchical CQs, each input tuple used at most once) and certified d-D circuits (from the safe-query, reachability and joint-width compilers). Linear time. * - ``inversion-free`` - exact - Safe (inversion-free) UCQs the planner certifies -- linear-time via a structured d-DNNF even with self-joins. * - ``mobius`` - exact - Safe UCQs that are tractable *only* because the :math:`\#P`-hard terms of their inclusion-exclusion expansion cancel (:ref:`Möbius inversion `, the :math:`q_9` / :math:`Q_W` class). Applies to a ``gate_mobius``-rooted token; a linear signed sweep over certified-independent islands. * - ``possible-worlds`` - exact - Very few input tuples (a couple of dozen at most): brute force over all ``2^N`` worlds. * - ``sieve`` - exact - Few clauses: a small monotone-DNF lineage (inclusion-exclusion). * - ``tree-decomposition`` - exact - Low-treewidth lineage -- path-, cycle- or band-shaped join graphs; no external tool needed. * - ``d-tree`` - exact / certified bounds - High-treewidth circuits where ``tree-decomposition`` bails; and the **deterministic** approximate corner -- it returns a certified interval, so it serves a ``δ = 0`` request. * - ``compilation`` (``d4`` / ``c2d`` / …) - exact - Hard lineage with hidden structure a knowledge compiler can exploit; last-resort, needs an external tool (see :doc:`knowledge-compilation`). * - ``wmc`` (``ganak`` / ``sharpsat-td`` / ``dpmc`` / ``weightmc``) - depends on tool - Hard lineage better suited to a weighted model counter than to a d-DNNF compiler; an alternative external-tool route to ``compilation``. Exact for ``ganak`` / ``sharpsat-td`` / ``dpmc``; ``weightmc`` is an approximate ``(ε, δ)`` counter. * - ``monte-carlo`` - additive ``(ε, δ)`` - Any circuit; cheap when the probability is not tiny. * - ``karp-luby`` - relative ``(ε, δ)`` - Rare events (small ``p``) over a DNF, where additive error is uninformative. * - ``stopping-rule`` - relative ``(ε, δ)`` - A universal relative estimator for any circuit -- including random-variable and HAVING-aggregate lineage. Each method in detail: ``'independent'`` Exact computation by a single linear pass that treats each gate as independent. It is correct on **read-once** lineage (each input tuple used at most once) and on **certified d-D circuits** -- the deterministic-and-decomposable circuits the safe-query, reachability and joint-width compilers emit, whose ``plus`` / ``times`` gates carry a certificate of that property which the method trusts the same way it trusts read-once structure. It errors on a circuit that is neither: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'independent') FROM suspects; ``'possible-worlds'`` Exact computation by exhaustive enumeration of all possible worlds. Exponential in the number of provenance tokens; practical only for small circuits: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'possible-worlds') FROM suspects; ``'sieve'`` Exact computation by inclusion-exclusion over the clauses of a monotone-DNF lineage, in time ``O(S × 2^m)`` for ``m`` clauses. The chooser prefers it over ``'possible-worlds'`` when there are fewer clauses than input tuples, and over the compilers when ``m`` is small. It applies only to a DNF-shaped circuit and errors when the clause count exceeds 24: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'sieve') FROM suspects; ``'monte-carlo'`` Approximate computation by random sampling. The third argument is either a fixed sample count (a bare integer or ``samples=N``) or an **additive** ``(ε, δ)`` target ``epsilon=E[,delta=D][,max_samples=M]`` (default ``eps=0.1, delta=0.05`` when omitted): .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'monte-carlo', '10000') FROM suspects; SELECT probability_evaluate(provenance(), 'monte-carlo', 'eps=0.01') FROM suspects; The ``(ε, δ)`` form guarantees that the estimate is within ``ε`` of the true probability ``p`` (in **absolute** terms) with probability at least ``1 − δ``, drawing ``N = ⌈ln(2/δ)/(2ε²)⌉`` samples (Hoeffding's inequality); the count is independent of ``p``. Because the error is *absolute*, an ``ε`` of, say, ``0.1`` is uninformative on a rare-event output with ``p ≪ ε``; for a **relative**-error guarantee in that regime use ``'karp-luby'``. Pin ``provsql.monte_carlo_seed`` for a reproducible estimate. ``'karp-luby'`` Approximate computation by the Karp-Luby fully-polynomial randomised approximation scheme (FPRAS) for ``#DNF`` :cite:`DBLP:journals/jal/KarpLM89`. It delivers a **relative** ``(ε, δ)`` guarantee -- the estimate is within a *factor* ``1 ± ε`` of the true probability with probability at least ``1 − δ`` -- at a sample count independent of that probability. This is the guarantee that stays meaningful on rare-event outputs, where naive Monte Carlo's *absolute* ``ε`` (see ``'monte-carlo'`` above) says nothing. It applies to **DNF-shaped** circuits: a monotone disjunction (top-level ``OR``) of conjunctions (``AND``) of input leaves -- the lineage shape of a union of conjunctive queries over a tuple-independent database. Leaves may be shared across clauses. The method errors (it does not silently fall back) on any other shape: negation (``EXCEPT``/``monus``), comparison (``HAVING``), aggregation, random-variable, or multivalued (BID) gates. The third argument selects a fixed sample count or an ``(ε, δ)`` accuracy target (default ``epsilon=0.1, delta=0.05`` when omitted): - ``samples=N`` (or a bare integer ``N``) -- a fixed number of sampling rounds; deterministic runtime. The rounds are spread across the clauses by *stratified* sampling (each clause gets a share proportional to its probability), which tightens the estimate at a given budget compared with drawing a clause at random each round. - ``epsilon=E`` (alias ``eps=E``) -- relative-error target, served by a *self-adjusting stopping rule*: the method samples only until the estimate is provably within the target, so on outputs whose clauses barely overlap it stops far short of the worst-case ``⌈4(e−2)·m·ln(2/δ)/ε²⌉`` rounds over the ``m`` clauses. - ``delta=D`` -- failure-probability target (only with ``epsilon``). - ``max_samples=N`` -- caps the number of rounds (only with the adaptive path), bounding the runtime for very small ``ε`` or large ``m``; if the cap is hit before the target, the reported guarantee is downgraded to the accuracy actually achieved. .. code-block:: postgresql -- fixed budget SELECT probability_evaluate(provenance(), 'karp-luby', '100000') FROM suspects; -- (ε, δ) guarantee SELECT probability_evaluate(provenance(), 'karp-luby', 'eps=0.05,delta=0.01') FROM suspects; ``samples`` is mutually exclusive with ``epsilon``/``delta``. Pin ``provsql.monte_carlo_seed`` for a reproducible estimate. ``'stopping-rule'`` A universal **relative** ``(ε, δ)`` estimator that runs on the generic circuit, so unlike ``'karp-luby'`` it applies to **any** lineage -- plain Boolean, random-variable, or HAVING-aggregate alike. It samples under an optimal stopping rule, halting as soon as the estimate is provably within the relative target, in ``O(S / (p ε²) · ln(1/δ))`` for an output of probability ``p``. The third argument is the ``(ε, δ)`` target (with an optional ``max_samples`` cap; if the cap is reached first the guarantee degrades from relative to the additive accuracy actually achieved). Pin ``provsql.monte_carlo_seed`` for a reproducible estimate: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'stopping-rule', 'eps=0.05,delta=0.01') FROM suspects; ``'tree-decomposition'`` Exact computation via a tree decomposition of the Boolean circuit :cite:`DBLP:journals/mst/AmarilliCMS20`. Built-in; no external tool required. Fails if the treewidth exceeds the maximum supported value: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'tree-decomposition') FROM suspects; ``'d-tree'`` Anytime **certified-interval** computation :cite:`DBLP:conf/icde/OlteanuHK10`: starting from cheap leaf bounds, it refines the interval by independent-or decomposition (the connected components of the clause graph) and Shannon expansion on the most frequent variable until the interval is narrow enough, or exact (width 0). It fills two corners the other exact methods do not: it returns an exact value where the lineage treewidth **exceeds** ``'tree-decomposition'``'s cap, and -- being *deterministic*, at a cost independent of ``δ`` -- it is the method that honours a ``δ = 0`` (no-failure) approximate request, returning a certified interval rather than a point estimate. It works on any Boolean circuit (a monotone-DNF lineage takes an optimised path). Called by name with no third argument it refines to the exact value; given an accuracy target it stops at a certified interval of that width: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'd-tree') FROM suspects; ``'inversion-free'`` Exact, linear-time computation for the *inversion-free* ``UCQ(OBDD)`` class :cite:`DBLP:conf/icdt/JhaS11` -- hierarchical, tuple-independent queries (self-joins allowed) whose lineage admits a polynomial-size OBDD, where ``'tree-decomposition'`` would blow up. It requires the planner's inversion-free certificate on the provenance root and errors without it. The default strategy already takes this path automatically when the certificate is present, so naming the method is mainly useful for testing; see :ref:`inversion-free-route` for what the certifier accepts: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'inversion-free') FROM suspects; ``'mobius'`` Exact computation for the safe UCQs that need :ref:`Möbius inversion ` -- those tractable only because the :math:`\#P`-hard terms of their inclusion-exclusion expansion cancel (the :math:`q_9` / :math:`Q_W` class). It applies to a token whose root is the signed ``gate_mobius`` combination the planner substitutes for such a query; it is a linear sweep that sums the certified-independent islands' probabilities with the stored integer coefficients (it errors on a token that is not ``gate_mobius``-rooted). This is the *fast* route only: the gate carries the query's literal lineage, so naming **another** method on the same token (``'possible-worlds'``, ``'monte-carlo'``, ...), or asking for ``shapley`` / ``banzhaf``, evaluates that lineage instead and returns the same exact answer (slower). The default strategy already takes the fast Möbius path automatically for a ``gate_mobius``-rooted token, so naming it explicitly is mainly useful for testing: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'mobius') FROM safe_ucq; ``'compilation'`` Exact computation by first compiling the circuit to a d-DNNF using an external tool, then evaluating the d-DNNF. The third argument names the tool: ``'d4'`` (default), ``'d4v2'``, ``'c2d'``, ``'dsharp'``, ``'minic2d'``, or one of the Panini target languages from KCBox :cite:`DBLP:conf/cav/LaiMY25`, ``'panini-obdd'``, ``'panini-obdd-and'`` :cite:`DBLP:journals/jair/LaiLY17`, ``'panini-decdnnf'``: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'compilation', 'd4') FROM suspects; The tool must be installed and accessible in the PostgreSQL server's PATH, or in a directory listed in the ``provsql.tool_search_path`` GUC (see :doc:`configuration`); :sqlfunc:`tool_available` reports the backend's view of a given tool. The CNF handed to the compiler and the resulting d-DNNF can both be inspected; see :doc:`knowledge-compilation`. ``'wmc'`` Weighted model counting (umbrella over several counters); the guarantee depends on the chosen tool -- ``'ganak'`` / ``'sharpsat-td'`` / ``'dpmc'`` are exact, ``'weightmc'`` is an approximate ``(ε, δ)`` counter. The third argument selects the counter and its options as ``tool=[,epsilon=E][,delta=D]`` (the legacy ``tool[;tool_args]`` form is still accepted): ``'ganak'`` :cite:`DBLP:conf/ijcai/SharmaRSM19`, ``'sharpsat-td'`` :cite:`DBLP:conf/cp/KorhonenJ21`, ``'dpmc'`` :cite:`DBLP:conf/cp/DudekPV20`, or ``'weightmc'``. Same PATH / ``provsql.tool_search_path`` considerations as ``'compilation'``: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'wmc', 'ganak') FROM suspects; Default strategy (no second argument) With no method named, ``probability_evaluate(provenance())`` requests the **exact** guarantee, and the cost-based chooser (see :ref:`above `) runs the cheapest exact method applicable to the circuit: typically ``independent`` or ``inversion-free`` for safe queries, ``tree-decomposition`` for low-treewidth lineage, falling back to ``compilation`` with the compiler named by ``provsql.fallback_compiler`` (default ``'d4'``, see :doc:`configuration`) when no in-process method fits. Optimistic picks run under a budget and escalate automatically, so a pathological circuit never hangs on the wrong method. To time every method on one circuit and compare results side by side, use ProvSQL Studio's benchmark panel; see :doc:`studio`. Performance optimizations under the hood ---------------------------------------- Probability evaluation runs through the Boolean-circuit pipeline (``getBooleanCircuit``, then one of the evaluation methods above). Two families of optimisation exploit Boolean-specific structure to make this faster, sometimes by orders of magnitude; both are transparent to the result. .. _safe-query-rewriting: Safe-query rewriting (provenance class ``'boolean'``) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ When the provenance class is ``'boolean'`` (``provsql.provenance``, off by default), the planner recognises the *safe* hierarchical conjunctive-query subclass of Dalvi-Suciu :cite:`DBLP:journals/jacm/DalviS12` and rewrites such queries with per-atom ``DISTINCT`` projections so that the resulting provenance circuit is *read-once*. A read-once circuit can be probability-evaluated in linear time by the ``'independent'`` method, instead of falling through to ``'tree-decomposition'`` or external compilation. The rewriter recognises self-join-free hierarchical conjunctive queries over TID or BID base tables, plus a number of extensions that recover safety for query shapes the raw hierarchical criterion would reject (FD-aware reductions driven by primary keys / NOT-NULL UNIQUE constraints, constant selections, transparent deterministic relations, certain self-joins, UCQs with disjoint branches…); see :ref:`safe-query-rewriter` in the developer documentation for the full set. Queries outside the recognised class are passed through unchanged: the GUC enables an opt-in shortcut, never a different result. .. code-block:: postgresql SET provsql.provenance = 'boolean'; SELECT person, ROUND(probability_evaluate(provenance())::numeric, 4) FROM suspects, witnesses WHERE suspects.case_id = witnesses.case_id; **Trade-off.** The rewriter tags the root gate so that semiring evaluators incompatible with Boolean rewriting refuse to run on the result (see :doc:`semirings` for the compatibility list). In practice this means: turn the GUC on for probability-heavy workloads on hierarchical CQs, turn it off (or re-evaluate in a fresh session) before running ``sr_counting``, ``sr_how``, ``sr_why`` on the same circuit. .. _inversion-free-route: Inversion-free certification ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The *inversion-free* ``UCQ(OBDD)`` class of Jha & Suciu :cite:`DBLP:conf/icdt/JhaS11` -- hierarchical, tuple-independent queries whose lineage admits a polynomial-size OBDD -- is a second linear-time route for safe queries, a sibling of the :ref:`safe-query rewrite `. ProvSQL certifies the query, attaches the certificate to the provenance root, and the default chooser takes the route automatically, right after ``'independent'``, so no method need be named. Where ``'tree-decomposition'`` would blow up because the lineage is not low-treewidth, the route builds a *structured* d-DNNF over a query-derived variable order that stays linear in the lineage, and ``'inversion-free'`` reads it in one pass. It differs from the safe-query rewrite on three counts: - **Self-joins.** The inversion-free class natively admits queries that join a relation with itself; the safe-query rewrite targets self-join-free CQs and recovers only limited self-join cases. - **Provenance scheme.** The inversion-free path evaluates the literal lineage unchanged, so it does **not** require the ``'boolean'`` provenance class -- it applies under the default semiring scheme too -- and is governed by its own ``provsql.inversion_free`` GUC (on by default). The safe-query rewrite restructures the query and fires only under ``provsql.provenance = 'boolean'``. - **Edge cases.** In exchange it certifies fewer shapes: the safe-query rewrite recovers safety from functional dependencies and BID blocks (see :ref:`safe-query-rewriting`), which the inversion-free certifier does not -- its atoms must be strictly tuple-independent. (A plain constant selection is fine either way: the certifier treats it as a transparent atom-local filter.) The certifier does let a **non-tracked relation** act as a transparent filter, and **flattens SPJ subqueries and views** before checking the class -- a join inside a view, a view referenced several times (a structured self-join), and views-over-views all reduce to their base atoms first. An aggregating or ``UNION`` view, a correlated subquery, or a query still non-hierarchical after flattening is not certified and falls back to another method. See :ref:`inversion-free-path` in the developer documentation for the full pipeline. .. _safe-ucq-mobius: Möbius inversion for safe UCQs ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Some unions of conjunctive queries are *safe* (PTIME data complexity) while being neither hierarchical nor inversion-free: they are tractable only because the :math:`\#P`-hard terms of their inclusion-exclusion expansion carry a zero Möbius coefficient and cancel. The canonical witness is Dalvi & Suciu's :math:`q_9` / :math:`Q_W` :cite:`DBLP:journals/jacm/DalviS12`; the lattice computation follows Dalvi, Schnaitter & Suciu (PODS 2010). Under the ``'boolean'`` provenance class, when the safe-query rewriter and the inversion-free certifier both decline a UCQ-existence shape (a ``SELECT DISTINCT`` / ``GROUP BY`` over a ``UNION``), ProvSQL recognises this class and roots the provenance in a *signed Möbius combination* over read-once islands, which the default probability route evaluates in one linear pass; no method needs to be named. :doc:`Case Study 7 ` runs the complete :math:`q_9` example. Like the safe-query rewrite, this is a shortcut, not a different result: the gate keeps the query's literal lineage as a transparent child, so :sqlfunc:`shapley`, :sqlfunc:`banzhaf`, PROV export, and any *named* probability method (``possible-worlds``, …) answer exactly as on the ordinary provenance, necessarily more slowly, since the literal lineage is the very :math:`\#P`-hard circuit the cancellation sidesteps. Only the default / ``mobius`` probability takes the fast route. The route runs in :math:`O(|D|^e)` (:math:`e` the essential-variable count), so the linear hierarchical and inversion-free routes are tried first; where it applies, it takes precedence over the :ref:`joint-width compiler `, whose success on these queries is not guaranteed. Inputs must be tuple-independent and the UCQ in reduced form (no constants, no bag multiplicity, no overlapping self-join slots); anything else falls back to joint width, then the general chooser. ``provsql.mobius`` (on by default) and the ``provsql.mobius_max_gates`` data-cost cap control the route. .. _having-shortcuts: HAVING closed-form shortcuts ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For the common ``GROUP BY g HAVING op c`` thresholds (``op`` one of ``>=``, ``>``, ``<=``, ``<``, ``=``, ``<>``) ProvSQL computes the group's probability in closed form, replacing the exponential DNF the general HAVING path would build. Each fires automatically when its soundness preconditions hold -- each per-row provenance a single ``gate_input`` leaf, the group-level aggregate not shared with another comparator -- and needs no GUC. **COUNT.** ``COUNT(*) op c`` is recognised as a Poisson-binomial CDF over the per-row Bernoulli indicators and computed directly, in ``O(N × min(C, N−C))`` per group (``N`` the per-group row count). HAVING-COUNT queries that would otherwise hit ``'tree-decomposition'`` or ``'compilation'`` now resolve in milliseconds. A multi-comparator HAVING (``COUNT(*) >= a AND COUNT(*) <= b``) falls through to the general path. **MIN / MAX.** ``MIN(a) op c`` and ``MAX(a) op c`` need no DP: ProvSQL partitions the group's rows on whether their value ``a`` satisfies the comparison against ``c`` and computes the probability as a product of the rows' presence probabilities, in ``O(N)`` per group. For example ``MAX(a) >= c`` holds iff at least one row with ``a >= c`` is present, with probability ``1 − ∏ (1 − p_i)`` over those rows; ``MIN(a) >= c`` holds iff no row with ``a < c`` is present and the group is non-empty. All twelve ``(MIN|MAX, op)`` cases have analogous closed forms. **SUM.** ``SUM(a) op c`` is handled by a weighted-sum dynamic program: the distribution of the group's running sum over the present rows is built by convolution, and the probability is read off as the mass of the sums satisfying the comparison (with the empty group excluded), in ``O(N × R)`` per group with ``R`` the range of reachable sums. Because ``R`` grows with the magnitude of the values the shortcut is *pseudo*-polynomial and steps aside for the general path when the range is too wide; for the usual small-integer weights it replaces the exponential enumeration with a fast DP.