.. _knowledge-compilation: Knowledge Compilation ===================== ProvSQL builds **provenance circuits** from SQL queries transparently, through its planner hook. When Boolean provenance is required, whether for probability evaluation, for Shapley value computation, or simply at the user's request, a Boolean circuit of a particularly convenient form is obtained. This chapter follows the full pipeline behind :doc:`probability evaluation ` and shows how to inspect every intermediate artifact: .. code-block:: text SQL query → provenance circuit → CNF → d-DNNF → probability Each arrow is exposed by a SQL function, and each artifact can be rendered as `GraphViz DOT `_ or printed as `DIMACS `_ text. The same surfaces drive the knowledge-compilation panels of :doc:`ProvSQL Studio `. .. note:: Because this chapter is entirely about knowledge compilation over Boolean provenance, you will usually want ProvSQL's Boolean-only optimisations switched on for the whole session. They are gated behind the :ref:`provsql.boolean_provenance ` GUC (off by default): issue ``SET provsql.boolean_provenance = on;``, or pick :guilabel:`Boolean` on ProvSQL Studio's :guilabel:`Provenance scheme` switch next to the query box. The circuit ----------- Every probabilistic query result carries a provenance token (a UUID) naming the root of its provenance circuit (see :doc:`provenance-tables` and :doc:`querying`). The circuit is a DAG over ``input`` leaves and ``plus`` / ``times`` / ``monus`` gates. :sqlfunc:`view_circuit` renders it as DOT; Studio's :ref:`circuit mode ` renders it interactively. The remaining sections take such a token and walk it down to a probability. For the examples below, assign probabilities to the inputs first (see :doc:`probabilities`): .. code-block:: postgresql SELECT set_prob(provenance(), reliability) FROM sightings; From circuit to CNF ------------------- Before invoking an external compiler, ProvSQL encodes the Boolean circuit into a `CNF `_ formula by the `Tseytin transformation `_: one fresh variable per gate, plus clauses asserting that each gate variable is equivalent to the Boolean combination of its inputs. This is the exact encoding the extension streams to ``d4`` / ``c2d`` / ``minic2d`` / ``dsharp`` on a temporary file; :sqlfunc:`tseytin_cnf` returns it as text instead: .. code-block:: postgresql SELECT tseytin_cnf(provenance()) FROM suspects WHERE id = 1; The output is DIMACS. With the default ``mapping => true`` it opens with one ``c input`` comment line per input variable, the self-documenting variable mapping described :ref:`below `; the ``p cnf `` header and the clauses follow. With the default ``weighted => true``, the literal-weight lines required by `weighted model counters `_ are appended, one per input literal:: c input 1 7f3a2b1c-… 0.5 c input 2 9b2c4d5e-… 0.6 c input 3 a1b2c3d4-… 0.7 p cnf 6 7 -4 0 5 0 -6 1 0 ... w 1 0.5 w -1 0.5 w 2 0.6 w -2 0.4 Pass ``weighted => false`` to obtain the bare CNF (no ``w`` lines), for plain (unweighted) model counting or for feeding a compiler that reads weights separately: .. code-block:: postgresql SELECT tseytin_cnf(provenance(), weighted => false) FROM suspects WHERE id = 1; .. _tseytin-cnf-mapping: Reading a model back: the variable mapping ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ A DIMACS variable is just an integer, so a satisfying assignment or weighted count returned by an external tool is meaningless until you know which provenance input each variable stands for. This is what the ``c input`` lines at the top of the output above record, one per input variable: its DIMACS number, the originating provenance UUID, and its probability. They make the CNF self-documenting at no cost to the solver: model counters and compilers ignore ``c`` comment lines, so the file stays valid DIMACS. Pass ``mapping => false`` to omit them. Only input variables appear (the auxiliary Tseytin variables, one per gate, are not provenance inputs). The same information is available as a table through :sqlfunc:`tseytin_cnf_mapping`, which is what you want when reading a tool's output back programmatically: .. code-block:: postgresql SELECT * FROM tseytin_cnf_mapping( '00000000-0000-0000-0000-000000000000'); Feed it a concrete token, either a literal UUID or one materialised in a plain table. You cannot pull the token inline from a provenance-tracked relation in the same statement (``FROM suspects s, tseytin_cnf_mapping(s.provsql) m``): while ``provsql.active`` is on the planner hook refuses to rewrite a multi-column function applied to a tracked relation (*FROM function with multiple output attributes not supported*). Set ``provsql.active`` off for that pattern. The ``variable`` column matches the DIMACS numbering, ``gate`` is the original-circuit input UUID, and ``probability`` its weight. In ProvSQL Studio, the Tseytin CNF panel goes one step further and annotates each ``c input`` line with the *source tuple* the variable came from (resolved from the tracked relations), so you can see at a glance that, say, variable 3 is ``suspects(5, …)``. From CNF to d-DNNF ------------------ An external **knowledge compiler** turns the CNF into a deterministic, decomposable `negation normal form `_ (d-DNNF), on which weighted model counting (hence probability) is linear-time. :sqlfunc:`compile_to_ddnnf_dot` runs the requested compiler and returns the resulting d-DNNF as a GraphViz digraph over ``AND`` / ``OR`` / ``NOT`` / input gates: .. code-block:: postgresql SELECT compile_to_ddnnf_dot(provenance(), 'd4') FROM suspects WHERE id = 1; The second argument names the compiler. ProvSQL ships bindings for: ``'d4'`` Decision-DNNF compiler with backtracking :cite:`DBLP:conf/ijcai/LagniezM17`; the default for the ``compilation`` probability method. ``'d4v2'`` The rewritten d4 :cite:`LagniezM21d4v2`. ``'c2d'`` The original dtree-guided compiler :cite:`DBLP:conf/ecai/Darwiche04`. ``'minic2d'`` Top-down CNF-to-SDD compiler :cite:`DBLP:conf/ijcai/OztokD15`. ``'dsharp'`` DPLL-style compiler built on sharpSAT :cite:`DBLP:conf/ai/MuiseMBH12`. ``'panini-obdd'``, ``'panini-obdd-and'``, ``'panini-decdnnf'`` Three target languages of KCBox's Panini compiler :cite:`KCBoxPanini`: * ``OBDD`` is the canonical `ordered Boolean decision diagram `_, a strict subset of d-DNNF whose decisions are restricted to a global variable order. * ``OBDD[AND]`` augments OBDD with internal AND nodes :cite:`DBLP:journals/jair/LaiLY17`, recovering a more compact representation while keeping polynomial Apply. * ``Decision-DNNF`` drops the variable order, retaining only the decomposability + determinism of d-DNNF; it is the canonical target of ``d4``. Panini also ships ``R2-D2`` and ``CCDD`` target languages :cite:`DBLP:journals/corr/abs-2202-10025`. ProvSQL does **not** expose them: both emit ``K`` (kernelize) nodes encoding literal-equivalence constraints over a shared kernel variable, which break the decomposability invariant of a d-DNNF. A direct AND-translation gives silently-wrong probabilities; a correct translation requires case-splitting on the kernel variables and is not yet implemented. Each compiler must be installed and reachable on the PostgreSQL server's ``PATH``, or in a directory listed in the ``provsql.tool_search_path`` GUC (see :doc:`configuration`). The same query, compiled by different tools, yields different d-DNNFs for the same Boolean function: comparing their size and sharing is instructive, and is one of the things the Studio knowledge-compilation panel makes visual. Exporting the d-DNNF as text ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ :sqlfunc:`compile_to_ddnnf_dot` is for the eye. To get the compiled circuit as a machine-readable artifact -- to feed it to an external d-DNNF reasoner or verifier, or to archive it -- :sqlfunc:`compile_to_ddnnf` returns the same circuit in the standard c2d / d4 ``.nnf`` text format: .. code-block:: postgresql SELECT compile_to_ddnnf(provenance(), 'd4') FROM suspects WHERE id = 1; The output opens with an ``nnf <#nodes> <#edges> <#vars>`` header, then one line per node: ``L `` for a literal leaf, ``A …`` for an AND over ``k`` earlier nodes, ``O …`` for an OR (``j`` is the decision variable, ``0`` when ProvSQL records none). It accepts the same compiler / meta-route names as :sqlfunc:`compile_to_ddnnf_dot`. The literal variables use the **same numbering as** :sqlfunc:`tseytin_cnf`, even when an external compiler renumbered them internally, so a ``.cnf`` and a ``.nnf`` of the same circuit cross-reference and :sqlfunc:`tseytin_cnf_mapping` interprets both. In Studio the *Compiled d-D (NNF text)* eval-strip option shows the NNF and the copy button yields it verbatim. Measuring the compiled d-DNNF ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To make that comparison quantitative rather than visual, :sqlfunc:`ddnnf_stats` returns the structural statistics of the d-DNNF a given compiler produces, as a jsonb object: .. code-block:: postgresql SELECT jsonb_pretty(ddnnf_stats(provenance(), 'd4')) FROM suspects WHERE id = 1; The object reports ``nodes`` and ``edges``, the ``and`` / ``or`` / ``not`` / ``inputs`` gate-type split, whether the result is ``smooth`` (every OR gate's children share their variable set), the longest-path ``depth``, the circuit ``treewidth`` (``null`` when it exceeds the supported limit), and the ``compile_ms`` wall-clock compile time. It accepts the same compiler names as :sqlfunc:`compile_to_ddnnf_dot`, including the in-process ``interpret-as-dd`` and ``tree-decomposition`` routes, so a single string change re-measures a different compiler on the same circuit:: { "and": 3, "or": 4, "not": 0, "inputs": 5, "nodes": 12, "edges": 14, "depth": 6, "smooth": true, "treewidth": 3, "compiler": "d4", "compile_ms": 1.83 } In Studio, the compiled-d-DNNF canvas shows a gates / edges / depth summary in its subtitle, and the probability benchmark (below) adds a ``d-DNNF (n/e)`` column so the size of every compiler's output sits beside its run time in one table. The in-process routes --------------------- ProvSQL also ships two **built-in** compilers that need no external tool. Both are accepted wherever a compiler name is, by :sqlfunc:`compile_to_ddnnf_dot`, :sqlfunc:`compile_to_ddnnf`, :sqlfunc:`ddnnf_stats`, and the matching :sqlfunc:`probability_evaluate` methods. ``interpret-as-dd`` reinterprets the provenance circuit *directly* as a d-DNNF, with no compilation step, reading each ``times`` as an AND of independent children and each ``plus`` as an *independent* OR, the latter rewritten by De Morgan into a ``NOT`` over an AND of ``NOT``\ s (``¬(¬a ∧ ¬b ∧ …)``) so the result stays a genuine d-DNNF. It is therefore exact only on circuits whose gates are genuinely independent, the shape an independent or read-once query produces; it does not try to certify that ``plus`` gates are deterministic (mutually exclusive), which would be expensive to assert. Gate types it cannot read this way raise an unsupported-gate error. It is the cheapest route, and the one the default method tries first. ``tree-decomposition`` is the structural fallback: it builds a `tree decomposition `_ of the Boolean circuit and compiles it to a d-DNNF in time linear in the circuit and singly-exponential in the `treewidth `_ :cite:`DBLP:journals/mst/AmarilliCMS20`. It fails, raising a *Treewidth greater than N* error, when the treewidth exceeds the supported limit; the default method then falls through to an external compiler (the ``provsql.fallback_compiler`` GUC, default ``d4``). :sqlfunc:`tree_decomposition_dot` exposes the underlying min-fill decomposition as DOT, so the variable-elimination order that yields the in-process d-DNNF is itself inspectable. Its first line is a comment carrying the treewidth:: // treewidth=3 .. code-block:: postgresql SELECT tree_decomposition_dot(provenance()) FROM suspects WHERE id = 1; Weighted model counters: the ``wmc`` umbrella --------------------------------------------- The ``wmc`` method dispatches to a family of exact weighted model counters, each consuming the same DIMACS CNF emitted by :sqlfunc:`tseytin_cnf` (with the literal-weight ``c p weight`` lines required by the MCC 2024 input format): ``'ganak'`` A projected weighted model counter from the meelgroup :cite:`DBLP:conf/ijcai/SharmaRSM19`. Compact and self-contained; requires a single ``ganak`` binary on ``PATH``. ``'sharpsat-td'`` Tree-decomposition-guided exact counter :cite:`DBLP:conf/cp/KorhonenJ21`. Needs the ``flow_cutter_pace17`` helper alongside ``sharpsat-td`` so the counter can shell out to its tree-decomposer. ``'dpmc'`` Two-stage planner + executor pipeline :cite:`DBLP:conf/cp/DudekPV20`: ``htb`` builds a project-join tree, ``dmc`` consumes it. Both binaries must be installed. ``'weightmc'`` The approximate ApproxMC variant (kept under the umbrella for discoverability; see :doc:`probabilities` for its ``epsilon;delta`` knob). Invocation goes through :sqlfunc:`probability_evaluate`: .. code-block:: postgresql SELECT probability_evaluate(provenance(), 'wmc', 'ganak') FROM suspects WHERE id = 1; Switching between counters is a single-string change, which is what the benchmark below exploits to compare them on the same circuit. Inspecting the in-memory circuit -------------------------------- Before any compilation runs, the ``provsql.simplify_on_load`` GUC may have already rewritten parts of the circuit: identity / absorber collapses, ``gate_cmp`` resolutions for circuits over random variables, hybrid-evaluator simplifications (see :doc:`probabilities`). :sqlfunc:`simplified_circuit_subgraph` returns the post-simplification DAG, rooted at the given token, as a JSON adjacency list. This is the exact shape the probability evaluators traverse: .. code-block:: postgresql SELECT jsonb_pretty(simplified_circuit_subgraph(provenance())) FROM suspects WHERE id = 1; Each node carries its gate type, an inline ``extra`` field for typed leaves (random variables, ``cmp`` thresholds), and the longest-path depth from the root. The longest path is the canonical circuit-depth notion: it tracks the deepest chain of operators between the node and the output, which is what governs evaluation cost and matches the depth a renderer would draw. Comparing probability methods ----------------------------- :sqlfunc:`probability_benchmark` times every probability-evaluation method on a single circuit token and returns one row per method with its wall-clock duration and result. It runs ``independent``, ``possible-worlds``, ``tree-decomposition``, ``monte-carlo``, each external compiler through the ``compilation`` method, and each weighted model counter under the ``wmc`` umbrella. Methods that cannot apply (an uninstalled compiler, a non-independent circuit, a treewidth blow-up, …) are captured per row in an ``error`` column rather than aborting the whole comparison. ``probability_benchmark`` takes a concrete token, so feed it a literal UUID or one materialised in a plain table, exactly like :sqlfunc:`tseytin_cnf_mapping` above (and with the same restriction: do not call it inline over a provenance-tracked relation while ``provsql.active`` is on): .. code-block:: postgresql SET provsql.monte_carlo_seed = 42; -- reproducible Monte-Carlo row SELECT method, args, ROUND(probability::numeric, 6) AS prob, ROUND(milliseconds::numeric, 1) AS ms, error FROM probability_benchmark('00000000-0000-0000-0000-000000000000') ORDER BY method, args NULLS FIRST; The exact methods agree to numerical precision; the approximate ones (``monte-carlo``, ``weightmc``) land within their confidence band. The second and third arguments tune the Monte-Carlo sample count (default ``10000``) and the ``epsilon;delta`` forwarded to ``weightmc`` (default ``'0.8;0.2'``). Checking tool availability -------------------------- Because the compilers and model counters are optional external binaries, :sqlfunc:`tool_available` reports whether a given tool resolves to an executable on the backend's effective ``PATH``, including the ``provsql.tool_search_path`` prefix, exactly as a subsequent ``compilation`` or ``view_circuit`` call would see it: .. code-block:: postgresql SELECT tool_available('d4'), tool_available('c2d'); A bare name is resolved through the shell; a name containing a slash is tested as a path. Studio uses this to grey out compilers that are not installed rather than letting them fail at run time. In ProvSQL Studio ----------------- The knowledge-compilation pipeline is available without leaving the browser. In Studio's :ref:`evaluation strip `, the same artifacts surface as inline panels: the DIMACS CNF, the compiled d-DNNF rendered beside the original circuit, the tree decomposition with its treewidth, and a one-click benchmark across all probability methods. Compilers that :sqlfunc:`tool_available` reports as missing are filtered out of the pickers, and the Studio benchmark skips those methods altogether: unlike the SQL :sqlfunc:`probability_benchmark`, which still emits a row per method and records the failure in its ``error`` column, their rows simply do not appear. .. seealso:: - :doc:`probabilities` for the probability methods these artifacts feed. - :doc:`semirings` for the broader semiring-evaluation surface. - :doc:`export` and :sqlfunc:`view_circuit` for circuit visualisation. - :doc:`configuration` for ``provsql.tool_search_path`` and related GUCs.