# Exploiting bounded-treewidth data Design space for exploiting *bounded treewidth of the input data* in ProvSQL, in the spirit of Courcelle's theorem and its provenance refinement. Anchored on Amarilli, Bourhis & Senellart, *Provenance Circuits for Trees and Treelike Instances* (ICALP 2015) and *Combined Tractability of Query Evaluation via Tree Automata and Cycluits* (ICDT 2017). Complements [`safe-query-followups.md`](safe-query-followups.md), whose tractability classes are query-side; this one is data-side. ## The opportunity Courcelle's theorem: a fixed MSO query is evaluable in time linear in the data on instances of bounded treewidth, by compiling the query to a bottom-up tree automaton and running it over a tree decomposition of the instance. The provenance refinement strengthens this to provenance: for a fixed query `Q` and an instance `I` of treewidth `k`, one can compute, in time `O(f(k,Q) * |I|)`, a provenance circuit whose treewidth is bounded by `g(k,Q)` -- *independent of* `|I|`. That bounded-treewidth circuit is exactly what `TreeDecomposition` + `dDNNFTreeDecompositionBuilder` turn into a d-DNNF for linear-time probability evaluation (`dDNNF::probabilityEvaluation`). ## It is not a free lunch The guarantee holds only for a circuit **constructed the right way**. ABS bound the treewidth of a provenance circuit *constructed along a tree decomposition of the data*; nothing forces the lineage to acquire that bound automatically. ProvSQL builds provenance along the **relational-algebra plan** instead: the planner hook combines per-relation tokens with `plus` / `times` / `monus` in `make_provenance_expression` (`src/provsql.c`), and the resulting gate DAG mirrors the join plan. The knowledge compiler then decomposes its primal graph (`Graph(const BooleanCircuit&)`, `src/Graph.h:48`; `TreeDecomposition(const BooleanCircuit&)`). **At no point does the data's adjacency structure inform the gate layout**, so nothing certifies the circuit's treewidth ahead of building it. Empirically the dividing line is **local-vs-non-local**. Local joins on bounded-degree data (path / grid UCQs) stay good: treewidth is governed by `tw(I)` and the query, flat in `|I|`. Two ingredients break it without leaving the relational fragment -- **high degree** (a self-join through a star centre, `tw(I)=1`, yields treewidth `n`) and **cross-products** (`SELECT DISTINCT 1 FROM e a, e b` yields treewidth `n`) -- where ProvSQL materialises the full `Theta(n)`-treewidth sum-of-products despite a trivial optimal circuit. The absorption-reachable case (`SELECT DISTINCT 1 FROM e a, e b` with a join) folds to treewidth 1 today via the joint-fixpoint Boolean fold in `foldBooleanIdentities`; the threshold case (the in-star self-join, "at least two edges true") does not (see Route 3). The **recursive fragment** compounds this: reachability on a bounded-treewidth grid grows circuit treewidth with `|I|` and crosses `MAX_TREEWIDTH` even though `tw(I)` is a small constant -- the one place the gap is intrinsic and `|I|`-driven. This is a different axis from the deferred item in [`safe-query-followups.md`](safe-query-followups.md) ("Other tractable CQ subclasses"), which considers treewidth of the *query hypergraph* (a query-side parameter). The present study is about *data* treewidth, which changes *what circuit we construct*. ## Out of scope - **Query-hypergraph (query-side) treewidth** -- covered, and deferred, in [`safe-query-followups.md`](safe-query-followups.md). - **Circuit-side treewidth** -- already exploited; that is precisely what `TreeDecomposition` + `dDNNFTreeDecompositionBuilder` do today. - **Re-proving the ABS / Courcelle guarantees** -- cited, not reproved. ## Open work ### Route 3 -- extend independent-product factoring (`costar` threshold half) The relational explosions are an artefact of emitting the full sum-of-products when a much smaller equivalent circuit exists. ProvSQL's aggregation path already factors *independent* products optimally (`(OR r_x) AND (OR s_y)` for a disjoint cross-product, treewidth 1), and the absorption-reachable self-join now folds to treewidth 1 via the joint-fixpoint Boolean fold. What remains is the **threshold** case: the in-star self-join collapses to a symmetric threshold-2 (`OR_{i 0 FROM reach WHERE v = $t; -- is t reachable from s? ``` over a probabilistic `edge(src, dst)` whose graph has bounded treewidth: series-parallel networks (`tw=2`), outerplanar graphs, transit / utility networks, workflow / process graphs. Why this family: - *Genuine capability gap, not a speedup.* `Pr[t reachable from s]` is two-terminal network reliability -- `#P`-hard in general (Valiant), linear-time on bounded treewidth. Competing PDBs (MayBMS) cannot express it once connectivity is needed (MSO, not FO). - *ProvSQL cannot do it tractably today.* `eval_recursive` is acyclic / Boolean only and emits a strictly-DAG circuit of unbounded size on cyclic data; reachability circuit treewidth *grows with `|I|`* and crosses the cap even at constant `tw(I)`. - *The cheap levers do not apply.* It is recursive, so the Boolean fold and the route-3 threshold / factoring recogniser cannot reach it. Only a decomposition-aligned (cycluit) construction helps. And the automaton is small. For directed `s -> t` reachability the state at a bag is the subset of currently-active vertices already known reachable from `s`, plus a "reached `t`" flag: ``` introduce s : R = {s} edge (u,v) present, u in R : R := R ∪ {v} ( gated by x_e ) forget w : if w = t and w in R, set done; drop w from R combine (join bag) : R := R1 ∪ R2 ; done := done1 OR done2 final : done (or t in R at the root) ``` `2^(k+1)` states (`*2` for the flag) -- for series-parallel data (`tw=2`) a handful -- so the ABS circuit has constant treewidth `O(2^k)` and feeds the existing `dDNNFTreeDecompositionBuilder` unchanged. Undirected connectivity is the same idea with a *partition* of the bag in place of a subset (`Bell(k+1)` states). The minimal seed: take `tw(I)=1` outright -- **probabilistic tree / forest data** (XML / JSON, taxonomies, phylogenies, org charts), the original ABS "trees" case. The tree decomposition is the document itself (no min-fill), tree patterns map to tree automata natively, and ancestor / reachability queries collapse to a 2-3 state automaton. It ties into the probabilistic-XML line and makes a clean first prototype before general bounded-treewidth graphs. ### Route A -- full pipeline: data decomposition + tree automaton (deferred) The faithful ABS construction. Build a tree decomposition of the instance's Gaifman graph; compile `Q` to a bottom-up tree automaton; run the automaton over the tree encoding while emitting provenance gates per transition; hand the resulting bounded-treewidth circuit to the existing d-DNNF builder. - *Cost: very high.* FO / MSO to tree-automaton compilation is non-elementary in `|Q|` in the worst case; even restricted to CQ / UCQ, the encoding and the automaton construction are research-grade work to land inside a Postgres extension. - *Payoff:* the only route that delivers the provable linear-time guarantee on the queries where the lineage blows up. - *Status:* deferred until a workload defeats the lighter levers (the Boolean fold and route-3 factoring already address the absorption-reachable relational shapes), against the same d4-on-natural-lineage cost-benefit that deferred Monet's construction in [`safe-query-followups.md`](safe-query-followups.md). - *Prior art -- a standalone prototype already exists.* Mikaël Monet's MPRI M2 internship (2015, supervised by Senellart), *Probabilistic Evaluation of MSO Queries on Bounded Treewidth Instances*, implements this exact pipeline in Java: the `libtw` (`nl.uu.cs.treewidth`) tree-decomposition of the instance's Gaifman graph, the `lethal` tree-automata library, an on-the-fly automaton run that emits a Boolean provenance circuit, circuit-shrinking optimisations (final-state collapse, fact-irrelevance detection), linear-time probability evaluation, and a head-to-head benchmark against MayBMS. Its `tree/FunctionAutom*.java` are hand-written per-query automata and `tree/SFC.java` is a hand-coded MSO connectivity automaton. The report's stated limits corroborate this study: (i) MSO-to-automaton compilation is non-elementary and *was done by hand* (no implemented compiler), exactly the "genuinely new hard core" below; (ii) "treewidth quickly becomes a limiting factor". It is a reusable design reference (especially the on-the-fly automaton and the propagator-state trick that keeps the circuit linear) rather than something to port, so Route A is not a from-scratch research bet. #### What the automaton looks like A bottom-up automaton `A = (Q, delta, F)` runs over a tree encoding: bag elements carry names `1..k+1`, and a node may carry one fact `R(i1, ...)` over active names whose provenance annotation `x_t` is a Boolean input. The provenance circuit is then mechanical -- a gate `g[nu,q]` per node `nu` and state `q`: - leaf carrying fact `t` (present `-> q+`, absent `-> q-`): `g[nu,q+] = x_t`, `g[nu,q-] = NOT x_t`; - internal node, children `l`, `r`: `g[nu,q] = OR over {delta(q1,q2,a)=q} of (g[l,q1] AND g[r,q2] AND chi_a)`; - output: `OR over q in F of g[root,q]`. Size is `O(|Q|^2 * |E|)` (linear in the instance); **treewidth is `O(|Q|)`, independent of `|I|`**, because wires only ever connect a node's gates to its children's. So the **state count *is* the treewidth bound**. The pathological relational shapes are all constant-state automata -- e.g. the in-star self-join ("at least two of `n`") becomes a running counter capped at 2, three gates per spine node, the textbook linear-size treewidth-~3 sequential-threshold circuit for the same function ProvSQL emits as a `Theta(n)`-treewidth sum-of-products. The two moves a relational-plan builder has no analogue for are: **restrict the state to the active bag** (forget = drop) and **cap the summary**. This is the concrete case for Route C over more folding, and shows the bound is a property of *how* the circuit is built, not of the query fragment. ### Semiring threads - **Boolean / probability** -- primary thread, served by the existing `TreeDecomposition` -> `dDNNFTreeDecompositionBuilder` -> `dDNNF::probabilityEvaluation` path (already linear-time on bounded treewidth); matches the `provsql.boolean_provenance` regime. - **General m-semirings** -- the ABS construction is semiring-agnostic, but ProvSQL's any-semiring evaluation runs over the `GenericCircuit` (`provenance_evaluate_compiled`), which does not exploit treewidth at all. A bounded-treewidth `GenericCircuit` would still need a *width-aware semiring evaluator* (a bag-by-bag dynamic program analogous to the d-DNNF builder, but over the semiring carrier) to cash in the bound. A separate, larger axis; not assumed by the Boolean thread. ## Reuse map Reusable as-is, or with a thin adapter: - Min-fill elimination, `PermutationStrategy`, and the `Graph` mutating ops (`remove_node`, `fill`) -- but `Graph`'s only constructor takes a `BooleanCircuit` (`src/Graph.h:48`); a new builder taking a relational instance's Gaifman graph is needed. - `TreeDecomposition`'s bag / tree representation, `makeFriendly`, and the PACE-format reader (`src/TreeDecomposition.h`) -- the bag-tree structure is generic; today it is only ever built from a circuit. - `dDNNFTreeDecompositionBuilder` and the whole probability path -- consume a bounded-treewidth `BooleanCircuit` unchanged. Genuinely new (the hard core): - A tree *encoding* of a relational instance over a finite alphabet. - Query-to-tree-automaton compilation, even for a restricted fragment. - The provenance-emitting automaton run (a gate per transition), including, for Route C, the cyclic-provenance (cycluit) semantics that the strictly-DAG circuit framework does not currently support. Monet's 2015 Java prototype (see Route A, *Prior art*) is a worked, end-to-end reference for the first three in the relational (acyclic) case, so the design space is mapped even where the code is not directly portable. Only the cyclic-provenance (cycluit) semantics for Route C is genuinely uncharted. ## Priorities 1. **Route C** (recursive / reachability on bounded-treewidth graphs) -- the only place treewidth genuinely *grows* with `|I|` at constant `tw(I)`; the automaton is trivial; it composes with the in-flight recursive-query work. The substantive build. 2. **Factoring lever (route 3), `costar` threshold half** -- structural detection; scope only if a real workload exhibits it. 3. **Route A** (full automaton) -- deferred until the factoring lever proves insufficient.