# Möbius-inversion route for safe UCQs Plan for the last missing exact route of the Dalvi-Suciu dichotomy: unions of conjunctive queries that are safe **only because** the #P-hard terms of their inclusion-exclusion expansion carry a zero Möbius value on the CNF lattice and cancel. Anchored on Dalvi & Suciu, *The dichotomy of probabilistic inference for unions of conjunctive queries* (JACM 59(6), 2012) and its algorithmic core, Dalvi, Schnaitter & Suciu, *Computing query probability with incidence algebras* (PODS 2010); the LiftR reformulation in Gribkoff, Van den Broeck & Suciu (UAI 2014, arXiv:1405.3250); and Monet & Olteanu, *Towards deterministic decomposable circuits for safe queries* (AMW 2018, arXiv:1912.11098) for the lattice/Möbius computations at scale and the circuit-size impossibility results. This supersedes the *Möbius / inclusion-exclusion via Monet 2019's construction* entry of [`safe-query-followups.md`](safe-query-followups.md): that entry deferred an **intensional** route (building Monet's d-D circuit, arXiv:1912.11864) on practicality grounds that still stand. The route planned here is the **extensional** lattice-walking algorithm itself, packaged the way the joint-width route already is: a planner-time analysis attaching a certificate, an execution-time compiler producing a certified circuit over the gathered data, and a linear evaluation -- so the only genuinely new evaluation primitive is a signed linear combination at the root. Design requirements (fixed): 1. fires only when nothing better is available -- the safe-query rewriter, the inversion-free certificate, and the joint-width (data + correlations) route all keep priority; 2. invoked directly from the planner hook, no manual intervention: `SELECT probability(provenance()) FROM (… UNION …) t` just works; 3. easily demonstrated -- the demo query is the canonical QW / q9, with a stats surface that shows the lattice and the cancelled hard element. ## The gap (why) Current exact-route coverage of the safe-query landscape: | route | class covered | mechanism | |---|---|---| | safe-query rewriter (`src/safe_query.c`) | hierarchical self-join-free CQs (+ FD/constant extensions) | plan-time rewrite to read-once circuits, `independent` evaluation | | inversion-free certificate | safe queries needing no inclusion-exclusion cancellation | per-input markers + `inversion-free` method, O(S + N log N) | | joint-width (`src/joint_width_query.c`) | **data**-dependent: bounded joint treewidth, correlations allowed | descriptor at plan time, d-DNNF compiled at execution, linear evaluation | | compilation / sieve / possible-worlds | small circuits, regardless of safety | exponential in circuit parameters | The missing class is "safe but inversion-needed". The canonical witness, due to Dalvi-Suciu and named QW in UAI 2014 (q9 in Monet & Olteanu), is built from the four hard-boundary CQs over relations `R(·)`, `S1(·,·)`, `S2(·,·)`, `S3(·,·)`, `T(·)`: h0 = ∃x∃y R(x) ∧ S1(x,y) h2 = ∃x∃y S2(x,y) ∧ S3(x,y) h1 = ∃x∃y S1(x,y) ∧ S2(x,y) h3 = ∃x∃y S3(x,y) ∧ T(y) q9 = (h2 ∨ h3) ∧ (h0 ∨ h3) ∧ (h1 ∨ h3) ∧ (h0 ∨ h1 ∨ h2) (as a UCQ: distribute into a union of 8 CQs, of which 5 survive minimisation). Its C-lattice has 9 elements; writing each element as the disjunction of the `hi` it contains, the Möbius values µ(u, 1̂) are: | element λ(u) | µ(u, 1̂) | |---|---| | q9 itself (top) | 1 | | h2∨h3, h0∨h3, h1∨h3 (co-atoms) | -1 each | | h0∨h2∨h3, h1∨h2∨h3, h0∨h1∨h3 | +1 each | | h0∨h1∨h2 | -1 | | h0∨h1∨h2∨h3 = H3 (bottom) | **0** | so P(q9) = P(h2∨h3) + P(h0∨h3) + P(h1∨h3) - P(h0∨h2∨h3) - P(h1∨h2∨h3) - P(h0∨h1∨h3) + P(h0∨h1∨h2) Every right-hand term is a safe disjunctive sentence (a separator exists, recursively); the bottom H3 -- which is #P-hard -- has µ = 0 and never needs to be evaluated. Naive inclusion-exclusion would hit H3; lifted conditioning provably gets stuck on this lattice (PODS'10 §7); the Möbius grouping is what makes the query PTIME. No ProvSQL route handles q9 in PTIME today: the safe-query rewriter is per-CQ and hierarchical, the query is not inversion-free (that is the point), and on adversarial data the joint treewidth is unbounded (q9 also has no poly-size OBDD/FBDD/dec-DNNF -- Beame, Li, Roy & Suciu -- so generic compilation provably cannot stay polynomial). No published system implements this step either: SlimShot's rule table (PVLDB 2016) only has plain binary inclusion-exclusion and would fail on QW; Monet & Olteanu computed lattices and Möbius values for millions of queries, but offline, in Python. ProvSQL would be the first system with the complete safe-UCQ algorithm integrated in a query planner. ## Out of scope - **Ranking / shattering** (Dalvi-Suciu §5: rewriting relations by attribute order types so that the homomorphism criterion and separator existence become exact). Required for within-disjunct self-joins, repeated variables inside an atom, and constants in full generality; it multiplies relations by order types and is the main practical annoyance of the published algorithm. v1 restricts to *reduced-form* UCQs (see Plan, gate G5); ranking is Increment 3 and may never be needed by real workloads. Constants alone can come earlier via the constant-selection shattering idea already in `safe_query.c`. - **Correlated / BID inputs.** Lifted inference is only sound under tuple independence; correlated tokens stay with the joint-width route or fall through to the generic pipeline. Not a restriction to lift later -- it is a soundness boundary. - **Monet's intensional d-D construction** (arXiv:1912.11864): still deferred, same cost-benefit as recorded in [`safe-query-followups.md`](safe-query-followups.md). The present route makes it even less urgent, since the extensional algorithm covers all safe UCQs (of the supported reduced form), not just the fragment that paper handles. - **Extensional SQL plans** (SlimShot-style: generate per-element safe-plan SQL with `exp(sum(log(1-p)))` aggregates and combine in SQL arithmetic). Rejected as the artifact: it bypasses the token/circuit contract (`probability()` operates on a token, and the token must remain a usable Boolean provenance for every other surface), duplicates probability plumbing into SQL, and has no precedent in the codebase, whereas the compile-at-execution certified-circuit design has the entire ucq_joint machinery as a template. ## Plan ### 1. Shared UCQ extraction `provsql_joint_width_descriptor` (`src/joint_width_query.c`) already recognises exactly the input shape this route needs: UNION-of-SFW or single SFW under `provsql.boolean_provenance`, normalised to disjuncts → atoms → variable positions, with per-answer heads canonically numbered. Factor the recognition/normalisation part out into a shared module (precedent: the `qc_` qual-classification split out of `safe_query.c`), consumed by both the joint-width descriptor builder and the new Möbius analyser. One UCQ representation, two back-ends. ### 2. Query-side analysis (`src/mobius_query.c`) All steps below are query-complexity only (the query is tiny; the data is never touched). Caps where a step is exponential in the query, with an `elog(DEBUG1)` when a cap fires -- no silent truncation. 1. **Components.** Split each disjunct into connected components γ of its shared-variable graph (a disjunct like `R(x), T(y)` contributes two components). 2. **DNF → CNF.** Distribute components: Φ = ⋀_f ⋁_i γ_{i,f(i)}. Exponential in the number of disjuncts; cap at M ≤ 8 CNF conjuncts (Monet & Olteanu handled millions of 4-clause lattices, so this is generous). 3. **C-lattice construction.** Elements are ϕ'_s = ⋁_{i∈s} ϕ'_i for s ⊆ [M], **collapsed up to logical equivalence**; meet = subset union. Equivalence and the simplification of each element are CQ-containment tests (homomorphism search; NP in query size, trivial at these sizes). The collapsing step is what creates the zero-Möbius cancellations -- it is the whole game. Prune to co-atomic elements: only meets of co-atoms can have µ(u, 1̂) ≠ 0 (PODS'10 Prop. 3.4). 4. **Möbius values** µ(u, 1̂) by the standard incidence-algebra recursion, top-down (µ(1̂,1̂) = 1; µ(u,1̂) = -Σ_{u