# Probability evaluation: remaining work (d-trees, HAVING trichotomy, catalog) Open work only. The built machinery – the method catalog + three-path chooser, the d-tree engine (generalised to arbitrary `AND`/`OR`/`NOT`/`IN` circuits, under a speculative subproblem budget that also covers `tree-decomposition`), the HAVING marginal-vector evaluators, and the apx-safe `SUM`/`AVG`/`MIN`/`MAX` sampling route – is described in `doc/source/dev/probability-evaluation.rst`. Anchor papers (all in `website/_bibliography/references.bib`): Olteanu, Huang & Koch (d-tree, ICDE 2010); Ré & Suciu (HAVING trichotomy, VLDB J. 2009); Souihli & Senellart (ProApproX portfolio, ICDE 2013). The three user-facing paths are tolerance grants – **exact ⊂ relative ⊂ additive** – not algorithm picks; the chooser returns the cheapest *admissible* method. Every method declares `guaranteeKind` / `isDeterministic` / `applicable` / `estimatedCost`, so new work below must fit that frame. ## 1. d-tree: closed The anytime engine handles arbitrary Boolean circuits (exact + certified bounds) under a per-method subproblem budget. The BID / multivalued gap is fixed (the `ProbabilityMethod::handlesMultivalued()` central rewrite). The two remaining polish ideas were evaluated and **dropped** -- recorded here so they are not re-attempted: - **Non-DNF exact auto-selection.** To join the exact auto-chain the general recursion needs a cost estimate, but the d-tree's cost has no usable proxy (the code's own comment: "the treewidth proxy is NOT used -- it mispredicts this engine"). Leaving it by-name-only for exact is the honest choice. - **Memoise the approximate path** (cache exact sub-results during an approximate recursion). *Tried.* It speeds d-tree itself ~20-30%, but on `big_rare` it makes the auto `relative` path **11x slower** (681 ms -> 7.9 s): by letting d-tree converge within its step budget it suppresses the beneficial escalation to `compilation:d4`, and keeping that escalation only where d4 actually wins requires the same (unsolvable) cost prediction as the item above. Net regression; reverted. ## 2. The SUM-safe rounding FPTRAS (`AggFptrasMethod`) The apx-safe `SUM` / `AVG` / `MIN` / `MAX` corner is already covered *approximately* by direct world-sampling (the DKLR stopping rule over the surviving `gate_agg`), an FPRAS when `p ≥ 1/poly`. What remains is the **rounding-based rejection FPTRAS** (Thm 9 / Alg 6.3.1) proper, whose value over that stopping rule is **rare-event efficiency** – a sample count that does not blow up as `p → 0`, via the rounded proposal's bounded acceptance ratio – plus the **MIN/MAX hard direction** on a safe skeleton. Deciding skeleton safety here would re-use the rewriter's hierarchical analysis (`find_hierarchical_root_atoms`, `src/safe_query.c`); an unsafe skeleton is hazardous (no FPTRAS, warn and fall back to additive MC). **ProvSQL is now well-positioned for this.** The marginal-vector engine (`AggMarginalEvaluator.cpp`) *is* the forward pass of Alg 5.2.1, and the world generator is literally that recursion run in reverse (piece 2 below); the hardest descent node – the Cartesian-product `⊗` split of a value-carrying branch-spanning SUM – needs the per-factor **joint `(sum, count)` distribution**, which is exactly the `sumCountPMF` built for the branch-spanning work. So the one prerequisite that used to be missing is now in place. **But the gap it closes is narrow.** The rounding FPTRAS only beats the existing DKLR sampler on *safe-skeleton SUM with large-magnitude incommensurate values* (so the exact pseudo-poly path bails) **and** a *small target probability* (so DKLR's `Θ(1/p)` sample count is prohibitive) – the intersection of two corners each already handled. It is the sole place ProvSQL is provably weaker than the paper, but rarely the bottleneck in practice. Recommendation: keep deferred; implement only when that rare-event SUM corner actually bites. Thm 9 / Alg 6.3.1, three pieces: 1. *Rounding.* `τ^R(y) = ⌊(n²/k)·y⌋`, rounded-sum semiring `S_{n²+1}` (size `n²+1`, polynomial). Compute the rounded-sum PMF **exactly** with the marginal-vector recursion already built (`sumPMF` / `countPMF`, well under `kMaxSumSupport`), driven by the query-level safe plan from `find_hierarchical_root_atoms`. 2. *Random-world generator* (Alg 5.2.1, the one genuinely new subroutine): `sampleWorldWithValue(node, target_s)` walks the laminar recursion tree **top-down**, splitting `target_s` among children ∝ marginal-vector entries (`⊕`: pick `s₁+s₂=s` w.p. `m^φ₁[s₁]·m^φ₂[s₂]/m^φ[s]`; `⊗`: `s₁·s₂=s`; `⊔`: route all to one branch). **This is the bottom-up `countPMF` / `sumPMF` / `sumCountPMF` / `decomposeProduct` recursion run in reverse** (same tree, same vectors, the draw descends instead of folding up). Two concrete hooks: (i) the forward pass currently *discards* the per-node marginals – they must be retained as a lightweight `(combinator, child-PMFs, leaves)` tree (`O(circuit)`) or recomputed on the way down; (ii) the `⊗` split of a branch-spanning value draws against the per-factor joint from `sumCountPMF`. Alg 5.2.2 fills off-plan tuples (maps onto the existing BID / `mulinput` block handling). 3. *Accept-test* (Lemma 7): draw a rounded value `∝ μ(rounded)`, draw a world, accept iff it is an *original* solution; `m = O(n·β⁻¹·ε⁻²·log δ⁻¹)` samples → relative `(ε,δ)`. New code `src/AggFptras.{h,cpp}`, consuming `AggMarginalEvaluator` internals (needs them exposed) + `safe_query.c`. The `(α,θ)` × skeleton admissibility map: | `(α, θ)` | safe skeleton | unsafe skeleton | |---|---|---| | `MIN ≤/<`, `MAX ≥/>` | FPTRAS (karp-luby) | **FPTRAS (karp-luby)** – gate-independent (`MAX≥k ⟺ ∃ present tuple, y≥k`, a UCQ; Thm 8) | | `MIN ≥/>`, `MAX ≤/<` | FPTRAS (safe-plan) | hazardous | | `SUM <,≤,≥,>` | FPTRAS (rounding+sampling) | hazardous | | `SUM =,≠`, `AVG =,≠` | hazardous | hazardous | | `COUNT` (any) | exact (dichotomy) | #P-hard; relative-MC if `p ≥ 1/poly` | **Research-grade; high risk of subtle statistical bias.** Deferred for collaborative implementation / review rather than autonomous coding. ## 3. HAVING marginal-vector engine: exact residual shapes Approximate coverage of these exists (item 2's sampling); the open work is the remaining **exact** (PTIME) coverage. The laminar / cross-product engine covers COUNT / SUM / MIN / MAX / AVG at arbitrary hierarchical depth; the residuals: - **Branch-spanning SUM** – both separable shapes are now exact in `src/AggMarginalEvaluator.cpp`: *additively separable* (`sum(b+c)`, `sum(2b-c+1)`) folds the per-factor joint `(sum, count)` distributions (`sumCountPMF`, `recoverAdditiveSeparation`); *multiplicatively separable* (`sum(b*c)`) is the product of per-factor weighted sums (`mulSeparableSumPMF`, a pivot identity that avoids explicit factorisation). Remaining: genuinely coupled values that are neither (`sum(b*c+b+c)`, a rank-≥2 weight tensor; may be `#P`-hard, self-gates back to enumeration today). - **UNION/EXCEPT over a join that re-uses a base tuple** – `(R⋈S) UNION (R⋈T)` → `(r∧s)⊕(r∧t)`, non-read-once on the shared `r`. The *independent* case (each contributor's footprint private – the usual one) is now exact: `contributorExactMarginal` (`src/AggMarginalEvaluator.cpp`) computes the contributor's exact marginal by brute force over its private leaves and models it as an independent one-alternative block (reusing the BID categorical machinery), for every aggregate; pinned by `test/sql/having_union.sql`. Remaining: a base tuple shared *across* a group's contributors, which couples them – the safe-query / read-once-rewriter problem, `#P`-hard in general. ## 4. Method-catalog follow-ups - **Lazy Boolean build.** RV / HAVING circuits with no Boolean view fall to a small top-level estimator outside the catalog (the surviving-aggregate sampler routing of item 2 is a partial step). A true lazy Boolean build would fold even those into `chooseAndRun` so all three paths are catalog-driven uniformly. - **Guarantee propagation** (decompose the whole-query `(ε,δ)` at **independence-certified gates only**): independent-OR `ε ≈ max(ε₁,ε₂)`, mutex-OR `max`, independent-AND `ε₁`; `1−δ = (1−δ₁)(1−δ₂)`. A `plus` whose children **share leaves** is not independent-OR, so the unit is a *maximal correlated component* (correlated cmps stay together and are sampled jointly; independent structure composes exactly). An exactly-cleared component contributes `ε=0`, loosening the budget for approximate siblings (the per-part payoff). **`monus` propagation is not in the paper** – derive the `Pr(monus(a,b)) = Pr(a)(1−Pr(b))` ε-propagation and its unsound-under-sharing caveat. This is also the principled fix for the `karp-luby` `S·m` cost being pessimistic for large `m` (the calibrated per-node ε-split). - **CircuitFeatures tier-2** – the treewidth proxy exists (`tw_proxy_`); independence certificates remain to be cached lazily. Optional: expose a `provsql.methods` SQL view (name, guarantee kind, applicability) for Studio. ## 5. RV probability transparency - Route the RV *probability* case `P(X