# Safe-Query Follow-Ups Open items surfaced during the safe-query / `provsql.boolean_provenance` discussion that are deferred but worth coming back to. The work that has already landed is described in the *Safe-Query Rewriter* section of `doc/source/dev/query-rewriting.rst`. This file is organised by **priority**: Tier 2 (reusable prerequisites and certificates) and Tier 3 (workload-gated). The HAVING-clause optimisation layers (Tier 1) and the base FD-aware rewriter extensions have all landed; what remains there is a single planner-side certificate, folded into Tier 2 below. ## Tier 2 – reusable prerequisites and certificates - **Inversion-free `UCQ(OBDD)` extensions.** The inversion-free path is implemented (the `UCQ(OBDD)` rung of Jha & Suciu, ICDT 2011: a detector plus an in-process structured-d-DNNF builder over the Prop. 4.5 query-derived order, with non-integer key columns, deterministic-relation filters, and SPJ / nested view flattening); see the *Inversion-Free UCQ(OBDD) Path* section of :cfile:`doc/source/dev/probability-evaluation.rst`. Three extensions to that class remain open, in increasing difficulty: - **UNION inside a view** (the inlined subquery is a `setOperations` UCQ): flatten per branch with the order markers fanned across branches. The class is defined for UCQs, so this is a flattening extension, not a core change; it declines gracefully today. *Sensible next slice.* - **Functional dependencies do not fit this builder.** The canonical FD-tractable queries (the H-query `R(x),S(x,y),T(y)` under a PK on `S`, the two-PK triangle) have no single root class, which the detector and the single global Prop. 4.5 order both require; the read-once path handles them via per-atom anchors precisely because no single root exists, and they are anyway safe-but-not-poly-OBDD (read-once territory). *Out of reach for a surface tweak to the precedence and positional checks; needs the per-branch builder below.* - **Per-branch decision orders (FBDD / Decision-DNNF).** The builder is OBDD-class: one global order. The `h_k` queries (Beame, Li, Roy & Suciu) have inversions, so no poly OBDD, yet admit poly FBDD / d-DNNF with different orders on different branches. Choosing the decision variable per branch would capture strictly more, and subsume the FD case above. *Research extension, not a surface one.* - **Planner-side safe-aggregate-plan certificate (`CERT_SAFE_AGG_PLAN`).** The probability-side HAVING engine certifies the independence of `gate_agg` children *circuit-only*: `runAggMarginalEvaluator` (`src/AggMarginalEvaluator.{h,cpp}`) recurses through any hierarchical (laminar) join self-gating on whether each multi-member block has a leaf common to all its members. What remains is the planner certificate for shapes the circuit *cannot* self-certify -- notably BID disjoint-block (`⊥`) structure, where the mutual exclusivity is not visible from leaf sharing alone. The certificate would be stamped on the per-row provenance root (the same carrier as `CERT_INVERSION_FREE` in `src/safe_query_cert.{c,h}`) and read back by the evaluator; the enum in `safe_query_cert.h` has no `CERT_SAFE_AGG_PLAN` value yet. ## Tier 3 – workload-gated These widen coverage or reach harder query classes, but each is explicitly deferred until a real workload motivates it. - **Discrete `random_variable` extensions.** `random_variable` today carries continuous distributions (Normal, Uniform, Exponential, Erlang) plus Categorical and Mixture, and the HAVING evaluators assume the `m_i` extracted from `gate_semimod` is a deterministic integer. Extending `gate_rv` (and the surrounding analytic evaluators) to discrete distributions (Poisson, Binomial, Geometric, Multinomial) opens a parallel optimisation track for HAVING when the *aggregated value* itself is uncertain: - Sum of independent Poissons is Poisson with summed rate; closed-form CDF on the surrounding `gate_cmp`. Same shape as the existing Normal-sum closed form in `Expectation.cpp`. - COUNT(*) over Bernoulli presence-tokens is Poisson-binomial; same machinery as the landed `CountCmpEvaluator`, now generic to "the count of present rows is itself a random variable" rather than Boolean-specific. - General convolution of independent discrete RVs: O(C × N) DP for the surrounding cmp. **Integration point.** An analogue of `runAnalyticEvaluator` (`src/AnalyticEvaluator.{h,cpp}`) that walks `gate_cmp(gate_agg( gate_arith over gate_rv children), gate_value)`, computes the analytical CDF, and replaces the cmp with a Bernoulli `gate_input` leaf. Probability-specific (the `gate_input` carries a numeric probability, semiringly meaningless to symbolic semirings); lives in `probability_evaluate.cpp` before `getBooleanCircuit`, the same slot as the existing analytic / hybrid passes. **Architectural fit.** The extension stays inside the existing intensional pipeline: new distribution variants in the `gate_rv` `extra` blob, new closed-form CDF entries in the analytic evaluator, new structural reductions in the HAVING pre-pass. No new gate type; no change to the circuit framework itself. The MC fallback in `MonteCarloSampler` already handles cases where the closed form does not apply. - **Möbius / inclusion-exclusion via Monet 2019's construction.** Beyond hierarchical CQs, the Dalvi-Suciu dichotomy puts many non-hierarchical UCQs in PTIME, with the canonical algorithm being lattice-walking inclusion-exclusion which is naturally extensional. Monet (arXiv:1912.11864) closes the architectural gap by giving a PTIME construction of a deterministic-decomposable circuit, using negation in place of IE; the output drops straight into our intensional pipeline (`gate_monus` already means `AND(a, NOT(b))` under Boolean, so no new gate type required). *Deferred; the architectural mismatch is no longer the blocker but practicality is open. The construction is PTIME but the polynomial degree depends on query structure, not linear in database size; the emitted circuit may be much larger than the natural lineage; the safety detector is meaningfully more expensive than the hierarchical one; the paper covers a strict subset of safe non-hierarchical UCQs (the inversion-free fragment); and no implementation has appeared in the six years since the paper, so there is no empirical evidence for the practical regime. The honest cost-benefit is that for any safe non-hierarchical workload we encounter today, handing the natural lineage to d4 is likely as fast or faster than constructing-then-evaluating Monet's circuit. Revisit only if a benchmark surfaces where d4 visibly chokes on a safe non-hierarchical UCQ and Monet's construction plausibly helps.* ### Hierarchical-detector follow-ups (rewriter coverage) The FD-aware safe-query rewriter currently lands six extensions on top of the base hierarchical-CQ detector: constant-selection elimination, PK / NOT-NULL UNIQUE FDs from the catalog, deterministic-relation transparency, PK-unifiable self-joins, FD closure on the union-find (detector-only), and disjoint-constant self-joins. See :cfile:`src/safe_query.c` and the dev-doc *Safe-Query Rewriter* section for the implementation. The items below were surfaced during that work and deliberately deferred for a future slice. - **FD-induced nested rewrite (function/free split).** The current FD closure accepts every query whose FD-reduced atom-sets are pairwise nested-or-disjoint *and* whose existing single-level wrap is already read-once. When the FD closure splits the atoms into a *function layer* (PK-determined relations) and a *free layer* (no FD on them) and the function layer indexes into the free layer, the single-level wrap duplicates each function-layer atom's `provsql` across the per-row `gate_times`, breaking the read-once invariant. The canonical motivating case is the composition of constant-selection elimination with a PK on the same relation: a constant pinning combined with a PK that determines another column collapses an atom that neither pass handles in isolation. The nested rewrite would emit function-layer atoms as independent `gate_plus` subqueries Cartesian-joined with the free-layer at the top, factoring each function-layer token out as a single OR'd input. *Deferred per the cost/payoff ordering: the triangle-with-two-PKs pattern is real but not common, the implementation cost is meaningful (250+ lines of rewriter surgery, careful interaction-testing with multi-component and inner-group paths), and the existing FD closure already lands the most common shapes through the constant-selection pass's multi-component routing for pinned atoms. Revisit only after a real workload surfaces a triangle-with-two-PKs pattern that the existing rewrite cannot handle.* - **Self-joins without PK or constant rescue.** Symmetric closure `R(x, y), R(y, x)`, path-of-length-2 `R(x, y), R(y, z)`, and similar shapes remain hard. Neither the PK-unification pass nor the disjoint-constant certification fires: there is no key whose columns are equated pairwise, and the constant predicates either don't exist or aren't pairwise disjoint. (These shapes all carry an inversion, hence are not even in `UCQ(OBDD)`; the inversion-free self-joins (consistent unification, e.g. `S(x,y),A(x,y),S(x,z),B(x,z)`) are the easier subset already handled by the implemented inversion-free path, see *Inversion-free `UCQ(OBDD)` extensions* in Tier 2.) Resolving the read-once question for these shapes requires either Monet-style intensional dDNNF construction (see the *Möbius / Monet* entry above) or proper handling of the full Dalvi & Suciu 2012 JACM dichotomy for UCQs (which subsumes self-joins implicitly via the lattice-of-valuations criterion). *Deferred: both directions are larger projects than the FD-aware extensions combined, and the workloads where the symmetric-closure / path shapes show up tend to be recursive-query territory that ProvSQL's CQ-only rewriter already excludes.* - **Soft keys.** Functional dependencies that hold probabilistically rather than absolutely, per Jha, Rastogi & Suciu (PODS 2008). Separate axis from schema FDs: the rewrite would have to weight each FD's consequent by its observed reliability rather than assume it holds in every world. *Deferred; revisit if a real workload makes the case.* - **FD chases through views and CTAS-derived relations.** The PK-FD pass is currently restricted to FROM lists of base relations. Extending it through views and `CREATE TABLE AS` -derived tables was blocked on the TID/BID-propagation work that landed in the `safe_queries` branch (view descent in the safe-query rewriter, CTAS lineage hook, ancestry-based disjointness gate, independent-TID join inference, BID block-key preservation under projection / GROUP BY). The propagation prerequisites are now in place; the remaining work is to teach the PK-FD pass itself to follow the lineage rather than stopping at view / CTAS-derived RTEs. *Still deferred; the CTAS-correlation trap on deterministic atoms (`CREATE TABLE foo AS SELECT * FROM ` without `add_provenance`) is the most visible symptom, but in practice it's covered by the CTAS hook seeding ancestry on lineage-bearing CTAS now.* - **Data-safe plans.** FDs that hold on the *instance* but not in the schema, per Jha, Olteanu & Suciu (EDBT 2010). Larger project; the rewrite would need to certify the FD against the actual data before trusting it. Separate from the schema-FD-aware analysis the FD closure currently implements. *Deferred; revisit only with a clearly-motivated use case.* ### TID / BID propagation follow-ups The TID / BID propagation roadmap shipped in full on the `safe_queries` branch with one exception, deliberately deferred for the tradeoff described below. - **`UNION ALL` of compatible BID legs (was "Slice E").** The classifier reports OPAQUE on a `UNION ALL` whose legs are BID under the same block-key column at the same target-list position, and the CTAS lineage hook propagates that as no metadata recorded for the derived relation. This is correct (no false TID / BID classifications), just a missed optimisation. *Why not implemented.* A row from leg A and a row from leg B with the same block-key value are independent (different source relations), not mutually exclusive, so the union is not BID under the natural key. Recovering BID-ness would require synthesising a composite block-key column `(leg_id, k)` and surfacing it on the derived relation -- which changes the user-visible schema in a non-obvious way (the user wrote `CREATE TABLE t AS SELECT k FROM bid_a UNION ALL SELECT k FROM bid_b` and would end up with a table whose schema has a column they didn't ask for). The tradeoff between automatic BID coverage and surprising-column injection didn't favour the synthesis path. *Two future paths.* 1. **Disjoint-range certification**: when the legs' block-key value ranges are provably disjoint (e.g., a leg with `WHERE k < 100` UNION ALL with another with `WHERE k >= 100`), the natural `k` column suffices as the block key and no synthesis is needed. The CTAS hook could be taught to recognise such patterns, possibly via the same `safe_is_var_const_equality`-style detector the disjoint-constant self-join pass in :cfile:`src/safe_query.c` uses. ~100-150 LOC, no schema impact. 2. **Opt-in synthesis via a GUC**: `provsql.synthesize_union_leg_id = off` by default; `on` adds `__provsql_leg_id` to derived tables that would otherwise miss the BID promotion. Documented as an advanced option; only fires when the user requests it. ~200-300 LOC including parse-tree surgery for the new column and the composite block-key recording on the worker side.