# Conditioning Conditioning – restricting a probabilistic object to the worlds where some event holds – unifies discrete tuple-correlation (in the style of MarkoViews) and continuous random variables as *one* primitive at two carriers: a probability `P(Q | C)` and a distribution `X | C`. The conditioning event is, in both, a Boolean provenance circuit `C` over base variables; correlation between the conditioned object and the evidence is handled automatically by content-addressing (a shared base tuple is the same input gate in both circuits, so `P(X∧C)` over `times(X, C)` is the true joint). The design rationale, the MarkoViews reduction it rests on (`P(Q | C) = P(Q ∧ C) / P(C)`, two unconditional evaluations plus a division), and the surface that shipped are recorded in the git history, the user manual (`doc/source/user/conditioning.rst`), and case study 8 (`doc/source/user/casestudy8.rst`). This file keeps only the open work. Anchored on: - MarkoViews: Jha & Suciu, *Probabilistic Databases with MarkoViews*, PVLDB 5(11), 2012. - Koch & Olteanu, *Conditioning Probabilistic Databases*, PVLDB 1(1), 2008 – the MayBMS assertion operation, the closest existing precedent. - The base-independence backbone in [`continuous_distributions.md`](continuous_distributions.md). ## Out of scope - Correlation primitives (`gate_copula`, `gate_mvnormal`): see [`continuous_distributions.md`](continuous_distributions.md) §D.2, §A.5. Conditioning *creates* correlation as a side effect; it is not a substitute for an explicit joint-distribution primitive. - Causal interventions (`do`-calculus): a different operator (`continuous_distributions.md` §D.4). Conditioning is observation (`see`), not intervention (`do`). ## Open work ### Arbitrary denial constraints, beyond keys `repair_key` covers key FDs only (a table becomes mutually-exclusive renormalised blocks, after which any `probability_evaluate` is implicitly conditioned on the FD holding). A general denial constraint – "no two overlapping bookings", "an advisor must have been faculty" – is conditioning on an arbitrary no-violation event, the full MarkoViews `¬W`. **The expressivity is shipped.** The violation event `W` is an ordinary query (`SELECT provenance() FROM GROUP BY ()`, building `⋁ (a ∧ b)` with no manual gates), and the prefix `!` operator (`provenance_not`) negates it, so `Q | !W` conditions a query on the constraint holding – the MarkoViews `P(Q | ¬W)` reduction, demonstrated end to end in case study 8. The one residual is an *optimisation*, not an expressivity gap: for the `boolean_provenance` linear path to apply to a conditioned-on-constraint query, the safety / inversion-freedom analysis has to re-run on the constraint-augmented circuit (a cheap root `plus` joining `Q`'s circuit and `W`'s), not the bare query, since safety is not invariant under the augmentation. Workload-gated; the probability is already computed correctly by the general methods today. ### Materialised, re-based discrete posterior (MayBMS-style) A stored conditioned `uuid` token works for the store / re-condition / evaluate cycle (it is terminal, so re-conditioning folds and `probability_evaluate` reads it directly). The **one** thing it cannot do is re-enter relational algebra: joining or unioning a conditioned tuple raises rather than burying the global `P(C)` normalization under further algebra. Making a stored posterior re-composable needs a *re-based* posterior – fold the evidence in and treat the result as a fresh independent representation, the MayBMS assertion operation. Deferred until a persisted posterior must actually be composed onward. ### Shapley over evidence (research track) `shapley` / `banzhaf` on a conditioned token are currently *refused* with a clear error (`src/shapley.cpp`: Shapley / Banzhaf are linear over the semiring, conditioning is not). "Which observation most shifted my posterior moment?" is connecting code over a `gate_conditioned` root plus the existing Shapley machinery and is unique to a provenance-aware system, but the conditional value *definition* is the research question to settle first. (Borders `continuous_distributions.md` §E.1.) ### Soft / weighted conditioning (explicitly not a priority) Hard conditioning only, by decision. A finite-weight MarkoView / likelihood-weighted gate `gate_conditioned(X, event, weight)` reweights worlds rather than restricting them (the move from rejection sampling to importance / likelihood weighting: makes rare-evidence conditioning tractable instead of fragile, and gives soft evidence – "this observation is 90% reliable" – a first-class form). Hard conditioning is the `weight → ∞` (indicator-likelihood) limit. It shares the gate type and the evaluator hooks, so it would land as a parameter on the same gate, not a new mechanism. When it lands, weights leaving `[0, 1]` make the synthetic intermediate probabilities negative, so the `probability_evaluate` dispatcher must route such inputs **exact-only** and refuse `monte_carlo` / `independent` (the discrete twin of the continuous rare-evidence rejection-sampling fragility). ## Priorities 1. **Arbitrary denial constraints beyond keys.** The one open item with a concrete surface (a `condition(token, no_violation_event)` helper building `¬W`, with the safety analysis re-run on the augmented circuit). Workload-gated. 2. **Materialised re-based discrete posterior (MayBMS-style).** Deferred until a stored posterior must re-enter join / union. 3. **Shapley over evidence.** Research track; connecting code over the shipped surface once the conditional Shapley value is defined. 4. **Soft / weighted conditioning.** Explicitly not a priority; a parameter on the existing gate (with the exact-only dispatcher guard) when a use case makes the case. ## Implementation observations - The discrete *evaluation* needed no new machinery: `P(Q | C) = P(Q ∧ C) / P(C)` is two existing evaluations and a division. The `gate_conditioned` gate exists for the `uuid` carrier not to evaluate a one-shot `P(Q|C)` but so the operator's result is a storable, re-conditionable token; it is the **terminal** kind, non-composable with the semiring gates. The thing to resist is a *composable* discrete conditioning gate that would bury the global `P(C)` normalization under `plus` / `times`. - Conditioning is the canonical circuit operation that *defeats* independence shortcuts: any optimisation keyed on `FootprintCache` disjointness must treat `cond(X, A)` as widening the footprint to `footprint(X) ∪ footprint(A)`. The regression test belongs with the feature (and shipped with it), not after it. This is the same coupling MarkoViews exploits deliberately. - Negative / out-of-`[0,1]` input probabilities (soft conditioning, or a MarkoViews-style augmented circuit) are sound for the exact methods and only for them; the dispatcher must refuse sampling-based methods on such circuits. Hard conditioning on `[0,1]` evidence needs no such guard – every exact method is correct and the result lands in `[0,1]`. - Safety / inversion-freedom must be re-evaluated on the conditioned (or constraint-augmented) circuit, never on the bare query: the augmentation is a cheap root `plus`, but the safe-query analysis is not invariant under it.