# Conditioning

Conditioning – restricting a probabilistic object to the worlds where some event holds – recurs in two ProvSQL settings that look unrelated at first: discrete tuple-correlation in the style of MarkoViews, and continuous random variables. This plan argues they are *one* primitive at two carriers, extracts the conditioning-as-a-gate proposal that previously lived in [`continuous_distributions.md`](continuous_distributions.md) §D.1, folds in what the MarkoViews framework teaches, and grounds the whole thing in concrete use cases.

Anchored on:

- MarkoViews: Jha & Suciu, *Probabilistic Databases with MarkoViews*, PVLDB 5(11), 2012 (`https://vldb.org/pvldb/vol5/p1160_abhayjha_vldb2012.pdf`).
- Koch & Olteanu, *Conditioning Probabilistic Databases*, PVLDB 1(1), 2008 – the MayBMS assertion operation, the closest existing precedent for conditioning a relational PDB on observed evidence.
- The base-independence backbone in [`continuous_distributions.md`](continuous_distributions.md) (§Theoretical backbone) and its §D.1 conditioning-gate sketch, which this document supersedes and expands.

## Out of scope

- Correlation primitives (`gate_copula`, `gate_mvnormal`): see [`continuous_distributions.md`](continuous_distributions.md) §D.2, §A.5. Conditioning *creates* correlation as a side effect (§4 below); it is not a substitute for an explicit joint-distribution primitive.
- Causal interventions (`do`-calculus): a different operator (`continuous_distributions.md` §D.4). Conditioning is observation (`see`), not intervention (`do`); keeping them distinct is the point of that entry.

## 1. One primitive, two carriers

The conditioning *event* is, in both settings, a Boolean provenance circuit `C` over base variables. What differs is the object being conditioned:

- **Probability carrier (discrete).** `P(Q | C)`: a real number. The answer probability of a query, restricted to worlds where `C` holds. This is the MarkoViews / MayBMS setting.
- **Distribution carrier (continuous).** `X | C`: a `random_variable`. The distribution of an RV restricted to the worlds where `C` holds, which must itself flow into later arithmetic, comparators, and aggregates.

Both rest on the same architectural backbone (`continuous_distributions.md` §Theoretical backbone): base events are independent (tuple-existence Bernoullis discretely; base `gate_rv` draws continuously), and correlation enters only through operations that share base variables. Conditioning is exactly such an operation: it couples everything in the footprint of `C`. The discrete and continuous cases are then the same construction reading off two different carriers.

## 2. MarkoViews: the discrete precedent

An MVDB is `(Tup, w, V)`: possible tuples with weights, plus **MarkoViews** – UCQ views `V(x̄)[w] :- Q` that attach a weight to every output tuple, read as odds. `w > 1` is positive correlation among the contributing tuples, `w < 1` negative, `w = 1` independence, `w = 0` a hard constraint (the view must be empty). It is a Markov Logic Network whose features are grounded UCQ atoms, with `P(world) = Φ/Z`.

The paper's whole content is **one reduction** (their Theorem 1): query evaluation on an MVDB collapses to ordinary *tuple-independent* evaluation plus a conditioning step.

- Each MarkoView `V_i` gets a fresh independent relation `NV_i` ("view `i` violated") with tuple probability `1 - w` (weight `(1-w)/w`).
- `W = ⋁_i (NV_i ∧ body_i)` is a query-independent "some constraint is violated" formula.
- Then

  ```
  P(Q) = ( P₀(Q ∨ W) − P₀(W) ) / ( 1 − P₀(W) )
  ```

  which is precisely the conditional `P₀(Q | ¬W)` on a tuple-independent database.

Read in ProvSQL terms, this is **conditioning the query's provenance on a constraint circuit `¬W`**. Four lessons fall out:

1. **The discrete carrier needs no new gate.** The exact formula is two *unconditional* evaluations of the existing `probability_evaluate` dispatcher plus arithmetic: `P(Q | C) = P(Q ∧ C) / P(C)`.
   The conditioning *gate* (§3) earns its keep only where the result must flow onward as an object – the continuous carrier, or a materialised discrete posterior (§3, materialised conditional tables).
2. **Conditioning is how you get correlation, and where independence shortcuts must back off.** MarkoViews' entire purpose is to make previously-independent tuples correlated by conditioning on a shared constraint. This is the discrete face of the `FootprintCache` caveat in §3: `cond(X, A)` has effective footprint `footprint(X) ∪ footprint(A)`.
3. **Exact methods only, when weights leave `[0, 1]`.** When `w > 1` the synthetic probability `1 - w` is *negative*; the intermediate INDB is not a real PDB. Every exact method (Shannon expansion, inclusion-exclusion, OBDD, tree decomposition, d-DNNF linear evaluation) is correct over negative numbers and the final `P(Q)` lands in `[0, 1]`, but sampling and bound-based approximation break. If ProvSQL ever admits such inputs, `probability_evaluate`'s dispatcher must route them exact-only and refuse `monte_carlo` / `independent`. This mirrors the continuous "rare-evidence rejection sampling is fragile" failure mode (§4).
4. **Tractability is a property of the conditioned circuit.** A query tractable on its own can become intractable once `W` is conjoined; MarkoViews is tractable iff *both* `Q ∨ W` and `W` are safe. For ProvSQL's inversion-free / `boolean_provenance` safe-query rewriter, the safety analysis must run on the constraint-augmented circuit, not the bare query. Structurally the augmentation is only a root `plus` joining `Q`'s circuit and `W`'s, and the lineage of `Q ∨ W` is no larger than the two combined, so the construction is cheap; the *analysis* is what must be re-run.

## 3. Conditioning as a gate (continuous carrier)

Extracted and lightly expanded from [`continuous_distributions.md`](continuous_distributions.md) §D.1.

A `gate_conditioned(rv_subcircuit, bool_subcircuit)` meaning "the distribution of `rv` restricted to the event where `bool` holds". Self-contained: it flows through any subsequent operation. Sampling is rejection (already the MC fallback's behaviour when `prov` is passed); analytical evaluation reuses the existing closed-form paths because they are already conditional internally.

**Pipeline placement.**

- **Simplifier** gets `cond(cond(X, A), B) → cond(X, A ∧ B)`, `cond(X, true) → X`, and `cond(X, A) → X` when `A` is independent of `X`'s footprint (the `FootprintCache` already gives you this). The first rule is the continuous twin of MarkoViews folding another view into `W`: evidence accumulates into one event rather than nesting.
- **RangeCheck** treats `cond(X, X ∈ [a, b])` as truncation. The current closed-form-truncated path becomes the *specialisation* of a general `gate_conditioned` rule rather than a parallel codepath; two near-parallel codepaths collapse into one.
- **AnalyticEvaluator** picks up conditional CDFs where they exist; for an event `A` independent of `X` the joint factors as `P(A) × (unconditional CDF)`, so dividing by `P(A)` returns the unconditional CDF, the continuous analog of MarkoViews' exact `P(Q ∧ C) / P(C)`.
- **Expectation** semiring: every dispatcher that already takes `prov uuid DEFAULT gate_one()` becomes the special case "no explicit conditioning gate at the root", unifying the conditioning argument and the conditioning gate.
- **FootprintCache** caveat: `cond(X, A)` has effective footprint `footprint(X) ∪ footprint(A)`. The structural-independence shortcut on `gate_arith TIMES` must back off accordingly. *This is the one soundness risk and should land with a regression test that constructs `cond(X, A) * cond(Y, A)` and confirms the shortcut does not fire.* It is the same coupling MarkoViews exploits deliberately (§2, lesson 2).

**New directions it opens.**

- **Materialised conditional tables.** Store `cond(rv, evidence)` in a regular `random_variable` column and drop the source tuples.
  Solves the "carrying both the distribution and its conditioning" problem, and is the continuous counterpart of the MayBMS assertion operation, which folds observed evidence back into the stored representation.
- **Sequential Bayesian updates.** Each piece of evidence is another `cond(..., new_event)` wrap; the `A ∧ B` fold avoids depth blow-up.
- **Truncation generalises.** It becomes the canonical degenerate case of conditioning, where the event is a comparator on the same RV.
- **Shapley over evidence** (`continuous_distributions.md` §E.1): with conditioning plus the existing Shapley machinery, "which observation most shifted my posterior moment?" is mostly connecting code.

**UI.** A `provsql.condition(rv, event_uuid)` function, plus optionally an infix `|` operator reading as "given":

```sql
-- Bayesian update with materialisation
UPDATE patient_risk
   SET risk = provsql.condition(
     risk,
     (SELECT provenance() FROM tests
       WHERE patient_id = 1 AND result = 'positive')
   )
 WHERE patient_id = 1;

-- Operator sugar; RangeCheck recognises a same-RV bool as truncation
SELECT expected(measurement | (measurement > 0.5))
  FROM sensor_readings;

-- Recursive Bayesian update over an evidence log
WITH RECURSIVE updates(step, dist) AS (
  SELECT 0, provsql.normal(0, 10)
  UNION ALL
  SELECT step + 1, provsql.condition(dist, e.evidence_token)
    FROM updates u, evidence_log e
   WHERE e.confirmed AND e.step_idx = u.step + 1
)
SELECT expected(dist), variance(dist)
  FROM updates
 WHERE step = (SELECT max(step) FROM updates);
```

## 4. Where the two carriers meet

A single, carrier-parametric `gate_conditioned` serves both: its result type follows its first child. When the first child is a `random_variable` subcircuit, the result is a distribution (§3); when it is a Boolean provenance root, the result is a probability and the gate is usually unnecessary, since `P(Q ∧ C) / P(C)` reuses the existing dispatcher (§2, lesson 1). The gate is mandatory only when a discrete posterior must be *materialised* and queried again later.
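
The "result type follows its first child" reading can be illustrated outside ProvSQL with a toy possible-worlds model. The sketch below is not ProvSQL code: `worlds`, `prob`, `cond_prob`, and `cond_dist` are hypothetical helper names, the base variables are assumed to be independent Bernoullis, and evaluation is brute-force enumeration rather than any real dispatcher. It only shows that the same restriction to the worlds where `C` holds yields a number on one carrier and a distribution on the other.

```python
from fractions import Fraction
from itertools import product

def worlds(probs):
    """Enumerate every assignment of independent Boolean base variables."""
    names = sorted(probs)
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))
        p = Fraction(1)
        for name in names:
            p *= probs[name] if world[name] else 1 - probs[name]
        yield world, p

def prob(event, probs):
    """Unconditional probability of a Boolean event."""
    return sum(p for world, p in worlds(probs) if event(world))

def cond_prob(q, c, probs):
    """Probability carrier: two unconditional evaluations and a division."""
    return prob(lambda w: q(w) and c(w), probs) / prob(c, probs)

def cond_dist(x, c, probs):
    """Distribution carrier: restrict to worlds where c holds, renormalise."""
    pc = prob(c, probs)
    dist = {}
    for world, p in worlds(probs):
        if c(world):
            dist[x(world)] = dist.get(x(world), Fraction(0)) + p / pc
    return dist

# Hypothetical two-tuple database: a present with 1/2, b with 1/3.
probs = {"a": Fraction(1, 2), "b": Fraction(1, 3)}
evidence = lambda w: w["a"] or w["b"]        # the event C
query = lambda w: w["a"]                     # Boolean first child
count = lambda w: int(w["a"]) + int(w["b"])  # value-carrying first child

p_q_given_c = cond_prob(query, evidence, probs)
count_given_c = cond_dist(count, evidence, probs)
```

Only `cond_dist` returns something that later operations could consume as an object, which matches the argument that the gate is needed on the distribution carrier and for materialised posteriors, while the probability carrier gets by with arithmetic.
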

Two properties are shared across carriers and worth stating once:

- **Conditioning manufactures correlation.** This is a feature discretely (it is the whole MarkoViews mechanism) and a hazard for the optimiser continuously (the `FootprintCache` back-off). Same phenomenon: the conditioning event couples everything in its footprint.
- **Conditioning is robust under exact evaluation, fragile under sampling.** Discretely, negative synthetic weights force exact-only (§2, lesson 3). Continuously, rejection sampling degrades as `P(event) → 0`. In both, prefer the closed-form / analytic path; treat the sampling path as a fallback that must warn on rare evidence.

## 5. Soft / weighted conditioning

The one direction MarkoViews points to beyond the §D.1 sketch. §3 models only **hard** conditioning: the event holds, the object is restricted. A MarkoView with finite weight `w ≠ 0` is **soft** conditioning: it reweights worlds rather than restricting them.

The continuous analog is `gate_conditioned(X, event, weight)` that *reweights* the distribution by a likelihood rather than truncating it, which is exactly the move from rejection sampling to **importance / likelihood weighting**. This makes rare-evidence conditioning tractable instead of fragile, and gives soft evidence ("this observation is 90% reliable") a first-class form. Hard conditioning is then the `weight → ∞` (or indicator-likelihood) limit, as in the MLN reading where `w = ∞` is a hard constraint.

Lower priority than hard conditioning, but it shares the gate type and the evaluator hooks, so it lands as a parameter rather than a new mechanism.

## 6. Concrete use cases

Phrased against ProvSQL's actual surface.
Conditioning is *half-built* today, which is the most useful thing to see before scoping the work: the moment / sample / histogram dispatchers already take a `prov` conditioning event, and `repair_key` already conditions on a key constraint, but conditional *probability*, a conditioned *distribution that flows onward*, and general constraints are missing. §6.A is what works now; §6.B is the gap the primitive closes.

### 6.A Already expressible today

#### 6.A.1 Conditional moments, samples, and histograms of a probabilistic scalar

The `prov` argument on `expected` / `variance` / `moment` / `central_moment` / `support` / `rv_sample` / `rv_histogram` *is* a conditioning event: they compute `E[X^k | prov]` for both a `random_variable` and an `agg_token` (the C dispatcher documents exactly this). So conditional expectation, variance, sampling, and histograms of one scalar already work, including the canonical truncation and the conditional-Value-at-Risk / stress-test shape.

```sql
CREATE TABLE air_quality(sensor int, reading provsql.random_variable);
INSERT INTO air_quality VALUES (1, provsql.normal(12.0, 4.0)) /* … */;

-- E[reading | reading >= 0]: a physical sensor cannot report a negative
-- concentration. The same-RV comparator event is what RangeCheck folds
-- into a truncation.
SELECT a.sensor,
       expected(a.reading,
                (SELECT provenance() FROM air_quality v
                  WHERE v.sensor = a.sensor AND v.reading >= 0))
  FROM air_quality a;

-- Conditional VaR: expected loss given a market crash, and its tail
-- histogram, both via the same conditioning argument.
SELECT expected(portfolio_loss, crash.tok),
       rv_histogram(portfolio_loss, 30, crash.tok)
  FROM positions,
       (SELECT provenance() AS tok FROM market
         WHERE index_return < -0.10) crash;
```

Because `agg_token` also accepts `prov`, conditional moments of a discrete GROUP BY aggregate (`E[SUM(x) | event]`) work the same way.
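
As a sanity check on the truncation example, the conditional mean `E[reading | reading >= 0]` has a textbook closed form for a normal distribution, and rejection sampling should agree with it. The Python sketch below is independent of ProvSQL; it assumes the second argument of `provsql.normal(12.0, 4.0)` is a standard deviation (the source does not say), and the function names are hypothetical.

```python
import math
import random

def truncated_mean_lower(mu, sigma, a):
    """Closed-form E[X | X >= a] for X ~ Normal(mu, sigma), sigma = std dev."""
    alpha = (a - mu) / sigma
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(alpha / math.sqrt(2.0))  # P(X >= a)
    return mu + sigma * pdf / tail

def rejection_mean(mu, sigma, a, n, seed=0):
    """Estimate E[X | X >= a] by discarding draws that violate the event.

    Returns the estimate and the acceptance rate; the rate is what
    collapses when the evidence is rare.
    """
    rng = random.Random(seed)
    kept = [x for x in (rng.gauss(mu, sigma) for _ in range(n)) if x >= a]
    return sum(kept) / len(kept), len(kept) / n

# Assumption: Normal(mean=12.0, std=4.0), truncated at zero.
closed_form = truncated_mean_lower(12.0, 4.0, 0.0)
sampled, acceptance = rejection_mean(12.0, 4.0, 0.0, 200_000)
```

Under that assumption the truncation point sits three standard deviations below the mean, so the closed form comes out at about 12.018, barely above the unconditional mean, and nearly every draw is accepted. Moving the threshold far into the upper tail keeps the closed form cheap while the acceptance rate of the sampler drops towards zero.
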

#### 6.A.2 Key / functional-dependency constraints via `repair_key`

A hard key constraint – "at most one tuple per key", MarkoViews' weight-0 denial view `V2` ("a person has one advisor") in its keyed special case – is `repair_key` today. It turns a table into mutually-exclusive, renormalised blocks, after which any `probability_evaluate` is implicitly conditioned on the FD holding.

```sql
SELECT repair_key('advisor_candidates', 'advisee');
-- each advisee's candidate advisors are now a BID block; downstream
-- probabilities are conditioned on "advisee -> advisor"
```

The TID/BID classifier in `classify_query.c` already separates mutual-exclusion (BID) from independence (TID); BID blocks are exactly the weight-0, single-attribute instance of constraint conditioning.

### 6.B What needs the conditioning primitive

#### 6.B.1 Conditional probability of a discrete answer, `P(Q | C)`

`probability_evaluate(token, method, arguments)` has **no** `prov` argument – the discrete twin of §6.A.1 is the missing piece. Today it is a clumsy two-call `P(Q ∧ C) / P(C)`, and nothing stops someone reaching for `monte_carlo` on a negative-weight (MarkoViews) circuit.

```sql
-- Entity resolution: "P(records 42 and 88 match | 17 and 42 confirmed)".
-- matches(a, b) is a probabilistic table (add_provenance + set_prob),
-- correlated through a transitivity rule.
SELECT probability_evaluate(times(m.provsql, ev.tok))   -- P(Q ∧ C)
     / probability_evaluate(ev.tok)                     -- P(C)
  FROM matches m,
       (SELECT provsql AS tok FROM matches WHERE a = 17 AND b = 42) ev
 WHERE m.a = 42 AND m.b = 88;
```

The fix is a `probability_evaluate(token, prov => …)` overload lowering to exactly this, with the exact-only guard (§2, lessons 1 and 3). This is the canonical MarkoViews / MayBMS scenario: conditioning a PDB on newly-observed certain facts.

#### 6.B.2 A conditioned distribution that flows onward

§6.A.1 conditions a *moment*; it cannot return a `random_variable` to store or compose.
So **sequential Bayesian update is not expressible today** – each step needs the posterior as a first-class value. This is the core motivation for `gate_conditioned` and for materialised conditional tables.

```sql
-- needs gate_conditioned: the posterior is itself a random_variable
WITH RECURSIVE belief(step, dist) AS (
  SELECT 0, provsql.normal(20.0, 5.0)   -- prior on the quantity
  UNION ALL
  SELECT u.step + 1, provsql.condition(u.dist, o.evidence_token)
    FROM belief u, observation_log o
   WHERE o.confirmed AND o.step_idx = u.step + 1
)
SELECT expected(dist), variance(dist)
  FROM belief
 WHERE step = (SELECT max(step) FROM belief);

-- materialised conditional table: fold a (probabilistic) positive test
-- into a patient's stored risk distribution
UPDATE patient_risk
   SET risk = provsql.condition(
     risk,
     (SELECT provenance() FROM tests
       WHERE patient_id = 1 AND result = 'positive'))
 WHERE patient_id = 1;
```

#### 6.B.3 Arbitrary denial constraints, beyond keys

`repair_key` (§6.A.2) covers key FDs only. A general denial constraint – "no two overlapping bookings", "an advisor must have been faculty" – is conditioning on an arbitrary UCQ no-violation event, the full MarkoViews `¬W`. There is no surface for it today; it wants the same `condition(token, no_violation_event)` plumbing as §6.B.1 with the constraint circuit supplied by a helper.

#### 6.B.4 Explainable inference: Shapley over evidence

`shapley(token, var, …)` exists but has no conditioning argument. Once §6.B.2 lands, "which observation most shifted the posterior expected risk?" is `shapley` over a `gate_conditioned` root – connecting code over existing machinery, unique to a provenance-aware system (`continuous_distributions.md` §E.1).

```sql
-- after gate_conditioned: posterior_risk is a conditioned random_variable
SELECT evidence_id,
       provsql.shapley(posterior_risk, evidence_id, payoff => 'expected')
  FROM posteriors, evidence_atoms
 WHERE patient_id = 1;
```

#### 6.B.5 Soft evidence / likelihood weighting

Evidence is rarely certain: a test trusted at 90% is soft conditioning, not hard. Discretely this is a finite-weight MarkoView; continuously it is the weighted gate of §5, reweighting rather than truncating. Neither exists today.

```sql
-- update belief with an observation trusted at 90%
SELECT provsql.condition(prior, evidence_token, weight => 0.9);
```

## Priorities

1. **Hard conditioning on the continuous carrier (§3).** The most leverage per architectural unit: it collapses the existing RangeCheck truncation codepath into one general mechanism, promotes the existing `prov` moment-conditioning argument (§6.A.1) from "moment only" to "a distribution that flows onward", and unblocks §6.B.2. Lands after the §F.1 per-distribution refactor in `continuous_distributions.md`, alongside the first architectural batch. The single soundness risk is the `FootprintCache` back-off; ship it with the `cond(X,A)*cond(Y,A)` regression test.
2. **Discrete event-conditioning as a thin SQL surface (§2, lesson 1).** A `probability_evaluate(token, prov => …)` overload lowering to `P(Q ∧ C) / P(C)` on the existing dispatcher, with the exact-only guard for out-of-range inputs. No new gate. Delivers §6.B.1 and §6.B.3 (the latter once a no-violation-event helper exists).
3. **Materialised conditional tables (§3).** The MayBMS-style assertion: store and re-query a conditioned object. Needed once §6.B.2's running posteriors want persistence.
4. **Soft / weighted conditioning (§5).** Parameter on the same gate; bridges to importance weighting and soft evidence (§6.B.5). Lower priority, no new mechanism.
5. **Shapley over evidence (§6.B.4).** Research track; connecting code over (1) plus existing Shapley infrastructure.
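
The soundness risk named in priority 1 rests on a fact that can be checked exactly in a toy model: two independent draws stop being independent once both are conditioned on evidence that mentions them. The Python sketch below is not the proposed regression test itself, only the arithmetic behind it. It assumes `X` and `Y` are independent fair 0/1 draws, takes `A = (X + Y >= 1)` as the shared evidence, and reads `cond(X, A) * cond(Y, A)` as the product evaluated inside the worlds where `A` holds; `cond_expect` is a hypothetical helper.

```python
from fractions import Fraction
from itertools import product

# Four equally likely worlds over two independent fair draws (x, y).
WORLDS = [((x, y), Fraction(1, 4)) for x, y in product((0, 1), repeat=2)]

def cond_expect(f, event):
    """E[f | event] by exact enumeration over WORLDS."""
    mass = sum(p for world, p in WORLDS if event(world))
    return sum(p * f(world) for world, p in WORLDS if event(world)) / mass

evidence = lambda w: w[0] + w[1] >= 1  # A touches the footprint of X and of Y

e_x = cond_expect(lambda w: w[0], evidence)
e_y = cond_expect(lambda w: w[1], evidence)
e_xy = cond_expect(lambda w: w[0] * w[1], evidence)

# What a structural-independence shortcut would return if it ignored
# that the evidence widens both footprints.
shortcut = e_x * e_y
```

Here the true conditional expectation of the product is 1/3 while the shortcut yields 4/9, so a multiplication rule that treats the two conditioned operands as independent gives a wrong answer rather than a slow one.
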

## Implementation observations

- The MarkoViews reduction shows the discrete carrier wants *no new gate*: `P(Q | C) = P(Q ∧ C) / P(C)` is two existing evaluations and a division. Resist adding a discrete conditioning gate; reserve the gate for distribution-valued and materialised results.
- Conditioning is the canonical example of a circuit operation that *defeats* independence shortcuts. Any optimisation keyed on `FootprintCache` disjointness must treat `cond` as widening the footprint to the union; the regression test belongs with the feature, not after it.
- Negative / out-of-`[0,1]` input probabilities are sound for the exact methods and only for them. A conditioned circuit (or a MarkoViews-style augmented one) should carry a flag that the dispatcher reads to refuse sampling-based methods, paralleling the rare-evidence rejection warning on the continuous side.
- Safety / inversion-freedom must be re-evaluated on the conditioned circuit, never on the bare query. The augmentation is a cheap root `plus`, but the safe-query analysis is not invariant under it.
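
The claim that exact evaluation survives negative synthetic probabilities can be checked on the smallest possible MarkoView. The Python sketch below is an illustration under stated assumptions, not ProvSQL or MarkoViews code: two independent tuples `a` and `b`, one view over `a ∧ b` with weight `w = 3`, and brute-force enumeration standing in for an exact method. `mln_prob` and `reduced_prob` are hypothetical names. The first computes `Φ/Z` directly; the second applies the §2 reduction with an `NV` tuple of probability `1 - w`.

```python
from fractions import Fraction
from itertools import product

def mln_prob(query, p_a, p_b, w):
    """Direct semantics: world weight times w when the view body a AND b holds."""
    z = Fraction(0)
    num = Fraction(0)
    for a, b in product((False, True), repeat=2):
        phi = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)
        if a and b:
            phi *= w
        z += phi
        if query(a, b):
            num += phi
    return num / z

def reduced_prob(query, p_a, p_b, w):
    """Reduction: add independent NV with probability 1 - w, condition on not W."""
    p_nv = 1 - w  # negative whenever w > 1

    def p0(event):
        total = Fraction(0)
        for a, b, nv in product((False, True), repeat=3):
            p = ((p_a if a else 1 - p_a)
                 * (p_b if b else 1 - p_b)
                 * (p_nv if nv else 1 - p_nv))
            if event(a, b, nv):
                total += p
        return total

    violated = lambda a, b, nv: nv and a and b  # W
    q_or_w = lambda a, b, nv: query(a, b) or violated(a, b, nv)
    return (p0(q_or_w) - p0(violated)) / (1 - p0(violated))

half = Fraction(1, 2)
weight = Fraction(3)
direct = mln_prob(lambda a, b: a, half, half, weight)
via_reduction = reduced_prob(lambda a, b: a, half, half, weight)
```

Both routes give 2/3 for the marginal of `a`, even though the intermediate `P₀(W)` is `-1/2`. A sampler has no way to draw a tuple with probability `-2`, which is the concrete reason such circuits need the exact-only guard.
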