# Joint-width UCQ compiler — single-sweep (shared-work) per-answer evaluation

Design + implementation notes for the "single sweep" per-answer optimisation.

> **Final state.** The top-down single DP (§3) is the materialisation engine
> behind the **transparent** per-answer route: the planner substitutes the C
> `ucq_joint_provenance_answer`, which on a query's first output group gathers
> the facts once (`ucq_joint_gather`), runs the single DP, and materialises
> every answer's certified d-D into the store, caching `head -> token` in
> `fn_extra`; each group is then an O(1) lookup. Probability / Shapley /
> expectation are the standard evaluation on the returned token. The
> standalone probability/answers SRFs and the wall-clock benchmarks named in
> the sections below were development scaffolding and have been removed; the
> sections remain as the design narrative (the shared-decomposition and
> top-down constructions, their correctness arguments, and the measured
> speed-ups). The width-stats helpers (`ucq_joint_compile_stats`) are kept.

## 0. Where we are

Per-answer evaluation (`SELECT head, probability_evaluate(provenance()) …
GROUP BY head`) is currently **per-binding**: for every answer the head is
pinned to its value and the *Boolean* compiler is rerun. Two SQL surfaces:

- **`ucq_joint_answers(query, head_vars, head_tuples, facts…)`** — the facts
  are gathered **once** (passed as columnar arrays) and the C compiler is run
  once **per head tuple** (each pins the head with a `Sel` atom + a CERTAIN
  fact and calls the Boolean `ucq_joint_evaluate`). One gather, *k* compiles.
- **transparent `ucq_joint_provenance_answer(desc, head_vars, head_vals, …)`**
  — substituted by the planner once per output group, so the relations are
  **re-gathered per answer**. *k* gathers, *k* compiles.

So the cost of *k* answers is, per binding,

```
  per-group transparent :  k · (gather + encode + decompose + DP)
  ucq_joint_answers     :      gather + k · (encode + decompose + DP)
```

## 1. What "single sweep" can share

The work an answer does decomposes into four stages, in increasing share-ability:

| stage | depends on the head value? | shareable across answers? |
|---|---|---|
| gather (SQL reads the relations) | no | **yes** — gather once |
| encode (`JointEncoding`, slice walk) | no (without the `Sel` pin) | **yes** |
| decompose (joint graph + min-fill tree decomposition) | no | **yes** |
| DP (bag sweep, gate emission) | **yes** | **yes** via the single top-down DP (§3) |

The current `Sel`-pin makes *encode* and *decompose* depend on the head (the
pin adds a fact + atom whose element is the head value), which is why
`ucq_joint_answers` cannot share them. Replacing the `Sel` pin with an
**in-DP head pin** — a `(var → element)` constraint that `closeDisjunct`
enforces, rejecting any binding of the head variable to a different element —
removes that dependency: the encoding and the decomposition become identical
for every answer, and only the DP is rerun, pinned, per head:

```
  shared-decomposition  :  gather + encode + decompose + k · DP_pinned
```

This is the implementation here. It is **correct by construction**: a pinned
DP run computes exactly `P(∃ witness with head = v)` — the same marginal the
`Sel`-pin computes — so each answer's probability is unchanged; only the
shared stages are paid once. The 223-test suite (all Boolean, i.e. the
empty-pin path) guards that the Boolean compiler is untouched.

## 2. The in-DP head pin

`closeDisjunct` is the *only* place a query variable receives a fresh binding
from a fact (the join/forget transitions only propagate existing bindings).
Pinning head variable `H` to element `v` is therefore a single guard there:
when an atom would bind `H` to bag position `p`, require `domain[p] == v`,
else drop that extension. The pin is threaded read-only through the sweep;
with an empty pin the guard never fires and the Boolean behaviour is verbatim.

A head with several variables (a multi-column `GROUP BY`) pins each component.
Non-answer head values (probability 0) fall out as the empty/false root and
are dropped, exactly as `ucq_joint_answers` drops them.

## 3. The full top-down single-DP (implemented: `compileAnswersOneDP`)

The shared-decomposition path still pays *k* bottom-up DP sweeps. The thesis's
multi-output construction (Amarilli, tel-01345836, §4.2.9) shares the DP
**itself**, emitting one circuit root per answer from a single sweep. The
obstacle was that the bottom-up state's hom-set **mixes head bindings**: a
single state carries many partial maps that may bind the head variable to
*different* elements. The implementation makes the head a **state-level key**:

- **The head is never existentially projected.** A head variable is bound like
  any other while in the bag, but when its element leaves the decomposition it
  is set `DONE` with its element **value recorded** in the code's `hval`
  (rather than collapsed to a value-less `DONE`). So different head bindings
  live in different states and never merge.
- **Completed answers are tracked per head-tuple.** Instead of the global `sat`
  collapse, a code that reaches the full witnessed mask is a *completion*: its
  head tuple is added to the state's `done` set and the code is discharged.
  `done` is part of the state key, so worlds are partitioned by which answers
  they satisfy.
- **An answer is emitted as its own root** when its head elements have all left
  the tree **and** no surviving partial code can still witness it. The second
  condition is the subtle one: a witness can complete *after* the head element
  is forgotten (the head value lives on in `hval`, and the rest of the witness
  may sit in a higher bag). Emitting on element-departure alone splits one
  answer's provenance across the lifts of its several witnesses, double-counting
  the overlap. At the root (empty parent) nothing can complete, so every
  remaining answer is emitted there.
- **All answer roots share one circuit.** A single probability pass values them
  all: `dDNNF::probabilityEvaluation` caches per-gate values, so evaluating the
  *k* roots in sequence touches each shared gate once.

The candidate answers are **discovered** by the sweep (no candidate list). The
state count grows with the number of *concurrently live* answers (those with a
pending witness crossing the current bag), which is bounded by the treewidth,
not by *k*. So the whole evaluation is **one pass, linear in the data** rather
than *k* passes.

C surface: `UCQJointCompiler::compileAnswersOneDP` + SRF
`ucq_joint_answers_onedp(query, head_vars, facts…)`.

### Correlated regime (`compileAnswersOneDPCorrelated`)

The same single-DP carries to the correlated regime, merging the answer
machinery (head-as-state-key `hval`, per-tuple `done`) with the
valuation/suspicious gate DP of `mergedCompile` (the world variables are the
slice INPUT leaves; a fact is present iff its slice gate is true). One extra
subtlety makes it correct: **a witness gate vertex can outlive the head
element**. A fact's gate is a separate vertex in the joint graph, and the
input literal it folds into the running provenance is added only when that gate
vertex leaves the bag -- which can be *above* the bag where the head element
left. Emitting at head-element departure (as the data-graph path does) would
freeze the answer's gate before its witness gates fold in, over-counting (a
single fact `R(0)=e0` came out 1.0 instead of 0.5). The fix is a **component
barrier**: an answer is emitted only when its whole connected component
(elements *and* gate vertices, union-found over each fact's clique) has left
the bag, by when every component gate has folded into the state gate. This
keeps independent answers in separate components (no co-carrying) while
correlated answers share a component and are carried together -- exactly the
correlation structure. The dispatch is in `compileAnswersOneDPImpl` (-> 
`compileAnswersOneDPCorrelated`); SRF `ucq_joint_answers_onedp_tracked` walks
the real tokens once (shared `buildTrackedFacts`) and runs the single merged
sweep. Cross-checked exact against the per-binding sweep and the ladder over
five correlated shapes (H0, self-join, RSTU chain, union, two-column head, all
joint treewidth 2) in `ucq_joint_correlated.sql`.

### Measured (same H0 family, w=2, p=0.5)

| k | per-binding ms | swept ms | onedp ms | onedp vs swept |
|---|---|---|---|---|
| 16   | 9.1      | 2.7     | 1.1   | 2.5× |
| 64   | 109.2    | 29.5    | 2.7   | 11×  |
| 256  | 1596.0   | 446.7   | 8.5   | 53×  |
| 1024 | 27047.7  | 7297.7  | 34.1  | 214× |
| 4096 | 490685.0 | 129590.0| 152.0 | 853× |

`onedp` is **linear** in *k* (≈4× per 4× answers) while per-binding and swept
are **quadratic** (each of the *k* passes sweeps all *k·w* facts). At k=4096 the
single DP is 853× faster than the shared-decomposition sweep and ~3200× faster
than per-binding, and the gap widens without bound. All three agree on every
answer. Cross-checked against per-binding in `ucq_joint_answers` section 19 and,
on a treewidth-2 instance with correlated answers, against the standard ladder.

### Measured, correlated regime (R(x) an internal times of shared events)

`test/bench/ucq_joint_onedp_correlated_bench.sql` -- the merged single sweep
vs the *k* pinned merged sweeps of `ucq_joint_answers_swept_tracked`:

| k | swept ms | onedp ms | onedp vs swept |
|---|---|---|---|
| 16  | 13.6     | 2.9   | 4.7× |
| 64  | 179.9    | 8.0   | 22×  |
| 256 | 2640.6   | 30.5  | 87×  |
| 512 | 10841.5  | 61.3  | 177× |

Same linear-vs-quadratic separation as the data-graph regime (177× at k=512),
both exact.

## 4. Scope of the implementation

Both regimes share the decomposition the same way; the in-DP head pin is the
single primitive that makes it sound (it lives in `closeDisjunct`, which both
regimes use to bind a query variable from a fact).

- **Data-graph (TID/BID) regime.** New C SRF `ucq_joint_answers_swept(query,
  head_vars, facts…)` → `SETOF (head int[], probability float8)`:
  gather/encode/decompose once, loop the pinned DP over the distinct head
  tuples in the data. Cross-checked against `ucq_joint_answers` (per-binding):
  identical answers and probabilities.
- **Correlated regime (internal-gate tokens).** `mergedCompile` gained the
  same optional `shared_td` / `shared_elim` / `head_pin` parameters and threads
  the pin through its `closeFact` → `closeDisjunct`; `compileImpl` forwards them
  and `compileAnswers` accepts the correlated encoding. The joint
  graph spans **data and circuit slice**, but neither depends on the head, so
  it is built once and only the pinned merged DP (the per-world gate valuation +
  hom DP) is rerun per answer. New C SRF `ucq_joint_answers_swept_tracked(query,
  head_vars, head_tuples, fact_rel, fact_elems, fact_arity, fact_tokens)`: the
  fact tokens are real provenance gates, walked through the circuit **once**
  (shared `buildTrackedFacts` helper), the correlated encoding + decomposition
  built once, each answer a pinned sweep. Cross-checked against the standard
  ladder per group (`test/sql/ucq_joint_correlated.sql`): exact, e.g. two
  answers q(0)=0.125 (shared e0) and q(5)=0.5 (three facts sharing e3).

## 5. Benchmark (single sweep vs multiple passes)

`test/bench/ucq_joint_single_sweep_bench.sql`: a per-answer family with a
growing number of answers *k*, reporting wall-clock for

1. per-group transparent (k gathers + k compiles),
2. `ucq_joint_answers` per-binding (1 gather + k compiles),
3. `ucq_joint_answers_swept` shared-decomposition (1 gather + 1 decompose + k
   pinned DPs),

all three agreeing on the answer set and the marginals.

### Measured (q(x) :- R(x), S(x,y), T(y), w=2 private witnesses, p=0.5)

| k (answers) | transparent ms | per-binding ms | swept ms | per-binding / swept |
|---|---|---|---|---|
| 16   | 110.4  | 9.2     | 2.8    | 3.3× |
| 64   | 1359.5 | 121.0   | 35.0   | 3.5× |
| 128  | 5250.2 | 478.4   | 127.2  | 3.8× |
| 256  | (skipped) | 1826.3  | 500.7  | 3.6× |
| 512  | (skipped) | 7151.1  | 1831.4 | 3.9× |
| 1024 | (skipped) | 28232.6 | 7278.3 | 3.9× |

The single sweep is a steady **~3.3–3.9×** over per-binding, the ratio widening
with *k* as more identical encode+decompose passes are amortised onto one. Both
remain ~quadratic in *k* here (the dense encoding still carries all *k·w* facts,
and the pinned DP still sweeps them once per answer — the shared-decomposition
pass removes the *repeated* encode+decompose constant, not the per-answer DP;
killing the latter is the top-down single-DP of §3, measured above). Per-binding is in
turn 10–40× faster than the re-gather-per-group transparent path, whose constant
is dominated by *k* planner substitutions and *k* full relation gathers. The
transparent path is skipped past k=128 (it is O(k) gathers of O(k·w) data, i.e.
the row already takes 5 s at k=128 and would dominate the run).

### Correlated regime (R(x) gated by an internal times of shared base events)

`test/bench/ucq_joint_single_sweep_correlated_bench.sql` times the same SRF
called once-per-answer (k walks + k decompositions + k DPs) against one call
with all candidates (1 walk + 1 decomposition + k pinned merged DPs) — the two
differ only by the shared walk+encode+decompose over the joint data+circuit
graph:

| k (answers) | per-binding ms | swept ms | per-binding / swept |
|---|---|---|---|
| 16  | 27.9   | 12.3   | 2.3× |
| 64  | 452.4  | 171.0  | 2.6× |
| 128 | 1588.8 | 664.7  | 2.4× |
| 256 | 6466.1 | 2593.4 | 2.5× |

A steady **~2.3–2.6×**, lower than the data-graph ~3.9× because the pinned
merged DP (per-world gate valuation + hom DP) is a larger share of each answer's
cost than the plain hom DP, so amortising the shared stages saves a smaller
fraction of the whole. Both methods agree on every answer.