# GraphRAG on `sorted_heap` This note evaluates a narrow question: > Can current `sorted_heap` + current vector search already support a useful > GraphRAG-style retrieval workflow, or do we need a new storage/model layer? The conclusion so far is: - **1-hop fact retrieval by source entity already fits `sorted_heap` well.** - **Naive SQL join-based multi-hop expansion does not expose much advantage.** - **`ANY(array_of_seed_ids)` expansion does trigger `SortedHeapScan`, but on warm and medium-scale local benchmarks it still loses to heap+btree on end-to-end latency despite reading fewer blocks.** - Narrow C helpers for expansion and fused top-K rerank now exist as: - `sorted_heap_expand_ids(...)` - `sorted_heap_expand_rerank(...)` - A one-call convenience wrapper now exists as: - `sorted_heap_graph_rag_scan(...)` - Those helpers materially improve the `sorted_heap` path on the synthetic GraphRAG benchmark, though pure heap+btree expansion is still faster on this synthetic workload. - Therefore the next promising primitive was correctly **a narrow C helper**, not a new graph storage engine and not a giant monolithic `graph_rag_scan()` API. ## Existing anchors The repository already has the main building blocks: 1. **Zone-map pruning on `sorted_heap`** - planner hook + `SortedHeapScan` custom scan - supports base-relation restriction on the leading PK columns 2. **Planner-integrated ANN via `sorted_hnsw`** - exact ordered results - works on both heap tables and `sorted_heap` tables 3. **Legacy graph traversal precedent** - `svec_graph_scan()` in `pq.c` - this is for ANN sidecar graph navigation, not fact graphs - still useful as evidence that the extension can host graph-like traversal logic in C ## What was benchmarked Synthetic fact graph schema: ```sql CREATE TABLE facts_heap ( entity_id int4 NOT NULL, relation_id int2 NOT NULL, target_id int4 NOT NULL, embedding svec(32) NOT NULL, payload text NOT NULL, PRIMARY KEY (entity_id, relation_id, target_id) ); CREATE TABLE facts_sh ( entity_id int4 NOT NULL, relation_id int2 NOT NULL, target_id int4 NOT NULL, embedding svec(32) NOT NULL, payload text NOT NULL, PRIMARY KEY (entity_id, relation_id, target_id) ) USING sorted_heap; ``` Both tables also receive the same ANN index: ```sql CREATE INDEX ... USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64); ``` Benchmark harness: - [`scripts/bench_graph_rag.py`](../scripts/bench_graph_rag.py) - local ephemeral PostgreSQL 18 temp cluster - deterministic synthetic fact graph - compares: - `hop1_entity` - `hop1_entity_relation` - `hop2_join` - `hop2_in` - `seed_expand_join` - `seed_expand_in` - `seed_expand_rerank_join` - `seed_expand_rerank_in` - `seed_expand_fn` - `seed_expand_rerank_fn` - `seed_expand_rerank_topk_fn` - `seed_graph_rag_scan_fn` The key comparison is between: - **join-shaped expansion** - **`ANY(array(seed_ids))` expansion** The second shape is the one that allows `sorted_heap` to expose its pruning logic directly on `entity_id`. 
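For concreteness, the two shapes differ only in how the seed set reaches the fact scan. The following is an illustrative sketch, not the harness's literal query text; `$1` stands for the query embedding and the seed `LIMIT` is arbitrary:

```sql
-- Join-shaped expansion: seeds arrive through a join, so the restriction on
-- entity_id never lands as a prunable base-relation qual for SortedHeapScan.
SELECT f.*
FROM facts_sh f
JOIN (SELECT target_id
      FROM facts_sh
      ORDER BY embedding <=> $1::svec
      LIMIT 8) s
  ON f.entity_id = s.target_id;

-- ANY(array)-shaped expansion: the same seeds collapsed into an array, which
-- does land as a qual on the leading PK column and triggers SortedHeapScan.
SELECT f.*
FROM facts_sh f
WHERE f.entity_id = ANY (ARRAY(SELECT target_id
                               FROM facts_sh
                               ORDER BY embedding <=> $1::svec
                               LIMIT 8));
```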
## Local findings

### Small smoke run

On a tiny graph (`300` entities, `4` edges/entity):

- `facts_sh` reduced buffer hits strongly for:
  - `hop1_entity`
  - `hop1_entity_relation`
  - `hop2_in`
  - `seed_expand_in`
- but end-to-end latency stayed close to heap because the whole dataset was fully warm and tiny

Most importantly:

- **join-shaped expansion largely erased the `sorted_heap` advantage**
- **`ANY(array(...))` expansion preserved `SortedHeapScan`**

### Medium warm run

On `20K` entities, `8` edges/entity (`160K` rows total), warm local cache:

- `hop1_entity`
  - heap: `Index Scan`
  - sorted_heap: `Custom Scan:SortedHeapScan`
  - sorted_heap reads fewer blocks at roughly latency parity
- `seed_expand_join`
  - bad shape for both
  - sorted_heap is not meaningfully better
- `seed_expand_in`
  - sorted_heap does use `SortedHeapScan`
  - buffer footprint drops
  - but **heap+btree still wins on total latency**

This means:

> current SQL shape can make `sorted_heap` read less, but executor/custom-scan
> overhead can still dominate the total time on warm-medium datasets

### Medium run with lower shared buffers

On `20K` entities, `16` edges/entity (`320K` rows total), `shared_buffers=64MB`:

- `hop1_entity`
  - sorted_heap stayed strong: fewer hits, same-or-better latency
- `seed_expand_join`
  - both paths were much worse
  - heap and sorted_heap were similar, with read noise dominating
- `seed_expand_in`
  - heap: lower latency
  - sorted_heap: fewer touched blocks / lower expansion footprint
  - but **still slower end-to-end**

This is the most important current result:

> On a graph larger than a warm toy dataset, `sorted_heap` already shows the
> expected locality/pruning behavior for seed expansion, but the current
> SQL + `CustomScan` path is not enough to turn that into a consistent latency
> win over heap+btree.

## Design implications

### What not to build first

1. **Not a new graph storage engine**
   - current evidence does not justify that jump
   - 1-hop retrieval is already good on current storage
2. **Not a giant monolithic `svec_graph_rag_scan()`**
   - it would have to combine:
     - ANN seed retrieval
     - graph expansion
     - rerank
   - this is a large surface area
   - it also risks duplicating planner/index logic from `sorted_hnsw`

### What to build next

The next narrow primitive should be something like:

```sql
sorted_heap_expand_ids(
    rel regclass,
    seed_ids int4[],
    relation_filter int2 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
```

Why this shape:

- ANN seed retrieval can stay in SQL:
  - `SELECT target_id FROM facts ORDER BY embedding <=> $query LIMIT K`
- expansion becomes a dedicated low-overhead C primitive
- it avoids:
  - repeated executor/planner setup
  - generic `CustomScan` overhead for this narrow use case
- it keeps the product boundary small:
  - “expand these known entity IDs quickly”

That primitive can later be composed into:

1. SQL-only GraphRAG
2. a higher-level helper
3.
maybe a monolithic API if the narrow primitive proves valuable ## Helper result The narrow helpers now exist: ```sql sorted_heap_expand_ids( rel regclass, seed_ids int4[], relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, embedding svec, payload text ) ``` and: ```sql sorted_heap_expand_rerank( rel regclass, seed_ids int4[], query svec, top_k int4, relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` and: ```sql sorted_heap_expand_twohop_rerank( rel regclass, seed_ids int4[], query svec, top_k int4, hop1_relation_filter int4 DEFAULT NULL, hop2_relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` and: ```sql sorted_heap_graph_rag_scan( rel regclass, query svec, ann_k int4, top_k int4, relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` Their current contract is intentionally narrow: - relation must be a `sorted_heap` table - relation must expose the columns: - `entity_id int4` - `relation_id int2` - `target_id int4` - `embedding svec` - `payload text` - the function reuses the zone-map range builder directly - it emits fact rows for known source entity IDs On the medium-pressure benchmark (`20K` entities, `16` edges/entity, `320K` rows, `shared_buffers=64MB`, fresh backend, `runs=3`), the helpers produced: - `facts_heap seed_expand_in`: `0.123 ms` - `facts_sh seed_expand_in`: `0.285 ms` - `facts_sh seed_expand_fn`: `0.165 ms` - `facts_sh seed_expand_rerank_in`: `0.369 ms` - `facts_sh seed_expand_rerank_fn`: `0.234 ms` - `facts_sh seed_expand_rerank_topk_fn`: `0.139 ms` - `facts_sh seed_graph_rag_scan_fn`: `0.144 ms` Interpretation: - `sorted_heap_expand_ids()` converts the observed block-pruning/locality advantage into a **real latency win over the current SQL + CustomScan path** - `sorted_heap_expand_rerank()` removes most of the remaining rerank overhead and is now materially faster than the current `sorted_heap` SQL rerank path (`0.139 ms` vs `0.369 ms`) - `sorted_heap_graph_rag_scan()` is only slightly slower than the direct fused helper composition (`0.144 ms` vs `0.139 ms`), so the convenience API does not erase the win - pure heap+btree expansion is still faster on this synthetic workload (`0.123 ms` vs `0.165 ms`) Relation-filtered probes narrow that gap further: - `facts_heap seed_expand_rel_in`: `0.074 ms` - `facts_sh seed_expand_rel_in`: `0.151 ms` - `facts_sh seed_expand_rel_fn`: `0.108 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.087 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.167 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.104 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.120 ms` So the relation-filtered GraphRAG path is materially better than the current SQL + `CustomScan` form, but it still does not clearly beat heap+btree on this synthetic corpus. The filtered helper path is nevertheless close enough that a real fact graph, wider payloads, or colder cache state may flip the comparison. Payload-width sensitivity does matter, but not monotonically. The benchmark harness now supports `--payload-bytes` to widen synthetic fact rows and test the claim that locality should matter more once facts stop being tiny strings. 
On the same medium-pressure setup (`20K` entities, degree `16`, `320K` rows, `shared_buffers=64MB`, fresh backend): - with `payload_bytes=1024` - `facts_heap seed_expand_in`: `0.188 ms` - `facts_sh seed_expand_in`: `0.185 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.120 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.100 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.125 ms` - with `payload_bytes=2048` - `facts_heap seed_expand_in`: `0.113 ms` - `facts_sh seed_expand_in`: `0.208 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.090 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.122 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.127 ms` Interpretation: - a wider inline payload can make `sorted_heap` competitive or slightly better on seed expansion - but the effect is not monotonic, so "wider payload always helps sorted_heap" is false on this synthetic generator - this synthetic text filler is still a weak proxy for real fact payloads because compression/TOAST behavior can change the balance again So the next falsifier should be a real-dataset GraphRAG harness or a more realistic payload model, not another synthetic-only extrapolation. ## Real-text Gutenberg graph A better falsifier now exists in: - [`scripts/bench_graph_rag_gutenberg.py`](../scripts/bench_graph_rag_gutenberg.py) This harness uses real Gutenberg paragraphs instead of synthetic payload text. It builds a small text graph: - relation `1`: `book -> paragraph` (`contains`) - relation `2`: `paragraph -> next_paragraph` (`next`) Embeddings are still deterministic lexical hash vectors, not external model embeddings. That means this harness is good for measuring graph-expansion latency on real text payloads and a real graph topology, but it is not a semantic-quality benchmark. Two useful runs on `shared_buffers=64MB`, fresh backend: `64 books x 128 paragraphs/book` (`14,549` rows): - `facts_heap seed_expand_rerank_rel_in`: `0.071 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.088 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.061 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.084 ms` `128 books x 256 paragraphs/book` (`58,954` rows): - `facts_heap seed_expand_rel_in`: `0.073 ms` - `facts_sh seed_expand_rel_in`: `0.078 ms` - `facts_sh seed_expand_rel_fn`: `0.069 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.079 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.101 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.063 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.089 ms` This is the first non-synthetic result that materially weakens the earlier "heap+btree simply wins" story: - the plain `sorted_heap` SQL path is still worse than heap+btree - but the fused filtered helper on the real-text Gutenberg graph is already at parity or slightly better than heap+btree on the rerank path - the one-call wrapper is close enough that its overhead is visible but not disqualifying So the narrow-helper direction survives the real-text falsifier better than the short-payload synthetic benchmark suggested. ## pgvector parity on the real-text graph The Gutenberg harness also now supports a comparable `pgvector` path on the same graph: - ANN seeds come from a `facts_pgv` table with `vector(dim)` + HNSW - graph expansion and exact rerank still happen in PostgreSQL over the fact rows, which is the relevant GraphRAG shape This is important because a pure ANN benchmark would miss the real product question: how expensive is "ANN seed + graph expansion + exact rerank" as one workflow? 
On fresh-backend runs with `shared_buffers=64MB`:

`64 books x 128 paragraphs/book` (`14,549` rows):

- heap rerank baseline: `0.064 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.060 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.075 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.180 ms`

`128 books x 256 paragraphs/book` (`58,954` rows):

- heap rerank baseline: `0.085 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.071 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.087 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.295 ms`

The buffer footprint matches the latency story:

- the `sorted_heap` helper path stays around hundreds of shared-buffer hits
- the `pgvector` path needs several thousand shared-buffer hits before the same exact rerank step

This does **not** mean `pgvector` is bad at pure ANN. It means that for this GraphRAG workload shape, once the seed stage is followed by relational graph expansion and exact rerank, the narrow `sorted_heap` helper path is materially better aligned with the whole workflow than an external ANN seed on a separate table.

## zvec parity on the real-text graph

The same Gutenberg harness now also supports a comparable `zvec` path:

- ANN seeds come from a temporary `zvec` HNSW collection built from the same fact rows
- graph expansion and exact rerank still happen in PostgreSQL over `facts_heap`

This produced a mixed but useful result.

On the medium real-text slice (`64 books x 128 paragraphs/book`, `14,549` rows, fresh backend, `shared_buffers=64MB`):

- heap rerank baseline: `0.068 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.066 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.082 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `0.322 ms`

So on the medium slice, the `zvec` path is stable but materially slower than the fused `sorted_heap` helper. The SQL-side buffer footprint is not the bottleneck there; the external ANN seed stage dominates the total latency.

On the larger real-text slice (`128 books x 256 paragraphs/book`, `58,954` rows), the result is currently **not publishable as a clean latency row**:

- the `sorted_heap` helper path remains stable:
  - `sorted_heap_expand_rerank(... relation=2)`: `0.070 ms`
  - `sorted_heap_graph_rag_scan(... relation=2)`: `0.084 ms`
- the `zvec` path fails during ANN seed retrieval at `ann_k=32`

The failure is not coming from PostgreSQL or from the GraphRAG SQL wrapper.
A pure `zvec`-only reproduction on the same `58,954`-row lexical-hash corpus shows the same failure mode: - for one probe query, `topk=8` and `topk=10` return valid document IDs - `topk>=16` returns empty `doc.id` values after: - `Failed to find target chunk for index 58379` The Gutenberg GraphRAG harness now turns that into an explicit benchmark error: - `RuntimeError: zvec returned unmapped doc ids (...)` So the objective conclusion today is narrower than for `pgvector`: - `zvec` does not currently provide a robust large-slice GraphRAG parity row on this real-text workflow at `ann_k=32` - on the medium slice where it does run, it is materially slower than the fused `sorted_heap` helper path - on the larger slice, the current blocker is `zvec` ANN seed instability, not PostgreSQL expansion/rerank overhead That instability is now isolated more sharply by the repo-owned reproducer: - [`scripts/repro_zvec_gutenberg_threshold.py`](../scripts/repro_zvec_gutenberg_threshold.py) Current threshold signature on the lexical-hash Gutenberg corpus: - `topk=16`, `dim=32` - `64x256`, `80x256`, `96x256`, `112x256` slices are stable - `28,661`, `36,064`, `43,684`, `51,166` rows - `128x256` fails - `58,954` rows - first bad probe: `query #10` - returned ids are empty strings after `Failed to find target chunk for index 58379` So the current failure signature is not just "large-ish GraphRAG benchmark". It looks more like a size-thresholded `zvec` retrieval bug on this corpus shape. That theory is now falsified by a second repo-owned reproducer on a plain synthetic FP32 corpus: - [`scripts/repro_zvec_synthetic_threshold.py`](../scripts/repro_zvec_synthetic_threshold.py) Current synthetic signature: - `dim=32`, `ef_search=64` - `topk=7` already reproduces the issue - a compact failing case exists at `4,950` rows - nearby controls: - `4,900` rows: ok - `4,950` rows: bad - `5,000` rows: bad - `topk<=6` is clean on the `4,950`-row case - failures are non-monotonic by row count - bad: `16,000`, `20,000`, `28,000`, `30,000`, `45,000`, `60,000` - ok: `24,000`, `29,000`, `75,000` (`100` probe queries still clean at `75k`) - another local non-monotonic pocket exists around `7k-8k` - `7,000`: ok - `7,500`: bad - `7,800`: ok - `7,900`: bad - representative stderr lines: - `Failed to find target chunk for index 4945` - `Failed to find target chunk for index 14999` - `Failed to find target chunk for index 29999` - `Failed to find target chunk for index 59999` So the stronger objective conclusion is: - the failure is not Gutenberg-specific - it is not a simple monotonic "too many rows" threshold either - the current evidence points to a broader `zvec` retrieval defect around forward-store / chunk lookup, not to PostgreSQL GraphRAG expansion logic For an upstream-ready summary of the current evidence, see: - [`docs/zvec-empty-id-bug.md`](./zvec-empty-id-bug.md) Two more diagnostic observations make that conclusion sharper: - when the synthetic bug triggers, the ANN scores still come back while `doc.id` is empty for the whole result set - `4,950 rows`, `topk=6`: valid ids - `4,950 rows`, `topk=7`: same score bands, but every `doc.id` is `''` - on a larger synthetic case (`16,000` rows), exact cosine inspection shows the best-score bucket spans `1000, 2000, ..., 16000`, and `zvec` already returns empty ids at `topk=5` That does not prove the internal root cause, but it strongly suggests the ANN ranking stage is still producing plausible scores while the forward-store document lookup stage is failing. 
A reasonable working hypothesis is that some tied-score / candidate-materialization paths touch unresolved high indexes and poison metadata resolution for the whole returned batch. ## Qdrant parity on the real-text graph The Gutenberg harness now also supports a comparable `Qdrant` path: - ANN seeds come from a local Qdrant HNSW collection built from the same fact rows - graph expansion and exact rerank still happen in PostgreSQL over `facts_heap` Unlike `zvec`, this path stayed stable on both the medium and larger real-text slices. The result is simpler: `64 books x 128 paragraphs/book` (`14,549` rows): - heap rerank baseline: `0.074 ms` - `sorted_heap_expand_rerank(... relation=2)`: `0.062 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.083 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.535 ms` `128 books x 256 paragraphs/book` (`58,954` rows): - heap rerank baseline: `0.081 ms` - `sorted_heap_expand_rerank(... relation=2)`: `0.083 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.085 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.769 ms` So on this GraphRAG workflow shape: - Qdrant is robust on the real-text benchmark - but its external ANN seed stage dominates end-to-end latency - the fused `sorted_heap` helper remains roughly an order of magnitude faster on the rerank path That again does **not** mean Qdrant is a bad vector engine in isolation. It means that when the workflow is "external ANN seed + relational graph expansion + exact rerank inside PostgreSQL", the narrow in-engine helper path is much better aligned with the total job than a remote vector service. ## Robustness rerun The same real-text Gutenberg harness was then rerun with a larger query set (`query_count=64`, `runs=3`) to check whether the earlier `16`-query results were just small-sample noise. The ranking stayed the same on both slices: - medium slice (`64 x 128`): - `sorted_heap_expand_rerank(... relation=2)`: `0.062 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.081 ms` - `pgvector ANN -> heap expansion -> exact rerank`: `0.219 ms` - `zvec ANN -> heap expansion -> exact rerank`: `0.342 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.567 ms` - larger slice (`128 x 256`): - `sorted_heap_expand_rerank(... relation=2)`: `0.067 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.088 ms` - `pgvector ANN -> heap expansion -> exact rerank`: `0.309 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.911 ms` - `zvec` remains excluded from this large-slice rerun because the previously observed `ann_k=32` instability is still the blocker So the current GraphRAG conclusion is no longer resting on one short probe set. At least on this real-text Gutenberg workflow, the fused `sorted_heap` helper still has the best end-to-end latency profile after the query set is expanded. ## Two-hop Gutenberg composition The next adversarial question was whether the current helper story survives a real **two-hop** workflow, not just the earlier "ANN seeds -> one filtered expansion -> rerank" shape. The initial Gutenberg falsifier first used a composed path from the existing narrow primitives: 1. ANN seeds from the fact table 2. first hop via `sorted_heap_expand_ids(..., relation=2)` 3. second hop via `sorted_heap_expand_rerank(..., relation=2)` That composition benchmark was intentionally a harsher test than the earlier one-hop helper story, because it asked whether the current primitives were already enough to make multi-hop GraphRAG plausible before inventing a dedicated two-hop helper. 
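In sketch form, that composed path looks like this (illustrative only; `$1` is the query embedding, relation `2` is the Gutenberg `next` edge, and the harness's exact SQL may differ):

```sql
WITH seeds AS (
    -- 1. ANN seeds from the fact table
    SELECT target_id
    FROM facts_sh
    ORDER BY embedding <=> $1::svec
    LIMIT 32
),
hop1 AS (
    -- 2. first hop through the narrow expansion helper
    SELECT target_id
    FROM sorted_heap_expand_ids(
        'facts_sh'::regclass,
        ARRAY(SELECT target_id FROM seeds),
        2)                                   -- relation filter: next
)
-- 3. second hop fused with the exact top-K rerank
SELECT *
FROM sorted_heap_expand_rerank(
    'facts_sh'::regclass,
    ARRAY(SELECT target_id FROM hop1),
    $1::svec,
    10,                                      -- top_k
    2);                                      -- relation filter: next
```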
The answer was "yes, barely enough". That justified one narrow extra helper, not a new storage engine: ```sql sorted_heap_expand_twohop_rerank(...) ``` This fused helper keeps the same contract shape as the earlier rerank helper, but removes the intermediate SQL/materialization boundary between hop1 and the second-hop rerank. On the medium real-text slice (`64 books x 128 paragraphs/book`, `14,549` rows, `32D`, `query_count=64`, `runs=3`, fresh backend, `shared_buffers=64MB`): - heap baseline, `seed_expand2_rerank_rel_in`: `0.102 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.136 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.105 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.081 ms` So on the medium slice, the dedicated helper now does what the composed path only hinted at: - it beats heap+btree on latency - it materially beats the composed two-hop helper path - it also cuts shared-buffer hits strongly (`421` vs `1298` for the heap baseline, and `421` vs `662` for the composed helper) On the larger real-text slice (`128 books x 256 paragraphs/book`, `58,954` rows, same settings except the larger corpus): - heap baseline, `seed_expand2_rerank_rel_in`: `0.114 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.153 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.111 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.092 ms` So the larger slice confirms the same shape: the dedicated two-hop helper is not a tiny micro-win on one probe set; it keeps the lead over both heap+btree and the composed helper. The same medium two-hop slice was also benchmarked against the external ANN seed paths: - `pgvector ANN -> heap 2-hop expansion -> exact rerank`: `0.253 ms` - `zvec ANN -> heap 2-hop expansion -> exact rerank`: `0.374 ms` - `Qdrant ANN -> heap 2-hop expansion -> exact rerank`: `1.789 ms` So the product-level conclusion stays consistent in the two-hop case as well: the narrow in-engine `sorted_heap` helper remains the fastest end-to-end GraphRAG path among the tested competitors on this real-text slice. At higher exact-rerank dimension, the advantage narrows again rather than disappearing: `64 books x 128 paragraphs/book`, `384D`, `query_count=64`, `runs=3`: - heap baseline, `seed_expand2_rerank_rel_in`: `0.225 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.266 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.258 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.236 ms` Interpretation: - the dedicated helper makes two-hop GraphRAG clearly viable on the real-text Gutenberg path - the latency win is still not universal; at higher dimensions it narrows toward parity with heap+btree - but the locality signal remains stronger than latency alone suggests (`1264` shared hits for the fused helper vs `3155` for the heap baseline on the `384D` medium run) So the correct next inference is narrower than "we need a graph storage engine" and also narrower than "we need a broad graph query layer": > a dedicated but still narrow two-hop helper is justified; anything broader > should now be treated as product/API design, not as a prerequisite for making > two-hop GraphRAG fast enough to matter. ## Higher-dimension rerun The same medium Gutenberg slice (`64 books x 128 paragraphs/book`) was then rerun at higher lexical-hash embedding dimensions to test whether the earlier result depended too heavily on the cheap `32D` setting. 
At `128D` (`query_count=64`, `runs=3`):

- heap rerank baseline: `0.107 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.090 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.097 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.386 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `0.518 ms`
- `Qdrant ANN -> heap expansion -> exact rerank`: `1.732 ms`

At `384D` on the same slice:

- heap rerank baseline: `0.185 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.186 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.203 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.815 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `1.101 ms`
- `Qdrant ANN -> heap expansion -> exact rerank`: `2.275 ms`

This changes the interpretation in one important way:

- the `sorted_heap` helper remains clearly best-aligned with the full GraphRAG workflow versus the external ANN paths
- but the win over the pure heap rerank baseline is **dimension-sensitive**
- by `384D`, exact rerank cost dominates enough that the fused helper is only at parity with heap+btree rather than clearly ahead

So the current evidence supports a narrower claim than "sorted_heap always wins GraphRAG":

> the fused `sorted_heap` helper is the best end-to-end path among the tested
> in-PG and external ANN competitors on this workflow shape, but its advantage
> over heap+btree narrows substantially as exact rerank dimension grows

One more tuning falsifier was useful here:

- dropping `ann_k` from `32` to `24` on the `384D` medium slice does reduce latency
- but it is **not** a free operating-point improvement
- a direct result-set comparison for `sorted_heap_graph_rag_scan(...)` on the `64`-query probe set showed mismatches on `62/64` queries versus `ann_k=32`

So the current faster-than-`ann_k=32` settings should be treated as a quality/latency tradeoff, not as a no-regression default recommendation.

One important measurement caveat was also discovered and fixed during this work:

- direct filtered `ORDER BY embedding <=> $query LIMIT K` on a base table with a `sorted_hnsw` index is **not** a valid GraphRAG baseline for current Phase 1 semantics
- the automatic `sorted_hnsw` path is now explicitly costed out when extra base-relation quals are present
- GraphRAG rerank baselines must therefore materialize the expanded set first, then rerank it

This is enough to falsify the pessimistic branch:

> the next useful GraphRAG step is not necessarily a new storage engine; a
> carefully scoped C primitive can already recover a substantial part of the
> lost latency

## Recommended roadmap

### Phase 0 — completed

- Build local prototype benchmark
- Falsify naive SQL assumptions

### Phase 1 — current

`sorted_heap_expand_ids()` is implemented and regression-covered.

### Phase 2 — current

`sorted_heap_expand_rerank()` is implemented and regression-covered.

Current success criterion that was met:

- beats the current `sorted_heap` SQL `seed_expand_in` / `seed_expand_rerank_in` patterns at medium scale

Current gap that remains:

- pure heap+btree expansion is still faster on this synthetic benchmark

### Phase 3 — current

The GraphRAG composition query exists and is exercised by the benchmark (`seed_expand_rerank_topk_fn`):

- ANN seed in SQL via `sorted_hnsw`
- expansion via `sorted_heap_expand_ids()`
- rerank via `sorted_heap_expand_rerank()` or SQL over materialized expansion

### Phase 4 — current

`sorted_heap_graph_rag_scan()` is now implemented as the narrow one-call composition wrapper.
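A minimal usage sketch of that wrapper, using the documented signature (`$1` is the query embedding; the relation filter is optional):

```sql
-- ANN seeding, expansion, and fused top-K rerank in one call.
SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_graph_rag_scan(
    'facts_sh'::regclass,
    $1::svec,
    32,    -- ann_k: ANN seed count
    10,    -- top_k: reranked rows returned
    2);    -- relation_filter (e.g. the Gutenberg "next" relation)
```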
### Phase 5 — current `sorted_heap_expand_twohop_rerank()` is now implemented as the narrow fused two-hop helper. Current success criterion that was met: - beats the previous composed two-hop helper on the real-text Gutenberg graph - beats heap+btree on the medium and larger `32D` two-hop slices Current gap that remains: - at `384D`, the fused two-hop helper narrows to near-parity with heap+btree rather than keeping a clear lead ### Phase 6 — next Only if the current two-hop and one-call wrappers still leave meaningful headroom: - consider a broader wrapper for: - ANN seed IDs - two-hop expansion - rerank - or tune candidate count / rerank workload rather than broadening the API ## Cogniformerus-style multihop facts The real missing falsifier was not another paragraph graph slice. It was a benchmark that matches the current `cogniformerus` multihop question shape: - fact `1`: `person -> parent` - fact `2`: `parent -> city` - query: `Where does the parent of Person_i live?` That now exists in: - [`scripts/bench_graph_rag_multihop.py`](../scripts/bench_graph_rag_multihop.py) - [`scripts/sweep_graph_rag_multihop.py`](../scripts/sweep_graph_rag_multihop.py) The benchmark builds a deterministic fact graph and measures: - latency - `hit@1` - `hit@k` for the expected final `city` fact after two-hop expansion and rerank. ### Important contract discovery This benchmark immediately exposed a semantic limitation in the current convenience wrapper: - `sorted_heap_graph_rag_scan()` seeds expansion from ANN `target_id` - that is a good fit for the Gutenberg `paragraph -> next_paragraph` graph - it is **not** the right seed contract for the fact benchmark above - the fact benchmark needs ANN seeds based on `entity_id`, then: - hop 1 on relation `1` - hop 2 on relation `2` So the current one-call wrapper is still too specialized for this workload shape. The lower-level helper family is fine; the wrapper contract is the narrow part. That gap is now closed by: - `sorted_heap_graph_rag_twohop_scan(...)` This wrapper keeps the fact-shaped contract narrow: - ANN seed on `entity_id` - hop 1 relation filter - hop 2 relation filter - final rerank delegated to `sorted_heap_expand_twohop_rerank(...)` ### Early failure that mattered At `32D`, the fact benchmark initially produced very poor answer retrieval. That was a benchmark-quality failure, not a helper failure: - the first draft seeded on `target_id`, which was the wrong graph contract - after fixing that, the deterministic query embedding was still too weak at low dimension to make the question reliably retrievable So the publishable multihop results start at `384D`, where the question shape becomes stable enough that latency numbers mean something. 
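For reference, the fact-shaped contract described above seeds on `entity_id` and then runs both hops through the documented fused helper; the new wrapper internalizes roughly this composition. An illustrative sketch (`$1` is the query embedding):

```sql
WITH seeds AS (
    -- ANN seeds on entity_id, not target_id
    SELECT entity_id
    FROM facts_sh
    ORDER BY embedding <=> $1::svec
    LIMIT 64                                  -- ann_k
)
SELECT *
FROM sorted_heap_expand_twohop_rerank(
    'facts_sh'::regclass,
    ARRAY(SELECT entity_id FROM seeds),
    $1::svec,
    10,    -- top_k
    1,     -- hop 1: person -> parent
    2);    -- hop 2: parent -> city
```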
### Tuned 384D result On `5K` multihop chains (`10K` rows total), `64` queries, `3` runs, `shared_buffers=64MB`, fresh backend, with: - `ann_k=64` - `sorted_hnsw.ef_search=64` - `ef_construction=200` the current frontier is: - heap composed two-hop SQL - `0.515 ms` - `hit@1 = 71.9%` - `hit@k = 85.9%` - `sorted_heap` composed two-hop helper - `0.471 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `sorted_heap_expand_twohop_rerank()` - `0.442 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `sorted_heap_graph_rag_twohop_scan()` - `0.417 ms` - `hit@1 = 71.9%` - `hit@k = 84.4%` - pgvector - `1.397 ms` - `hit@1 = 70.3%` - `hit@k = 87.5%` - zvec - `1.076 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.921 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` Interpretation: - the fused two-hop helper is now the **fastest PostgreSQL path** on this fact-shaped workload - the new fact-shaped one-call wrapper stays effectively at parity with the fused helper, so this time the convenience API does **not** erase the win - it remains materially faster than pgvector on the same workflow - it is **not** the quality leader at this operating point - zvec and Qdrant still win on answer retrieval quality here, but at much higher latency ### Seed frontier after the wrapper fix The next honest question was not API shape but ANN seed quality. That is now measured directly by: - [`scripts/sweep_graph_rag_multihop.py`](../scripts/sweep_graph_rag_multihop.py) This harness keeps the corpus fixed per `ef_construction` and sweeps: - `m` - `ann_k` - `sorted_hnsw.ef_search` - `ef_construction` without paying a full temp-cluster and schema rewrite for every single probe point. On the same `5K` chains / `10K` rows / `384D` / `64` queries / fresh-backend benchmark, the stable wrapper frontier is now: - `ef_construction=64`, `ann_k=64`, `ef_search=64` - `0.386 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `ef_construction=200`, `ann_k=64`, `ef_search=64` - `0.393 ms` - `hit@1 = 71.9%` - `hit@k = 84.4%` - `ef_construction=400`, `ann_k=64`, `ef_search=64` - `0.421 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - `ef_construction=200`, `ann_k=64`, `ef_search=128` - `0.651 ms` - `hit@1 = 73.4%` - `hit@k = 95.3%` - `ef_construction=400`, `ann_k=64`, `ef_search=128` - `0.663 ms` - `hit@1 = 75.0%` - `hit@k = 95.3%` For a higher-quality but much slower seed tier: - `ann_k=96`, `ef_search=64` lands around `2.2-2.4 ms` with `hit@k = 96.9%` That leads to a narrower, more honest recommendation: - if latency is the hard constraint, keep the fast tier near `ef_construction=200`, `ann_k=64`, `ef_search=64` - if answer quality matters more, the best balanced point we measured is `ef_construction=200`, `ann_k=64`, `ef_search=128` - `ef_construction=400` does improve `hit@1` slightly at the same `95.3%` `hit@k`, but it does not improve `hit@k` over `200`, so it should not be the default recommendation without a separate build-cost justification That build-cost justification now exists too on this exact `10K x 384D` multihop benchmark: - `ef_construction=64`: `43.716 s` to build both ANN indexes - `ef_construction=200`: `80.046 s` - `ef_construction=400`: `91.352 s` So the current recommendation is: - default to `ef_construction=200` - treat `ef_construction=400` as a niche `hit@1` knob, not the new default ### `m` frontier on the same multihop benchmark The next useful falsifier was whether graph degree buys more than another `ef_construction` increase. 
Keeping: - `ef_construction=200` - `ann_k=64` - `64` queries - `3` runs - fresh backend the `m` sweep came out as: - `m=16`, `ef_search=64` - `0.405 ms` - `hit@1 = 71.9%` - `hit@k = 87.5%` - `m=24`, `ef_search=64` - `0.466 ms` - `hit@1 = 75.0%` - `hit@k = 93.8%` - `m=32`, `ef_search=64` - `0.491 ms` - `hit@1 = 78.1%` - `hit@k = 93.8%` - `m=16`, `ef_search=128` - `0.672 ms` - `hit@1 = 73.4%` - `hit@k = 95.3%` - `m=24`, `ef_search=128` - `0.738 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `m=32`, `ef_search=128` - `0.771 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` The one-off build-cost probe for the same `10K x 384D` graph was: - `m=16`, `ef_construction=200`: `79.425 s` - `m=24`, `ef_construction=200`: `86.562 s` - `m=32`, `ef_construction=200`: `75.404 s` That last `m=32` build number should be treated cautiously; it was a single one-off probe and is likely noisy enough that only the query-time frontier is trustworthy here. The stable conclusion is still clear: - `m=24` is the best current quality-per-latency tradeoff we measured - `m=32` buys a little more `hit@1`, but no additional `hit@k` - so for fact-shaped multihop GraphRAG, the best current balanced point is: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` One more ann_k falsifier matters here too: - increasing `ann_k` above `64` at this `m=24 / ef_construction=200 / ef_search=128` point did **not** help - `ann_k=80/96/128` all increased latency and reduced `hit@k` - so `ann_k=64` remains the current sweet spot, not just a legacy default ### Full parity rerun at the balanced point Re-running the full multihop parity benchmark on that exact setting: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` - `64` queries - `3` runs - `384D` produced: - heap two-hop SQL - `0.762 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.726 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_graph_rag_twohop_scan()` - `0.727 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - pgvector - `1.244 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - zvec - `0.927 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.417 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` That is a materially stronger result than the earlier `m=16` baseline: - the fused `sorted_heap` path now matches `zvec` and `Qdrant` on `hit@k` - it stays faster than both external paths - it also beats pgvector on both latency and answer quality on this workload - `zvec` and `Qdrant` still keep a small `hit@1` edge, so the answer-quality story is now about `hit@1`, not `hit@k` ### Full parity rerun at the higher-quality point The next question was whether that remaining `hit@1` gap could be closed without giving back the latency lead. 
Re-running the same full parity benchmark at: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` produced: - heap two-hop SQL - `0.810 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.774 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - `sorted_heap_graph_rag_twohop_scan()` - `0.786 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - pgvector - `1.220 ms` - `hit@1 = 70.3%` - `hit@k = 84.4%` - zvec - `0.874 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.487 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` So the current picture is now more precise: - `m=24` is still the better quality-per-latency recommendation - `m=32` is the point where `sorted_heap` reaches full observed parity with `zvec` and Qdrant on both `hit@1` and `hit@k` - even at that higher-quality point, the `sorted_heap` helper remains faster than both external paths - pgvector remains behind on both latency and answer quality on this workload ### AWS ARM64 parity rerun (`5K` chains) The next environment-variance adversary check was to rerun the same `5K`-chain / `10K`-row / `384D` fact benchmark on an AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) using the repo-owned wrapper: - [`scripts/bench_graph_rag_multihop_aws.sh`](../scripts/bench_graph_rag_multihop_aws.sh) At the previously recommended local balanced point: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` - `64` queries - `3` runs - fresh backend the AWS rerun produced: - heap two-hop SQL - `1.087 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.947 ms` - `hit@1 = 76.6%` - `hit@k = 98.4%` - `sorted_heap_graph_rag_twohop_scan()` - `1.004 ms` - `hit@1 = 76.6%` - `hit@k = 98.4%` - pgvector - `1.296 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - zvec - `1.646 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `3.396 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` That is stronger than the local balanced point in one important way: - on this AWS rerun, `sorted_heap` does not just match `zvec` and Qdrant on `hit@k`; it exceeds them (`98.4%` vs `96.9%`) while staying faster than both But the second half of the adversary check matters too. Re-running the same AWS benchmark at the local higher-quality point: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` produced: - `sorted_heap_graph_rag_twohop_scan()` - `1.066 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` So the local `m=32` parity story does **not** carry over unchanged to this AWS ARM64 environment. The portable conclusion is therefore narrower: - `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` is the current best verified cross-environment point - local and AWS frontiers are directionally consistent, but not numerically identical - this is exactly why the AWS rerun is worth keeping as a separate falsifier, not merging blindly into the local tuning story ### Larger local scale check (`10K` chains) The next adversary check was whether the `5K`-chain tuning carried forward to a larger local fact graph without retuning. On `10K` chains (`20K` rows total), `64` queries, `384D`, fresh backend: - `m=24`, `ef_construction=200`, `ann_k=64`, `ef_search=128` - `sorted_heap_graph_rag_twohop_scan()` -> `0.885 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `m=32`, `ef_construction=200`, `ann_k=64`, `ef_search=128` - `sorted_heap_graph_rag_twohop_scan()` -> `0.972 ms` - `hit@1 = 73.4%` - `hit@k = 93.8%` So the `5K`-chain operating point does **not** generalize unchanged. 
The next narrow falsifier was whether this larger-graph drop was just a search beam issue. Sweeping `ef_search` upward at `m=32` gave: - `ef_search=192` - `1.310 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `ef_search=256` - `1.734 ms` - `hit@1 = 78.1%` - `hit@k = 95.3%` That is a useful but incomplete recovery: - higher `ef_search` does recover part of the quality loss - it does **not** recover the earlier `96.9% hit@k` local point - so the larger-graph gap is not purely a beam-width problem The next falsifier after that was stronger graph construction. On the same `10K`-chain graph, keeping `m=32`, `ann_k=64`, and comparing `ef_construction=200` vs `400` gave: - at `ef_search=128` - `ef_construction=200` -> `0.976 ms`, `hit@1 = 75.0%`, `hit@k = 93.8%` - `ef_construction=400` -> `1.094 ms`, `hit@1 = 75.0%`, `hit@k = 93.8%` - at `ef_search=192` - `ef_construction=200` -> `1.357 ms`, `hit@1 = 76.6%`, `hit@k = 95.3%` - `ef_construction=400` -> `1.381 ms`, `hit@1 = 76.6%`, `hit@k = 95.3%` So this larger-graph gap is not fixed by a simple `ef_construction=400` bump either. The current best explanation is therefore narrower: - the verified `5K`-chain local frontier is real - the same operating points do not carry forward unchanged to `10K` chains - and the obvious local rescue knobs (`ef_search`, `ef_construction`) only recover part of the drop That is enough to stop local knob-turning for this pass. The next useful step would be a different class of experiment, not more of the same sweep. The next adversary check after that was whether this larger-graph caveat was just a local-machine artifact. Re-running the `10K`-chain benchmark on the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) showed that it is not. At the same balanced portable point: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` the AWS rerun produced: - heap two-hop SQL - `1.389 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `sorted_heap_expand_twohop_rerank()` - `1.190 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `sorted_heap_graph_rag_twohop_scan()` - `1.248 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` That essentially matches the larger local result. So the `10K`-chain drop is cross-environment robust, not just a local Apple/M-series artifact. The one meaningful local rescue point transferred cleanly to AWS too. Re-running the `10K`-chain benchmark at: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=192` produced: - heap two-hop SQL - `1.896 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `sorted_heap_expand_twohop_rerank()` - `1.617 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `sorted_heap_graph_rag_twohop_scan()` - `1.687 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` So the larger-scale picture is now materially stronger: - the `10K`-chain quality drop is cross-environment robust - the best current larger-graph recovery point is also cross-environment robust: `m=32 / ef_search=192` - but even that recovery point does **not** restore the earlier `5K`-chain `98.4% hit@k` AWS frontier - so the remaining gap is unlikely to be solved by another trivial `ef_search` or `m` tweak alone ### Exact-seed upper-bound diagnostic The next root-cause check was to remove ANN approximation from the seed stage entirely. The multihop harness now supports an `--exact-seed-diagnostics` mode, which replaces ANN seed retrieval with exact brute-force top-K seeds on `facts_heap`, then reuses the same graph expansion/rerank path. 
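Conceptually, the exact-seed stage is just a brute-force scan. A sketch, not the harness's literal mechanism; note this is only exact when the planner does not route the `ORDER BY` through an ANN index:

```sql
-- Exact top-K seed upper bound: rank facts_heap rows by true distance.
SELECT entity_id
FROM facts_heap
ORDER BY embedding <=> $1::svec
LIMIT 64;    -- same budget as ann_k
```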
This matters because it separates two very different explanations:

- "the remaining gap is caused by approximate ANN seeds"
- "the remaining gap is already in the benchmark/query/task shape"

On the `5K`-chain balanced local point:

- `m=24`
- `ef_construction=200`
- `ann_k=64`
- `sorted_hnsw.ef_search=128`

the exact-seed diagnostic did **not** improve quality:

- ANN-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.702 ms`
  - `hit@1 = 75.0%`
  - `hit@k = 96.9%`
- exact-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.811 ms`
  - `hit@1 = 75.0%`
  - `hit@k = 96.9%`

And the seed-stage diagnostic showed no hidden ANN loss there either:

- ANN seeds
  - `seed_person_pct = 98.4%`
  - `expanded_city_pct = 98.4%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 6`
  - `city_rank_max = 17`
- exact seeds
  - `seed_person_pct = 98.4%`
  - `expanded_city_pct = 98.4%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 6`
  - `city_rank_max = 17`

So even at `5K`, the final `96.9% hit@k` is already below seed coverage. But the rerank distribution is still concentrated: the correct city stays within the top 6 for 95% of reachable queries, and the miss comes from a small number of sharper outliers.

On the `10K`-chain balanced local point:

- `m=24`
- `ef_construction=200`
- `ann_k=64`
- `sorted_hnsw.ef_search=128`

the exact-seed diagnostic again did **not** improve quality:

- ANN-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.839 ms`
  - `hit@1 = 71.9%`
  - `hit@k = 92.2%`
- exact-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.947 ms`
  - `hit@1 = 71.9%`
  - `hit@k = 92.2%`

The seed-stage diagnostic was even more revealing on `10K`:

- ANN seeds
  - `seed_person_pct = 96.9%`
  - `expanded_city_pct = 96.9%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 3`
  - `city_rank_max = 20`
- exact seeds
  - `seed_person_pct = 96.9%`
  - `expanded_city_pct = 96.9%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 3`
  - `city_rank_max = 19`

So the larger-graph gap is **not** coming from missing the correct seed fact. At `10K`, seed coverage stays at `96.9%`, but final `hit@k` drops to `92.2%`. And it is not a broad rerank collapse either: for 95% of reachable queries the correct city still ranks in the top 3, but a few outliers fall as far as rank `19-20`, which is enough to miss `top_k = 10`.

This is a strong falsifier:

- on this synthetic fact benchmark, the current `5K` and `10K` frontiers are **not** ANN-approximation limited at the tested operating points
- ANN and exact seeds have identical seed coverage on both scales
- the remaining gap is mostly an outlier-ranking problem, not a broad seed or rerank failure
- exact seeds cost extra latency but do not recover answer quality
- so the next meaningful gain is unlikely to come from more seed-ANN tuning alone

The remaining gap now looks more like a property of the task construction, query embedding, or graph benchmark semantics than of `sorted_hnsw` approximation itself. More specifically: the dominant remaining loss now looks downstream of seed retrieval, not inside it, and it is concentrated in a small set of bad cases rather than a general degradation across the query set.

So the honest story on this fact benchmark is a latency/quality frontier:

- `sorted_heap_expand_twohop_rerank()` leads on latency
- `zvec` and Qdrant lead on answer quality at this operating point, at materially higher latency

### Path-aware rerank diagnostic

The next falsifier was to keep the same ANN seeds and the same two-hop expansion, but change only the final scorer. The current multihop helper reranks on the hop-2 city fact embedding alone.
A path-aware SQL baseline was added to the harness that scores each candidate as: - `path_distance = (hop1_embedding <=> query) + (hop2_embedding <=> query)` That simple change materially improved answer quality on the same balanced points: - `5K` chains, `m=24`, `ef_construction=200`, `ann_k=64`, `sorted_hnsw.ef_search=128` - city-only `sorted_heap_graph_rag_twohop_scan()` - `0.762 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - path-aware SQL rerank on `facts_sh` - `0.957 ms` - `hit@1 = 98.4%` - `hit@k = 98.4%` - `10K` chains, same knobs - city-only `sorted_heap_graph_rag_twohop_scan()` - `0.937 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - path-aware SQL rerank on `facts_sh` - `1.179 ms` - `hit@1 = 95.3%` - `hit@k = 96.9%` This is the strongest current architectural signal on the fact-shaped benchmark: - the remaining quality gap is not well explained by seed recall - it is also not well explained by broad rerank collapse - a simple path-aware scorer recovers most of the lost quality with only a modest latency increase That branch is now implemented locally too: - `sorted_heap_expand_twohop_path_rerank(...)` - `sorted_heap_graph_rag_twohop_path_scan(...)` And the fused helper beats the SQL path-aware baseline on the same balanced points: - `5K` chains - SQL path-aware baseline: `0.847 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - fused helper: `0.726 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - one-call wrapper: `0.739 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - `10K` chains - SQL path-aware baseline: `0.942 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` - fused helper: `0.823 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` - one-call wrapper: `0.834 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` So for multihop fact retrieval, the next serious question is no longer whether path-aware rerank helps. It does. The next question is whether this new helper/wrapper transfers cleanly to AWS and then to a real `cogniformerus`-like corpus. That AWS transfer is now verified too. On AWS ARM64 (`4 vCPU`, `8 GiB RAM`), at the same balanced `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` point: - `5K` chains - heap two-hop SQL: `1.088 ms`, `hit@1 = 75.0%`, `hit@k = 96.9%` - city-only wrapper: `1.012 ms`, `hit@1 = 75.0%`, `hit@k = 96.9%` - SQL path-aware baseline: `1.204 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - fused helper: `0.955 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - one-call path-aware wrapper: `1.018 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - pgvector + heap expansion, same path-aware scorer: `1.422 ms`, `hit@1 = 85.9%`, `hit@k = 85.9%` - zvec + heap expansion, same path-aware scorer: `1.720 ms`, `hit@1 = 100.0%`, `hit@k = 100.0%` - Qdrant + heap expansion, same path-aware scorer: `3.435 ms`, `hit@1 = 100.0%`, `hit@k = 100.0%` - `10K` chains, same knobs - heap two-hop SQL: `1.319 ms`, `hit@1 = 71.9%`, `hit@k = 92.2%` - city-only wrapper: `1.197 ms`, `hit@1 = 73.4%`, `hit@k = 93.8%` - SQL path-aware baseline: `1.436 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` - fused helper: `1.185 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` - one-call path-aware wrapper: `1.212 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` So the answer to the transfer question is now yes: the path-aware helper and wrapper survive the AWS move cleanly, and the old larger-scale caveat narrows substantially once the rerank contract is fixed. This also closes the earlier apples-to-apples gap. 
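For reference, the path-aware contract scores a candidate by summing both hop distances. A minimal SQL sketch over the fact schema above, assuming `$1` is the query embedding and `$2` is a hypothetical array of ANN-seeded entity IDs:

```sql
WITH hop1 AS (
    SELECT *
    FROM facts_sh
    WHERE relation_id = 1                      -- person -> parent
      AND entity_id = ANY ($2::int4[])
),
hop2 AS (
    SELECT f.*, h1.embedding AS hop1_embedding
    FROM facts_sh f
    JOIN hop1 h1 ON f.entity_id = h1.target_id
    WHERE f.relation_id = 2                    -- parent -> city
)
SELECT entity_id, relation_id, target_id, payload,
       (hop1_embedding <=> $1::svec)
     + (embedding <=> $1::svec) AS path_distance
FROM hop2
ORDER BY path_distance
LIMIT 10;                                      -- top_k
```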
Once all engines are scored under the same path-aware contract:

- `sorted_heap` is the latency leader
- `zvec` and Qdrant hold the strongest observed answer quality
- `pgvector` remains behind on both latency and quality at this operating point

One AWS all-engines rerun briefly dropped the `sorted_heap` path-aware rows to `96.9% / 96.9%`, but an immediate `sorted_heap`-only control and a second full rerun both returned `98.4% / 98.4%`. So the portable parity story now has one verified outlier plus two confirming reruns. That was enough to justify the benchmark note, and it directly motivated the repeated-build protocol recorded below.

## Repeated-build local variance

The repeated-build variance protocol is repo-owned:

- [`scripts/repeat_graph_rag_multihop_builds.py`](../scripts/repeat_graph_rag_multihop_builds.py)
- [`scripts/repeat_graph_rag_multihop_builds_aws.sh`](../scripts/repeat_graph_rag_multihop_builds_aws.sh)

The local driver wraps [`scripts/bench_graph_rag_multihop.py`](../scripts/bench_graph_rag_multihop.py) so each repeat gets a fresh temp cluster and a fresh HNSW build, then reports median / min / max for selected rows.

On the balanced local `5K` point (`m=24 / ef_construction=200 / ann_k=64 / ef_search=128`), three independent rebuilds produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `0.798`, range `0.771-0.819`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `0.796`, range `0.778-0.804`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.405`, range `1.318-1.456`
  - `hit@1/hit@k`: `85.9-89.1%`
- `zvec` path-aware parity row
  - `p50_ms`: median `1.076`, range `1.053-1.087`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `2.799`, range `2.792-2.805`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds

So the balanced local path-aware `sorted_heap` point is not just a lucky single build. The answer quality stayed fixed across rebuilds, and the latency spread was narrow.
The remaining variance story now looks more like:

- local balanced `sorted_heap`: stable across rebuilds
- AWS balanced `sorted_heap`: also stable across repeated builds on the `5K` point, with one earlier outlier now downgraded to an anomaly
- `pgvector`: measurable quality drift across local rebuilds
- `zvec` / `Qdrant`: stable on this deterministic local fact graph

The AWS repeated-build protocol on the balanced `5K` point produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `0.962`, range `0.956-0.965`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `1.025`, range `1.018-1.043`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.434`, range `1.370-1.493`
  - `hit@1/hit@k`: `84.4-89.1%`
- `zvec` path-aware parity row
  - `p50_ms`: median `1.711`, range `1.703-1.768`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `3.355`, range `3.302-3.465`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds

So the current confidence picture is stronger than before:

- local balanced `5K`: repeated-build stable
- AWS balanced `5K`: repeated-build stable
- larger `10K` AWS path-aware rows: repeated-build stable too, but at a lower quality frontier than `5K`

The AWS repeated-build protocol on the larger `10K` point produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `1.177`, range `1.148-1.191`
  - `hit@1 = 95.3%`, `hit@k = 96.9%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `1.236`, range `1.211-1.240`
  - `hit@1 = 95.3%`, `hit@k = 96.9%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.667`, range `1.665-1.676`
  - `hit@1/hit@k`: `76.6-82.8%`
- `zvec` path-aware parity row
  - `p50_ms`: median `2.788`, range `2.762-2.789`
  - `hit@1 = 98.4%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `3.818`, range `3.788-3.846`
  - `hit@1 = 98.4%`, `hit@k = 100.0%` on all three builds

This sharpens the conclusion again:

- the `10K` AWS point is no longer a variance question
- it is a real scale frontier
- `sorted_heap` remains the latency leader there
- `zvec` and Qdrant still lead on answer quality

This also falsifies one tempting but wrong simplification:

> once the helper is fast, the remaining GraphRAG problem is solved

Not quite. On fact-shaped multihop queries, seed ANN quality and graph build quality still matter enough that `ann_k`, `ef_search`, and the HNSW build parameters (`m`, `ef_construction`) remain first-class tuning knobs. But the old hop-2-only rerank contract was a separate, larger problem, and the new path-aware helper fixes most of it on the current local benchmark.

## Current verdict

`sorted_heap` already has a plausible GraphRAG foundation, and the new helper proves that a narrow C primitive can materially improve the GraphRAG path.
What is now true:

- SQL-only GraphRAG composition was not enough
- `sorted_heap_expand_ids()` is enough to recover a large part of that gap
- `sorted_heap_expand_rerank()` recovers most of the rerank overhead on the current `sorted_heap` path
- `sorted_heap_graph_rag_scan()` makes the composition available as a single SQL call without giving back much latency
- `sorted_heap_expand_twohop_rerank()` turns the earlier two-hop composition evidence into a real latency win on the real-text Gutenberg slices we tested
- on the cogniformerus-style `person -> parent -> city` benchmark, the fused two-hop helper is the fastest PostgreSQL path we tested
- `sorted_heap_graph_rag_twohop_scan()` closes the current fact-shaped wrapper gap without materially giving back latency
- `sorted_heap_expand_twohop_path_rerank()` upgrades the fact-shaped rerank contract to use hop-1 and hop-2 evidence together
- `sorted_heap_graph_rag_twohop_path_scan()` makes that path-aware contract available as a single-call primitive
- the path-aware helper and wrapper transfer cleanly from local to AWS ARM64 on the same balanced `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` point
- the narrow-helper direction is a justified building block
- the current helper model already composes into a competitive two-hop real-text GraphRAG path on Gutenberg without requiring a new graph API
- on the real-text GraphRAG shape, `pgvector` parity is already materially worse end-to-end than the fused `sorted_heap` helper path
- on the fact-shaped AWS path-aware benchmark, `sorted_heap` is now the fastest verified end-to-end path, while `zvec` and Qdrant remain the answer quality leaders
- `zvec` is stable on the medium slice but currently not robust on the larger real-text slice at `ann_k=32`
- `Qdrant` is robust on both real-text slices but materially slower than the fused `sorted_heap` helper on the same workflow

What is not yet true:

- `sorted_heap` is not yet clearly better than heap+btree on pure expansion latency for this synthetic workload
- even the relation-filtered GraphRAG path still trails heap+btree slightly on this synthetic benchmark
- two-hop helper composition is not yet a universal latency win; at higher rerank dimensions it narrows to parity with heap+btree rather than staying clearly ahead
- transfer to a larger real `cogniformerus` corpus is still unverified: the current fact-shaped benchmark suite is deterministic and synthetic, so even though it matches the intended multihop query shape, the remaining generalization gap is about workload realism more than about build variance

## Actual Butler gate seed-corpus smoke

The next honest step after the synthetic-chain work was to stop guessing and run the path-aware GraphRAG helpers on the actual tiny multihop corpus that `cogniformerus` already ships in its Butler gate smoke:

- source: `cogniformerus/bin/butler_small_model_eval.cr`
- repo-owned fixture: [`scripts/fixtures/graph_rag_butler_gate_seed.json`](../scripts/fixtures/graph_rag_butler_gate_seed.json)
- harness: [`scripts/bench_graph_rag_butler_gate.py`](../scripts/bench_graph_rag_butler_gate.py)

This fixture is intentionally tiny:

- `7` graph facts loaded into `facts_heap` / `facts_sh`
- `2` positive multihop queries
  - `Project Atlas -> Orion -> Helsinki`
  - `Release 13 -> Aurora -> April`

So it is not a publishable latency frontier.
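Shape-wise, though, each positive query walks exactly the two-hop pattern the helpers fuse; a minimal sketch of that raw path, with placeholder parameters only:

```sql
-- Hedged sketch of the raw two-hop path the fixture exercises: hop 1 from
-- the seed entities, hop 2 from each hop-1 target. :seed_ids and the final
-- ordering are placeholders, not fixture values.
SELECT h1.target_id AS hop1, h2.target_id AS hop2, h2.payload
FROM facts_sh AS h1
JOIN facts_sh AS h2
  ON h2.entity_id = h1.target_id
WHERE h1.entity_id = ANY(:seed_ids)
ORDER BY h2.embedding <=> :query_vec
LIMIT 4;
```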
Its job is narrower: - verify that the current path-aware helper and wrapper work on the real Butler gate fact texts and prompts - replace the previous blanket statement "real cogniformerus still unverified" with a tighter one: the actual gate seed corpus is covered, but larger real corpora are not The first local smoke run on this real gate seed corpus used: - `384D` - `ann_k=4` - `top_k=4` - `m=24` - `ef_construction=200` - `sorted_hnsw.ef_search=64` - `5` timing runs on a fresh temp cluster Result: - heap path-aware SQL baseline: - `p50 0.027 ms` - `hit@1/hit@k = 100/100` - `facts_sh` path-aware SQL baseline: - `p50 0.026 ms` - `hit@1/hit@k = 100/100` - `sorted_heap_expand_twohop_path_rerank()`: - `p50 0.017 ms` - `hit@1/hit@k = 100/100` - `sorted_heap_graph_rag_twohop_path_scan()`: - `p50 0.045 ms` - `hit@1/hit@k = 100/100` This does not prove scale behavior. It proves something narrower and still useful: the current path-aware GraphRAG helper/wrapper contract works on the actual Butler gate seed facts and prompts, not only on the synthetic `person -> parent -> city` generator. One adversary control also mattered here: this was not only a pass at a near-full seed budget. Re-running the same smoke at `ann_k=2`, `top_k=2` still kept both multihop queries at `100/100`. The correct next step is therefore: > **tune the current narrow helper family before considering a bigger > graph-specific subsystem** That remains the smallest change that can still convert the observed block-pruning advantage into an end-to-end query win. ## Real code-corpus prototype The next honest check after the Butler gate fact smoke was not another synthetic graph. It was the actual `cogniformerus` code corpus plus the real cross-file question bank already used by Butler's own code benchmark. - source tree: `cogniformerus/src/cogniformerus` - question source: `cogniformerus/bin/butler_code_test.cr` - harness: [`scripts/bench_graph_rag_code_corpus.py`](../scripts/bench_graph_rag_code_corpus.py) This harness builds a narrow code-GraphRAG shape: - each source file is one entity - each chunk in that file becomes one fact row - `entity_id = file_id` - `relation_id = HAS_CHUNK` - `target_id = chunk_id` - query quality is scored against the real CrossFile benchmark keywords from `butler_code_test.cr` This is not a full code graph. 
It is a bounded falsifier for a simpler claim: > if GraphRAG-style seeded expansion is already useful on a real corpus, it > should show up even on the natural `file -> chunk` expansion shape The first stable local point used: - `40` files - `747` chunk rows - `6` real CrossFile questions - `384D` - `ann_k=16` - `top_k=4` - `m=24` - `ef_construction=200` - `sorted_hnsw.ef_search=64` - `shared_buffers=64MB` - fresh backend - `3` timing runs Result: - direct ANN over raw chunks: - heap: `p50 0.740 ms` - `sorted_heap`: `p50 0.712 ms` - keyword coverage: `63.3%` - full-keyword hits: `33.3%` - file-seeded SQL expansion: - heap: `p50 0.516 ms` - `sorted_heap`: `p50 0.468 ms` - same `63.3%` keyword coverage - same `33.3%` full-keyword hits - `sorted_heap_expand_rerank()` helper: - `p50 0.665 ms` - same `63.3%` keyword coverage - same `33.3%` full-keyword hits The important conclusion is narrow but real: - the real code corpus branch is now reproducible inside this repository - seeded expansion by file preserves answer-support quality on the real CrossFile question set - on this code corpus, the current gain is **latency**, not answer quality - the helper is not yet the latency leader on this tiny real corpus; the simple SQL expansion shape still wins locally This means the next code-corpus GraphRAG step is not "invent a bigger graph API". It is either: - a richer real code-graph relation hypothesis than plain `file -> chunk`, or - a lower-overhead helper path for this very simple expansion contract ### Real `require`-graph falsifier The obvious next hypothesis was that plain `file -> chunk` was too weak, and that the real local code graph should help once actual `require` edges were present. That hypothesis is now tested in the same harness: - `53` local `require` edges derived from the real `cogniformerus` source tree - relation `REQUIRES_FILE` - two new query shapes: - `seed_require_twohop_*` - `seed_file_plus_require_in` Stable local result on the same `40`-file / `800`-row / `6`-question point, `3` runs: - plain file-seeded expansion: - `sorted_heap`: `0.471 ms` - keyword coverage: `63.3%` - full hits: `33.3%` - file plus required files: - `sorted_heap`: `0.605 ms` - same `63.3%` keyword coverage - same `33.3%` full hits - dependency-only two-hop: - `sorted_heap`: `0.391 ms` - keyword coverage: `20.0%` - full hits: `0.0%` So the richer real relation hypothesis is currently **refuted** on this code corpus: - adding dependency files does not improve answer-support quality - dependency-only traversal is actively worse because it drops own-file context - unioning own files with required files only adds cost, not quality This is a useful stopping point. The next likely win for real code-GraphRAG is not "just add more code edges". It is a different retrieval contract or a lower-overhead helper path on the already-good file-seeded shape. ### File-summary seed falsifier The next retrieval-contract hypothesis was also tested locally on the same real code corpus: - add one synthetic-but-data-derived summary row per file - seed on those summary rows - then expand back to the file's chunk rows The goal was to test whether the missing factor was simply that chunk-level ANN was a poor way to choose files. That also failed to improve answer-support quality. 
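The shape under test is easy to sketch; the relation codes below are illustrative stand-ins for the harness's actual constants:

```sql
-- Hedged sketch: seed ANN on the per-file summary rows, then return the
-- selected files' chunk rows. Relation codes (1 = HAS_CHUNK,
-- 2 = FILE_SUMMARY) are illustrative, not the harness's real values.
WITH summary_seeds AS (
    SELECT entity_id
    FROM facts_sh
    WHERE relation_id = 2
    ORDER BY embedding <=> :query_vec
    LIMIT 16                              -- ann_k
)
SELECT f.target_id, f.payload
FROM facts_sh AS f
WHERE f.entity_id IN (SELECT entity_id FROM summary_seeds)
  AND f.relation_id = 1
ORDER BY f.embedding <=> :query_vec
LIMIT 4;                                  -- top_k
```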
Stable smoke result on the same `40`-file / `840`-row / `6`-question point:

- summary-seeded expansion:
  - heap: `0.587 ms`
  - `sorted_heap`: `0.564 ms`
  - keyword coverage: `63.3%`
  - full hits: `33.3%`

So the current real code-corpus plateau is now bounded more tightly:

- plain file-seeded expansion: same quality, lower latency
- file summaries: same quality, higher latency
- require edges: no quality gain
- require-only traversal: quality regression

That strongly suggests the next code-corpus GraphRAG branch should not be "more local graph structure" or "better file seeds" in the same lexical setup. The remaining frontier is more likely one of:

- a different quality metric / question contract,
- better embeddings,
- or a lower-overhead execution path on the already-best file-seeded shape.

### Oracle-seed and oracle-rerank diagnostic

The next adversary question was sharper:

> is the plateau really about bad file seeds, or is it already downstream in
> the rerank / evaluation contract?

The harness now includes two explicit oracle diagnostics, plus one deployable lexical control, on the same real code corpus:

- **oracle file seeds**
  - choose seed files by benchmark-keyword overlap against the full file text
  - this is not a deployable retrieval contract; it is a diagnostic ceiling
- **prompt-derived lexical rerank**
  - keep the same ANN-derived file seeds
  - rerank by lexical overlap with terms extracted from the actual user prompt
  - this is deployable in principle, but much weaker than the oracle signal
- **oracle keyword rerank**
  - keep the same ANN-derived file seeds
  - rerank the expanded chunk rows by direct overlap with the benchmark's gold CrossFile keywords before falling back to embedding distance

Stable local result, `3` runs, same `40`-file / `840`-row / `6`-question point:

- plain file-seeded expansion:
  - `sorted_heap`: `0.443 ms`
  - keyword coverage: `63.3%`
  - full hits: `33.3%`
- oracle file seeds:
  - `sorted_heap`: `0.416 ms`
  - same `63.3%` keyword coverage
  - same `33.3%` full hits
- prompt-derived lexical rerank:
  - `sorted_heap`: `3.005 ms`
  - same `63.3%` keyword coverage
  - worse `16.7%` full hits
- oracle keyword rerank:
  - heap: `2.905 ms`
  - `sorted_heap`: `2.944 ms`
  - keyword coverage: `90.0%`
  - full hits: `66.7%`

This is a useful but narrow falsifier:

- the plateau is **not** explained by weak file seeds alone
- richer local graph structure also did not explain it
- a simple prompt-term rerank at `top_k=4` also did not explain it
- but once the rerank contract is allowed to use the benchmark's own gold keywords, quality jumps sharply

That does **not** justify a product claim, because the oracle rerank is using the same keyword signal that the benchmark later scores. It does justify a more targeted next hypothesis:

> the remaining quality frontier on the real code corpus is more likely in the
> query/rerank contract or embedding space than in local graph topology or seed
> selection

### Result-budget and packing diagnostic

The broad "cheap lexical hybrid does not help" claim turned out to be too strong once the same real code-corpus harness was rerun at larger result budgets.
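For reference, the prompt-derived lexical rerank rows in the sweep below follow a scoring shape roughly like this sketch; the term list and the `expanded_chunks` relation are illustrative placeholders:

```sql
-- Hedged sketch: count prompt-term substring hits per expanded row, then
-- fall back to embedding distance. 'overlap' / 'window' stand in for the
-- terms the harness extracts from the real prompt.
SELECT target_id, payload
FROM expanded_chunks
ORDER BY
    ((payload ILIKE '%overlap%')::int
   + (payload ILIKE '%window%')::int) DESC,
    embedding <=> :query_vec
LIMIT 4;
```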
Bounded local sweep, same `40`-file / `840`-row / `6`-question corpus, `ann_k=16`, `3` runs: - plain file-seeded `sorted_heap` expansion: - `top_k=4`: `0.402 ms`, `63.3%` keyword coverage, `33.3%` full hits - `top_k=8`: `0.460 ms`, `68.1%` keyword coverage, same `33.3%` full hits - `top_k=16`: `0.469 ms`, `84.3%` keyword coverage, same `33.3%` full hits - `top_k=32`: `0.449 ms`, `94.3%` keyword coverage, `66.7%` full hits - prompt-derived lexical rerank: - `top_k=4`: `3.005 ms`, `63.3%`, `16.7%` - `top_k=8`: `3.176 ms`, `86.7%`, `50.0%` - `top_k=12`: `3.149 ms`, `90.0%`, `66.7%` - `top_k=32`: `3.147 ms`, `96.7%`, `83.3%` So the real code-corpus plateau is not just a seed-quality problem. It is also partly a **result-budget / packing** problem: - with more rows, even the plain file-seeded path recovers much more keyword coverage - prompt-derived lexical rerank starts to help only once the row budget is not extremely tight That makes the next bounded hypothesis more specific: > the remaining small-`top_k` gap is likely about how evidence is packed into a > tiny chunk budget, not about choosing better files One more diagnostic supports that narrower claim. On the original `top_k=4` point, a diversity-aware prompt-term rerank was also tested: - `sorted_heap` prompt-diverse rerank: - `3.229 ms` - `76.7%` keyword coverage - still only `33.3%` full hits That is a partial gain in coverage, but still not the qualitative jump needed to make the current small-budget contract compelling. ### Code-aware embedding diagnostic The next bounded hypothesis was exactly what the code corpus suggests: > maybe the remaining gap is not just about rerank logic, but about the fact > that the current harness still uses a Gutenberg-style lexical tokenizer that > does not understand `CamelCase` or `_snake_case` identifiers well The harness now supports two embedding modes: - `generic` - existing lexical hash over generic text tokens - `code_aware` - keeps the full code token, but also splits identifiers on `_` and `CamelCase` before hashing Stable local comparison on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - plain file-seeded `sorted_heap` expansion: - `generic`: `0.450 ms`, `63.3%` keyword coverage, `33.3%` full hits - `code_aware`: `0.427 ms`, `61.4%` keyword coverage, `16.7%` full hits - prompt-diverse rerank: - `generic`: `3.178 ms`, `76.7%`, `33.3%` - `code_aware`: `3.351 ms`, `76.7%`, `50.0%` - oracle keyword rerank: - `generic`: `2.672 ms`, `90.0%`, `66.7%` - `code_aware`: `2.435 ms`, `96.7%`, `83.3%` This is another mixed but useful falsifier: - code-aware tokenization is **not** a free win by itself - plain ANN + file expansion actually got slightly worse - but once combined with a diversity-aware rerank, the same code-aware mode did improve the small-budget `full_pct` So the current code-corpus frontier is now even narrower: > the next likely win is not "better seeds" or "more edges", but a tighter > coupling between code-aware embeddings and a smarter small-budget rerank / > packing contract ### Summary-output packing win The next bounded hypothesis was the most direct one implied by the previous diagnostics: > if the real bottleneck is small-budget packing, then maybe raw chunks are > simply the wrong final output unit for this code benchmark The harness already materializes one summary row per file. The new test keeps the same ANN-derived file seeds, but returns file summaries as the final output rows instead of raw chunks. 
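As a sketch, only the output filter changes relative to the chunk-output query; the relation code is again an illustrative assumption:

```sql
-- Same ANN-derived file seeds; the final rows switch from chunk rows to one
-- summary row per file. Relation code 2 is an illustrative FILE_SUMMARY id.
SELECT target_id, payload
FROM facts_sh
WHERE entity_id = ANY(:seed_file_ids)
  AND relation_id = 2
ORDER BY embedding <=> :query_vec
LIMIT 4;
```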
Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - chunk output (`seed_file_expand_in`, `sorted_heap`): - `0.418 ms` - `63.3%` keyword coverage - `33.3%` full hits - summary output (`seed_file_summary_output_in`, `sorted_heap`): - `0.200 ms` - `71.0%` keyword coverage - `33.3%` full hits - prompt summary rerank (`prompt_summary_rerank_in`, `sorted_heap`): - `0.318 ms` - `73.3%` keyword coverage - `50.0%` full hits - code-aware embedding mode: - chunk output: - `0.418 ms` - `61.4%` - `16.7%` - summary output: - `0.207 ms` - `77.6%` - `33.3%` - prompt summary rerank: - `0.426 ms` - `77.6%` - `33.3%` This is the first clean small-budget win on the real code corpus: - summary rows are a better packing unit than raw chunks at `top_k=4` - they improve coverage while also reducing latency - in the generic mode, prompt-aware reranking over summaries also improves `full_pct` So the current strongest product-facing hypothesis is no longer "better seeds" or "more graph edges". It is: > for real code GraphRAG, file summaries are a stronger final output unit than > raw chunks when the answer budget is tiny ### Summary rows as seed unit The next narrow question was whether summaries are only a better **output** unit, or also a better **seed** unit. That was tested by forcing the ANN seed step to rank only `REL_FILE_SUMMARY` rows and then keeping the final result set on summaries as well. Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - summary output from mixed ANN seeds (`seed_file_summary_output_in`, `sorted_heap`): - `0.199 ms` - `71.0%` keyword coverage - `33.3%` full hits - summary output from summary-only seeds (`summary_seed_summary_output_in`, `sorted_heap`): - `0.116 ms` - `77.6%` keyword coverage - `33.3%` full hits - prompt summary rerank from mixed seeds: - `0.329 ms` - `73.3%` - `50.0%` - prompt summary rerank from summary-only seeds: - `0.541 ms` - `74.3%` - `33.3%` - code-aware embedding mode: - mixed-seed summary output: - `0.193 ms` - `77.6%` - `33.3%` - summary-only seed summary output: - `0.112 ms` - `64.3%` - `33.3%` So the current tiny-budget frontier is now split into two clear points: - **fastest coverage point** on this corpus: - generic embedding mode - summary-only seeds - summary output - **best full-hit point** on this corpus: - generic embedding mode - mixed ANN seeds - prompt-aware summary rerank And one more falsifier is now clear: > summary rows are not universally a better seed unit; the benefit depends on > the embedding mode and the final scoring contract ### Summary-plus-chunk hybrid output The next bounded question was whether the best tiny-budget contract should stay purely on summaries, or whether a hybrid output can do better: > use summaries to choose the right files, but also emit one best chunk from > each selected file so the final answer set contains both compressed context > and one concrete code span That was tested in two variants: - mixed ANN seeds -> summary ranking -> one best chunk per selected file - summary-only seeds -> summary ranking -> one best chunk per selected file Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - best prior full-hit point: mixed-seed prompt summary rerank - `0.363 ms` - `73.3%` - `50.0%` - mixed-seed summary+chunk hybrid: - `1.481 ms` - `84.3%` - `33.3%` 
- summary-seeded summary+chunk hybrid: - `1.627 ms` - `78.1%` - `50.0%` - code-aware embedding mode: - best prior summary-only point: - prompt summary rerank - `0.372 ms` - `77.6%` - `33.3%` - mixed-seed summary+chunk hybrid: - `1.616 ms` - `84.3%` - `50.0%` - summary-seeded summary+chunk hybrid: - `1.688 ms` - `77.6%` - `33.3%` So this branch narrows the frontier again: - hybrid output is **not** a universal improvement - for the generic mode, pure summary rerank remains the better tiny-budget full-hit point - for the code-aware mode, mixed-seed summary+chunk hybrid is the first path that reaches `50.0%` full hits at `top_k=4` That means the current strongest small-budget choices are now split: - **generic mode**: - summaries-only remain the better contract - **code-aware mode**: - hybrid summary+chunk output is now the better contract ### Fixed-ratio hybrid packing The previous hybrid branch still left one obvious ambiguity: > was the hybrid result about having both summaries and chunks at all, or just > about how many summary slots the tiny `top_k=4` budget reserved? That was tested with two fixed-ratio mixed-seed hybrids: - **summary-light**: `1` summary + `3` chunk slots - **summary-heavy**: `3` summary slots + `1` chunk slot Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - prior best full-hit point: - prompt summary rerank - `0.337 ms` - `73.3%` - `50.0%` - prior balanced hybrid: - `1.490 ms` - `84.3%` - `33.3%` - summary-light hybrid: - `1.753 ms` - `80.0%` - `33.3%` - summary-heavy hybrid: - `1.057 ms` - `86.7%` - `50.0%` - code-aware embedding mode: - prior best point: - balanced hybrid - `1.566 ms` - `84.3%` - `50.0%` - summary-light hybrid: - `2.246 ms` - `68.1%` - `33.3%` - summary-heavy hybrid: - `0.879 ms` - `84.3%` - `50.0%` This resolves the remaining hybrid ambiguity: - the hybrid win is **not** about chunks in general - it is specifically about reserving a small number of chunk slots while keeping the budget summary-heavy So the refined tiny-budget frontier is now: - **generic mode**: - best latency/full-hit tradeoff: pure prompt summary rerank - best coverage at the same full-hit level: summary-heavy hybrid - **code-aware mode**: - summary-heavy hybrid is now the strongest point ### Summary-heavy hybrid with summary-only seeds The remaining seed question after the fixed-ratio result was very narrow: > if the winning hybrid is already summary-heavy, should its seed unit also be > switched fully to summaries? That was tested directly against the current summary-heavy mixed-seed hybrid. 
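For reference, the summary-heavy split packs the `top_k=4` budget roughly like this sketch, using the winning `3 + 1` slot ratio; relation codes are illustrative:

```sql
-- Hedged sketch of summary-heavy packing: reserve 3 of the 4 result slots
-- for summary rows and 1 slot for the single best chunk row.
(SELECT target_id, payload
 FROM facts_sh
 WHERE entity_id = ANY(:seed_file_ids) AND relation_id = 2   -- summaries
 ORDER BY embedding <=> :query_vec
 LIMIT 3)
UNION ALL
(SELECT target_id, payload
 FROM facts_sh
 WHERE entity_id = ANY(:seed_file_ids) AND relation_id = 1   -- best chunk
 ORDER BY embedding <=> :query_vec
 LIMIT 1);
```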
Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - prompt summary rerank: - `0.395 ms` - `73.3%` - `50.0%` - mixed-seed summary-heavy hybrid: - `1.062 ms` - `86.7%` - `50.0%` - summary-seeded summary-heavy hybrid: - `1.175 ms` - `87.6%` - `50.0%` - code-aware embedding mode: - prompt summary rerank: - `0.390 ms` - `77.6%` - `33.3%` - mixed-seed summary-heavy hybrid: - `0.965 ms` - `84.3%` - `50.0%` - summary-seeded summary-heavy hybrid: - `0.981 ms` - `77.6%` - `33.3%` This closes the seed-unit branch for the current frontier: - **generic mode**: - summary-only seeds can squeeze out a tiny extra coverage gain, but they do not improve full hits and they cost more latency than the mixed-seed summary-heavy hybrid - **code-aware mode**: - summary-only seeds are clearly worse; the mixed-seed summary-heavy hybrid remains the strongest point ### Per-question failure pattern Aggregate percentages were no longer enough to guide the next branch, so the real code-corpus harness now supports targeted diagnostics: - `--case-filter` - `--report-questions` That was used to inspect the current best generic and code-aware contracts on the exact CrossFile prompts from `butler_code_test.cr`. Stable local diagnostic, same `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - **generic mode** - best latency/full-hit point: - `prompt_summary_rerank_in` - best coverage/full-hit point: - `prompt_summary_chunk_hybrid_s3_in` - **code-aware mode** - best point: - `prompt_summary_chunk_hybrid_s3_in` The important result is not just the percentages, but **which** questions stay hard: - `Response memory policy` - still misses under all current best contracts - current quality stays around `40.0%` - `Streaming overlap` - still misses under all current best contracts - current quality stays around `80.0%` - `Butler response routing` - generic contracts still miss it - code-aware summary-heavy hybrid fixes it to `100.0%` - `Memory store flow` - generic best contracts already solve it - code-aware summary-heavy hybrid still leaves it at `85.7%` This narrows the remaining frontier again: > the next real improvement is likely query-specific or corpus-specific, not a > broad packing or seed policy that helps every question equally The new payload diagnostics make that even more concrete: - `Response memory policy` - current best contracts already pull the **right file neighborhood**: - `memory/hierarchical.cr` - `memory/pgvector.cr` - `memory/external_store.cr` - `butler/persona.cr` - the summary-heavy hybrid even surfaces the `_micro_only` chunk from `memory/hierarchical.cr` - but the remaining miss is about **policy nuance**, not file choice: the returned rows still do not cover the full combination of `_micro_only`, refusal/pollution behavior, and external-storage policy - `Streaming overlap` - current best contracts already pull the correct file: `streaming/controller.cr` - both summary and chunk rows surface the overlap/chunking topic - the remaining miss is about **exact constants / same-file granularity**: the query still does not close the final `1500` / `100` coverage gap So for these two stubborn real prompts, the problem has narrowed from "retrieval picked the wrong files" to a much smaller statement: > the current system is usually choosing the right file region, but not yet the > exact evidence fragment or policy detail needed to close the benchmark ### Same-file local chunk refinement does not 
rescue the hard prompts The next bounded hypothesis was: > if the right file is already selected, maybe the fix is simply to give the > best file two nearby chunks instead of one That was tested with a new `prompt_summary_chunk_local2_in` case: - keep the summary-heavy contract - keep mixed ANN seeds - replace the single best chunk from the top file with a 2-chunk local window around the best chunk anchor It did **not** help. Targeted hard-prompt rerun (`Response memory policy` + `Streaming overlap`, `ann_k=16`, `top_k=4`, fresh backend): - generic mode: - existing summary-heavy hybrid: - `70.0%` - `0.945-1.050 ms` - local 2-chunk refinement: - `70.0%` - `1.660-1.729 ms` - code-aware mode: - existing summary-heavy hybrid: - `60.0%` - `1.041-1.078 ms` - local 2-chunk refinement: - `60.0%` - `1.689-1.733 ms` Bounded all-question rerun (`40` files, `840` rows, `6` real questions, `3` runs): - generic mode: - existing summary-heavy hybrid: - `0.988 ms` - `86.7%` - `50.0%` - local 2-chunk refinement: - `1.572 ms` - `84.3%` - `33.3%` - code-aware mode: - existing summary-heavy hybrid: - `0.979 ms` - `84.3%` - `50.0%` - local 2-chunk refinement: - `1.523 ms` - `84.3%` - `50.0%` So the next frontier is narrower again: > the missing quality is not solved by a simple "take one more nearby chunk" > policy; the remaining problem is finer-grained evidence choice, not just a > larger same-file window ### Semantic chunk selection is a generic-mode win, but not a universal one The next bounded question was different from the failed local-window branch: > maybe the chunk budget is fine, and the real problem is that the last chunk is > being picked with the wrong scoring rule That was tested with `prompt_summary_chunk_semantic_s3_in`: - keep the current summary-heavy mixed-seed contract - keep the same `3 summaries + 1 chunk` budget - change only the final chunk selection: - old path: lexical-first within the top file - new path: semantic-distance-first within the top file Hard-prompt rerun (`Response memory policy` + `Streaming overlap`, fresh backend, `ann_k=16`, `top_k=4`): - generic mode: - old summary-heavy hybrid: - `70.0%` - `0.992 ms` - semantic chunk selection: - `70.0%` - `0.481 ms` - code-aware mode: - old summary-heavy hybrid: - `60.0%` - `1.035 ms` - semantic chunk selection: - `60.0%` - `0.429 ms` Full `6`-question rerun (`40` files, `840` rows, `3` runs): - generic mode: - old summary-heavy hybrid: - `0.976 ms` - `86.7%` - `50.0%` - semantic chunk selection: - `0.453 ms` - `86.7%` - `50.0%` - code-aware mode: - old summary-heavy hybrid: - `0.975 ms` - `84.3%` - `50.0%` - semantic chunk selection: - `0.474 ms` - `77.6%` - `33.3%` This creates a new mode-specific frontier: - **generic mode** - `prompt_summary_chunk_semantic_s3_in` is now the stronger coverage-preserving hybrid - it keeps the same aggregate quality as the old summary-heavy hybrid while cutting latency by roughly half - **code-aware mode** - the same semantic swap is not acceptable - it buys latency, but loses both coverage and full hits So the next branch should treat the two embedding modes separately instead of assuming one chunk-selection rule can dominate both. ### Prompt-focused file-local snippet extraction The next successful branch stopped changing retrieval at all. Instead of asking the SQL layer to return better rows, it asked a narrower question: > if `prompt_summary_rerank_in` already selects the right files, can we extract > better evidence fragments from those files after retrieval? 
That is now implemented in the real code-corpus harness as: - `prompt_summary_snippet_py` Contract: - keep the existing `prompt_summary_rerank_in` SQL seed/output path - for each returned summary row, resolve the underlying source file - extract a prompt-focused snippet from the full file using: - prompt-term matching against code-aware line tokens - coverage-greedy anchor selection with method-definition tie-breaks - Crystal method-body expansion instead of fixed-radius windows for selected `def` anchors - adjacent helper-method merge for short `?` helpers referenced by the selected method body - nearby config-initializer merge for short ivar-based helpers, so snippets keep concrete defaults like `window_size=1500` / `overlap=100` - append the snippet to the original summary payload instead of replacing the summary row - cache `(file, prompt)` snippets in-process so repeated runs are measured in both cold and warm regimes This is a **downstream evidence-selection layer**, not a new PostgreSQL query primitive. The main value is answer quality on the real `cogniformerus` CrossFile benchmark. Verified local result on the stable real code-corpus point (`40` files, `840` rows, `6` questions, `384D`, `ann_k=16`, `top_k=4`, fresh backend): - generic embedding mode: - `prompt_summary_rerank_in`: - `p50 0.343-0.392 ms` - `73.3%` keyword coverage - `50.0%` full hits - `prompt_summary_snippet_py`: - warm-cache `p50 0.551-0.698 ms` - cold first-pass `p50 15.316 ms`, `avg 15.435 ms` - `100.0%` keyword coverage - `100.0%` full hits - code-aware embedding mode: - `prompt_summary_rerank_in`: - `p50 0.395-0.398 ms` - `77.6%` - `33.3%` - `prompt_summary_snippet_py`: - warm-cache `p50 0.623 ms` - `97.6%` - `83.3%` - `prompt_symbol_summary_snippet_py`: - warm-cache `p50 0.970-0.989 ms` - `100.0%` - `100.0%` Per-question generic rerun on the same corpus shows what the snippet layer actually fixed: - now solved at `100.0%`: - `Butler response routing` - `Memory store flow` - `Two-stage answering` - `NLU hybrid classification` - `Response memory policy` - `Streaming overlap` Per-question code-aware rerun with `prompt_symbol_summary_snippet_py` now also solves the full set at `100.0%`, including the old remaining miss: - `Memory store flow` Interpretation: - the remaining plateau on this real code corpus was **not** primarily file retrieval - it was a file-local evidence selection problem - the last code-aware miss turned out to be a **seed-ranking problem inside the summary path**, not a snippet-window problem - `HierarchicalMemory` was already present in the summary row for `memory/hierarchical.cr` - the fix was a bounded symbol-aware variant, `prompt_symbol_summary_snippet_py`, which: - extracts exact prompt symbols like `HierarchicalMemory`, `TwoStageAnswerer`, `DialogueNLU` - unions a tiny exact-symbol summary seed set with the existing ANN seeds - ranks summary rows by `symbol_hits` before the older prompt-term score - the strongest fix was not "wider windows" - it was preserving summary rows while adding code-structured snippets underneath them - prompt-focused snippet extraction is the first branch that moves the real code-corpus benchmark from `50.0%` to `100.0%` full hits at the same tiny-budget `top_k=4` - the current frontier is now split by embedding mode: - generic: `prompt_summary_snippet_py` remains the better latency point - code-aware: `prompt_symbol_summary_snippet_py` is the quality winner Important caveat: - the warm numbers rely on an in-process `(file, prompt)` snippet cache - the cold 
first-pass cost is still materially higher than pure SQL rerank - so this is a quality-oriented contract, not a free latency win - the symbol-aware variant is not a generic improvement: - in generic mode it gives no quality lift and only adds cost That code-corpus frontier is now also checked under a repeated-build protocol: - [`scripts/repeat_graph_rag_code_corpus_builds.py`](../scripts/repeat_graph_rag_code_corpus_builds.py) - `3` independent fresh temp-cluster builds - local `facts_sh` only, same stable point: - `384D` - `ann_k=16` - `top_k=4` - `ef_search=64` - `ef_construction=200` - `m=24` - fresh backend Verified repeated-build result: - generic: - `prompt_summary_snippet_py` - `p50 median 0.613 ms`, range `0.543-0.632 ms` - stable `100.0% / 100.0%` - `prompt_symbol_summary_snippet_py` - `p50 median 0.986 ms`, range `0.932-1.047 ms` - same `100.0% / 100.0%` - therefore strictly slower on the generic frontier - code-aware: - `prompt_summary_snippet_py` - `p50 median 0.612 ms`, range `0.602-0.629 ms` - stable `97.6% / 83.3%` - `prompt_symbol_summary_snippet_py` - `p50 median 0.963 ms`, range `0.928-1.022 ms` - stable `100.0% / 100.0%` Interpretation: - the new symbol-aware code-aware win is **build-stable**, not a one-off lucky HNSW construction - the generic frontier is also build-stable, and the symbol-aware case remains dominated there That same repeated-build protocol was then rerun on an AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) using: - [`scripts/repeat_graph_rag_code_corpus_builds_aws.sh`](../scripts/repeat_graph_rag_code_corpus_builds_aws.sh) - the same `3` fresh builds - the same minimal synced `cogniformerus` source tree and `butler_code_test.cr` prompt set Verified AWS repeated-build result: - generic: - `prompt_summary_snippet_py` - `p50 median 0.955 ms`, range `0.954-0.960 ms` - stable `100.0% / 100.0%` - `prompt_symbol_summary_snippet_py` - `p50 median 1.485 ms`, range `1.473-1.487 ms` - same `100.0% / 100.0%` - still strictly slower on the generic frontier - code-aware: - `prompt_summary_snippet_py` - `p50 median 1.008 ms`, range `1.008-1.009 ms` - stable `97.6% / 83.3%` - `prompt_symbol_summary_snippet_py` - `p50 median 1.541 ms`, range `1.537-1.557 ms` - stable `100.0% / 100.0%` So the code-aware split is now **cross-environment verified**: - generic keeps the older snippet contract - code-aware keeps the symbol-aware snippet contract - the change in winner is not a local Apple-only artifact ## Larger in-repo `cogniformerus` transfer gate The previous repeated-build result used the smaller synced `cogniformerus/src/cogniformerus` slice (`40` files, `840` rows after summary + chunk expansion). That was a good stable benchmark, but it was still fair to ask whether the contract would survive a materially larger in-repo code corpus. 
The next bounded adversary check therefore reran the same repeated-build protocol on the full `cogniformerus` repository:

- source tree: `~/Projects/Crystal/cogniformerus`
- file count: `183` Crystal files
- prompt set: the same real `butler_code_test.cr` CrossFile prompts
- same ANN knobs:
  - `384D`
  - `ann_k=16`
  - `ef_search=64`
  - `ef_construction=200`
  - `m=24`

The old tiny-budget point (`top_k=4`) did **not** transfer cleanly:

- generic `prompt_summary_snippet_py`
  - `p50 0.770 ms`
  - `87.1%` keyword coverage
  - `66.7%` full hits
  - `avg_rows 3.67`
- code-aware `prompt_symbol_summary_snippet_py`
  - `p50 1.824 ms`
  - `87.6%` keyword coverage
  - `66.7%` full hits
  - `avg_rows 4.00`

That is a real transfer gap, but it is **not** the same kind of failure as the external `folding/src` miss (covered in the external folding corpus check below). The next bounded hypothesis was simply to raise the final result budget while keeping the same seed contract and the same winner cases.

At `top_k=8`, `3` fresh builds gave:

- generic `prompt_summary_snippet_py`
  - `p50 median 0.819 ms`, range `0.794-0.855 ms`
  - stable `100.0% / 100.0%`
  - `avg_rows 6.33`
- code-aware `prompt_symbol_summary_snippet_py`
  - `p50 median 1.814 ms`, range `1.669-2.101 ms`
  - stable `100.0% / 100.0%`
  - `avg_rows 7.50`

So the larger in-repo Crystal-side transfer gate is now verified. The honest correction is:

- the current real code-corpus winners are **not** universal at the old `top_k=4` budget
- on the full in-repo corpus, they need a slightly larger final result budget
- once that budget moves to `top_k=8`, the current winners recover perfectly without needing a new seed or snippet contract

That narrows the remaining `0.13` real-corpus gap further:

- `~/Projects/Crystal` now has both the small stable slice and a larger full-repo transfer gate
- the next unverified generalization work was the mixed-language / archive side (`~/Projects/C`, `~/SrcArchives`)

## Mixed-language `~/Projects/C` adversary gate (`pycdc`)

The next release-hardening branch widened the code-corpus harness itself:

- JSON question fixtures are now supported
- source discovery is no longer hardcoded to `*.cr`
- local dependency edges now also understand quoted C/C++ includes: `#include "..."` -> `REQUIRES_FILE`

That made it possible to run the same narrow code-GraphRAG benchmark shape on a real mixed-language corpus under `~/Projects/C` without inventing a separate harness family. The first such corpus was `pycdc`:

- source tree: `~/Projects/C/pycdc`
- fixture: `scripts/fixtures/graph_rag_pycdc_questions.json`
- source extensions:
  - `.h`
  - `.cpp`
  - `.txt`
  - `.markdown`
- corpus size:
  - `138` files
  - `1281` rows after summary + chunk expansion
  - `72` local dependency edges from quoted includes

The first smoke run already gave the key split:

- generic `prompt_summary_snippet_py`
  - `75.0%` keyword coverage
  - `40.0%` full hits
- generic `prompt_symbol_summary_snippet_py`
  - `90.0%`
  - `60.0%`
- code-aware `prompt_summary_snippet_py`
  - `70.0%`
  - `60.0%`
- code-aware `prompt_compactseed_require_summary_snippet_fn`
  - `100.0%`
  - `100.0%`

That already falsified the lazy story that mixed-language transfer would look just like the Crystal corpora with only file-summary rerank. On `pycdc`, the include-aware rescue path matters much more.
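The include-edge derivation itself is a one-regexp pass; a hedged sketch against a hypothetical `source_lines(file_id, line)` staging table, which is not a harness artifact:

```sql
-- Hedged sketch: derive REQUIRES_FILE edge candidates from quoted includes.
-- Each capture group value is the quoted path that names the required file.
SELECT s.file_id,
       m.captured[1] AS required_path
FROM source_lines AS s,
     LATERAL regexp_matches(s.line, '#include\s+"([^"]+)"') AS m(captured);
```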
Repeated-build verification at `top_k=8`, `3` fresh builds, then gave: - generic `prompt_symbol_summary_snippet_py` - `p50 median 0.850 ms`, range `0.825-1.118 ms` - stable `90.0% / 60.0%` - `avg_rows 6.40` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.006 ms`, range `7.799-8.136 ms` - stable `100.0% / 100.0%` - `avg_rows 5.80` So the first real `~/Projects/C` gate is now covered, but it does **not** produce the same frontier as the Crystal corpora: - there is no equally cheap generic `100.0% / 100.0%` point here - the quality-complete point currently needs the slower helper-backed compact lexical seed + include rescue That still narrows the `0.13` release gap meaningfully: - `~/Projects/Crystal` is covered - `~/Projects/C` is covered - the remaining unverified archive-side gate is now `~/SrcArchives` ## Archive-side `~/SrcArchives` gate (`ninja/src`) The last remaining real-corpus gap named in the `0.13` plan was the archive side under `~/SrcArchives`. The new mixed-language harness path made it possible to cover that without another code change, so the next adversary corpus was: - source tree: `~/SrcArchives/apple/ninja/src` - fixture: `scripts/fixtures/graph_rag_ninja_questions.json` - source extensions: - `.h` - `.cc` - corpus size: - `103` files - `1757` rows after summary + chunk expansion - `282` local dependency edges from quoted includes The first smoke at the current default-ish budget (`top_k=8`) already gave a useful signal: - generic `prompt_summary_snippet_py` - `95.0%` keyword coverage - `80.0%` full hits - code-aware `prompt_summary_snippet_py` - `85.0%` - `80.0%` That differed from `pycdc` in an important way: - the archive corpus was already close on the plain generic path - the code-aware path was not stronger here - there was no immediate evidence that a dependency-rescue branch was needed The cheapest falsifier was therefore not a new query contract, but just a small increase in the final result budget. 
At `top_k=12`: - generic `prompt_summary_snippet_py` - `100.0% / 100.0%` - `p50 0.996 ms` on the first smoke - code-aware `prompt_summary_snippet_py` - stayed at `85.0% / 80.0%` Repeated-build verification (`3` fresh builds) then confirmed the archive-side winner: - generic `prompt_summary_snippet_py` - `p50 median 0.914 ms`, range `0.827-0.921 ms` - stable `100.0% / 100.0%` - `avg_rows 7.80` - code-aware `prompt_summary_snippet_py` - `p50 median 0.871 ms`, range `0.848-0.901 ms` - stable `85.0% / 80.0%` - `avg_rows 7.60` So the archive-side gate is now covered, and the conclusion is pleasantly narrow: - `~/SrcArchives` does not require a new rescue contract for the first verified corpus - the simple generic summary-snippet path closes `ninja/src` - the only change needed versus the smaller code-corpus points was a small result-budget bump from `top_k=8` to `top_k=12` This means the `0.13` larger real-corpus verification matrix is now complete in the scoped sense the plan asked for: - `~/Projects/Crystal` - `~/Projects/C` - `~/SrcArchives` ## External folding corpus check The next adversary check was a second real code corpus outside this repository: - source tree: `folding/src` - prompt set: `butler_folding_test.cr` This surfaced one real harness bug first: - [`scripts/bench_graph_rag_code_corpus.py`](../scripts/bench_graph_rag_code_corpus.py) originally globbed `*.cr` paths without filtering `is_file()` - on the `folding` tree that accidentally picked up `.crystal-cache` directories ending in `.cr` - the harness now filters to real files only Once that was fixed, the external corpus produced a useful repeated-build result. Local `3`-build protocol on `facts_sh`, generic mode, same small-budget point (`384D`, `ann_k=16`, `top_k=4`, `ef_search=64`, `ef_construction=200`, `m=24`, fresh backend): - `prompt_summary_snippet_py` - `p50 median 1.048 ms`, range `0.913-4.141 ms` - quality drifted across fresh builds: `90.5-100.0%` keyword coverage, `83.3-100.0%` full hits - `prompt_lexseed_require_summary_snippet_fn` - the first non-oracle rescue to `100.0% / 100.0%` - but under a colder repeated-build protocol it turned out to be much more expensive than the earlier one-build numbers suggested: `p50 median 28.266 ms`, range `26.887-30.698` - `prompt_compactseed_require_summary_snippet_fn` - `p50 median 5.940 ms`, range `5.914-6.128` - stable `100.0% / 100.0%` - `oracle_prompt_summary_snippet_py` - on a bounded full rerun it also stayed at `100.0% / 100.0%`, but the non-oracle compact-seed rescue already matches that quality, so oracle seeds are no longer the interesting external-generic diagnostic Interpretation: - the old claim that generic external folding was already solved by `prompt_summary_snippet_py` was too strong - the generic baseline is now clearly less robust on this corpus than on the in-repo `cogniformerus` slice - the first full-summary lexical rescue proved that the external gap was solvable, but it was too expensive to be a real frontier - the stronger branch was a **different lexical-seed representation**: a compact per-file seed table built from file path terms, require-target terms, and deduplicated summary tokens - that compact-seed rescue still closes the quality gap to `100.0% / 100.0%`, but cuts the old full-summary lexical rescue by about `4.8x` locally An isolated timing split then narrowed where that penalty actually sits. 
On a fresh local `3`-run sweep of only the old full-summary helper-backed rescue: - generic `prompt_lexseed_require_summary_snippet_fn` - `avg fetch ms/query = 10.674` - `avg postprocess ms/query = 8.033` - `24` snippet-cache misses, `48` hits - `avg build time per miss = 6.010 ms` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `avg fetch ms/query = 11.016` - `avg postprocess ms/query = 7.742` - `24` snippet-cache misses, `48` hits - `avg build time per miss = 5.787 ms` So the external rescue is **not** primarily a snippet-extraction problem. Even on the isolated cold pass, the dominant term is still the lexical-seed + `REQUIRES_FILE` fetch path. Snippet generation is a real secondary tax on the first pass, but it is not where the largest win now sits. A kept-temp-cluster component probe narrowed that one step further. On the same external `folding/src` corpus: - `ann` alone was cheap: about `0.51 ms` median across the 6 real prompts - `lexical_seed` alone was the real dominant stage: about `9.34 ms` median - `rescue_require` landed at about `9.28 ms` median because it inherits the same lexical-seed cost - `rescue_lexical_require_summaries` was about `9.86 ms` median The summary rows explain why this stage is expensive: `REL_FILE_SUMMARY` payload length was `80 / 2078 / 5441` bytes at `min / median / max` on the external corpus. So the rescue is paying to run prompt-term substring scoring against multi-kilobyte summary payloads even before snippet extraction starts. The same external `folding/src` corpus also answered the code-aware question. At the same repeated-build point: - code-aware `prompt_summary_snippet_py` - `p50 median 1.080 ms`, range `1.048-1.146 ms` - stable `79.8% / 66.7%` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `p50 median 36.676 ms`, range `29.806-40.705` - stable `100.0% / 100.0%` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 5.804 ms`, range `5.776-6.510` - stable `100.0% / 100.0%` - code-aware `oracle_prompt_summary_snippet_py` - `p50 median 1.217 ms`, range `1.149-1.303 ms` - stable `100.0% / 100.0%` So the external folding split is now sharper: - both **generic** and **code-aware** external folding now have a verified non-oracle rescue to `100.0% / 100.0%` - the external problem really was a **seed-representation problem**, not a snippet extraction problem - the current external default is the compact-seed rescue, not the old full-summary lexical rescue - the old full-summary rescue is now useful mainly as a diagnostic anchor for why the compact representation matters - the honest conclusion is narrower: - external folding is no longer blocked by an unsolved quality gap - it still pays a quality/latency tax relative to the primary `cogniformerus` code corpus, but that tax is now much smaller than before That local result also transferred to AWS ARM64 (`4 vCPU`, `8 GiB RAM`) under a fresh `3`-build repeated-build protocol: - generic `prompt_summary_snippet_py` - `p50 median 1.540 ms`, range `1.535-1.604 ms` - stable `90.5% / 83.3%` - generic `prompt_lexseed_require_summary_snippet_fn` - `p50 median 41.960 ms`, range `41.747-42.081` - stable `100.0% / 100.0%` - generic `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.839 ms`, range `8.732-8.846` - stable `100.0% / 100.0%` - code-aware `prompt_summary_snippet_py` - `p50 median 1.775 ms`, range `1.729-1.836 ms` - stable `79.8% / 66.7%` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `p50 median 60.413 ms`, range `60.298-60.660` - stable 
`100.0% / 100.0%` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.392 ms`, range `8.329-8.413` - stable `100.0% / 100.0%` So the compact-seed external rescue is now **cross-environment verified**, not a local artifact. The speedup over the old full-summary lexical rescue also survives the environment change: - generic: `41.960 ms -> 8.839 ms` - code-aware: `60.413 ms -> 8.392 ms` The external rescue is still slower than the primary in-repo winners, but it is no longer "full-quality only at tens of milliseconds". The next honest optimization target therefore changed. Cheap seed-budget cuts were already falsified (`ann_k < 16` and lexical-seed `LIMIT 1` both got worse), and the timing split shows that further work should focus on reducing the old full-summary lexical-seed cost. The compact lexical seed table already eliminated most of that cost, so the next branch is no longer "make lexical seeding viable at all"; it is whether the compact representation can be pushed closer to the primary in-repo code-corpus frontier. One obvious branch was also falsified directly: truncating lexical scoring to a summary prefix. On the external corpus: - `left(payload, 512)` dropped the rescue query to about `7.9 ms`, but quality fell back to `96.7% / 83.3%` - `left(payload, 1024)` restored `100.0% / 100.0%`, but it no longer sped the query up - the narrower threshold sweep (`640..992`) confirmed there was no useful middle ground: `992` bytes recovered `100.0% / 100.0%`, but was still slower than the full-payload rescue So a naive prefix cut is now a documented dead end. The remaining work is not "look at less text in the same way"; it needs a different lexical-seed representation or a different seed-selection contract altogether. ## March 26, 2026: `sorted_hnsw.shared_cache` GraphRAG branch A new bounded speed branch looked promising for fact-shaped GraphRAG: turning `sorted_hnsw.shared_cache` on for the ANN seed step. A direct local probe on a `2K x 384D` multihop graph reduced the path-aware wrapper from roughly `0.911 ms` total to `0.623 ms`, with most of the gain in the ANN stage. That did **not** survive the reliability gate. On the full local `5K`-pair, `64`-query multihop harness, keeping the same quality knobs (`ann_k=64`, `ef_search=128`, `ef_construction=200`, `m=24`) but switching only `sorted_hnsw.shared_cache` from `off` to `on` caused all `facts_sh` ANN-seeded rows to collapse to `0.0% / 0.0%`, while the `facts_heap` baseline stayed correct in the same run. The strongest evidence from this branch is: - the simple direct ANN seed query on `facts_sh` still returned the expected top rows with `shared_cache=on` - single-query GraphRAG probes could still look correct - the failure only showed up on the **full** same-session multihop harness, which points to a cache lifecycle / reuse bug rather than a general GraphRAG scoring bug So the current honest conclusion is narrow: - `sorted_hnsw.shared_cache = on` remains a **promising** performance branch for GraphRAG seed scans - it is **not** currently safe as the default GraphRAG benchmark or release operating point - the benchmark harnesses now expose a `--shared-cache on|off` switch, but the default stays `off` until this correctness issue is debugged and fixed
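Until then, the safe A/B probe shape is just a GUC flip around the standard ANN seed query; `:query_vec` is a placeholder for the bound query embedding:

```sql
-- Hedged sketch of the shared-cache A/B probe. The GUC names come from the
-- extension; keep shared_cache = off as the release operating point until
-- the cache lifecycle bug is fixed.
SET sorted_hnsw.shared_cache = off;  -- flip to on only for isolated probes
SET sorted_hnsw.ef_search = 128;

SELECT target_id
FROM facts_sh
ORDER BY embedding <=> :query_vec
LIMIT 64;                            -- ann_k
```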