# GraphRAG on `sorted_heap` This note evaluates a narrow question: > Can current `sorted_heap` + current vector search already support a useful > GraphRAG-style retrieval workflow, or do we need a new storage/model layer? The conclusion so far is: - **1-hop fact retrieval by source entity already fits `sorted_heap` well.** - **Naive SQL join-based multi-hop expansion does not expose much advantage.** - **`ANY(array_of_seed_ids)` expansion does trigger `SortedHeapScan`, but on warm and medium-scale local benchmarks it still loses to heap+btree on end-to-end latency despite reading fewer blocks.** - Narrow C helpers for expansion and fused top-K rerank now exist as: - `sorted_heap_expand_ids(...)` - `sorted_heap_expand_rerank(...)` - A one-call convenience wrapper now exists as: - `sorted_heap_graph_rag_scan(...)` - Those helpers materially improve the `sorted_heap` path on the synthetic GraphRAG benchmark, though pure heap+btree expansion is still faster on this synthetic workload. - Therefore the next promising primitive was correctly **a narrow C helper**, not a new graph storage engine and not a giant monolithic `graph_rag_scan()` API. ## Existing anchors The repository already has the main building blocks: 1. **Zone-map pruning on `sorted_heap`** - planner hook + `SortedHeapScan` custom scan - supports base-relation restriction on the leading PK columns 2. **Planner-integrated ANN via `sorted_hnsw`** - exact ordered results - works on both heap tables and `sorted_heap` tables 3. **Legacy graph traversal precedent** - `svec_graph_scan()` in `pq.c` - this is for ANN sidecar graph navigation, not fact graphs - still useful as evidence that the extension can host graph-like traversal logic in C ## What was benchmarked Synthetic fact graph schema: ```sql CREATE TABLE facts_heap ( entity_id int4 NOT NULL, relation_id int2 NOT NULL, target_id int4 NOT NULL, embedding svec(32) NOT NULL, payload text NOT NULL, PRIMARY KEY (entity_id, relation_id, target_id) ); CREATE TABLE facts_sh ( entity_id int4 NOT NULL, relation_id int2 NOT NULL, target_id int4 NOT NULL, embedding svec(32) NOT NULL, payload text NOT NULL, PRIMARY KEY (entity_id, relation_id, target_id) ) USING sorted_heap; ``` Both tables also receive the same ANN index: ```sql CREATE INDEX ... USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64); ``` Benchmark harness: - [`scripts/bench_graph_rag.py`](../scripts/bench_graph_rag.py) - local ephemeral PostgreSQL 18 temp cluster - deterministic synthetic fact graph - compares: - `hop1_entity` - `hop1_entity_relation` - `hop2_join` - `hop2_in` - `seed_expand_join` - `seed_expand_in` - `seed_expand_rerank_join` - `seed_expand_rerank_in` - `seed_expand_fn` - `seed_expand_rerank_fn` - `seed_expand_rerank_topk_fn` - `seed_graph_rag_scan_fn` The key comparison is between: - **join-shaped expansion** - **`ANY(array(seed_ids))` expansion** The second shape is the one that allows `sorted_heap` to expose its pruning logic directly on `entity_id`. 
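For concreteness, the two shapes differ only in how the seed set reaches the fact scan. The following is an illustrative sketch, not the harness's literal query text; `$1` stands for the query embedding and the seed `LIMIT` is arbitrary:

```sql
-- Join-shaped expansion: seeds arrive through a join, so the restriction on
-- entity_id never lands as a prunable base-relation qual for SortedHeapScan.
SELECT f.*
FROM facts_sh f
JOIN (SELECT target_id
      FROM facts_sh
      ORDER BY embedding <=> $1::svec
      LIMIT 8) s
  ON f.entity_id = s.target_id;

-- ANY(array)-shaped expansion: the same seeds collapsed into an array, which
-- does land as a qual on the leading PK column and triggers SortedHeapScan.
SELECT f.*
FROM facts_sh f
WHERE f.entity_id = ANY (ARRAY(SELECT target_id
                               FROM facts_sh
                               ORDER BY embedding <=> $1::svec
                               LIMIT 8));
```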
## Local findings

### Small smoke run

On a tiny graph (`300` entities, `4` edges/entity):

- `facts_sh` reduced buffer hits strongly for:
  - `hop1_entity`
  - `hop1_entity_relation`
  - `hop2_in`
  - `seed_expand_in`
- but end-to-end latency stayed close to heap because the whole dataset was fully warm and tiny

Most importantly:

- **join-shaped expansion largely erased the `sorted_heap` advantage**
- **`ANY(array(...))` expansion preserved `SortedHeapScan`**

### Medium warm run

On `20K` entities, `8` edges/entity (`160K` rows total), warm local cache:

- `hop1_entity`
  - heap: `Index Scan`
  - sorted_heap: `Custom Scan:SortedHeapScan`
  - sorted_heap reads fewer blocks at roughly latency parity
- `seed_expand_join`
  - bad shape for both
  - sorted_heap is not meaningfully better
- `seed_expand_in`
  - sorted_heap does use `SortedHeapScan`
  - buffer footprint drops
  - but **heap+btree still wins on total latency**

This means:

> current SQL shape can make `sorted_heap` read less, but executor/custom-scan
> overhead can still dominate the total time on warm-medium datasets

### Medium run with lower shared buffers

On `20K` entities, `16` edges/entity (`320K` rows total), `shared_buffers=64MB`:

- `hop1_entity`
  - sorted_heap stayed strong: fewer hits, same-or-better latency
- `seed_expand_join`
  - both paths were much worse
  - heap and sorted_heap were similar, with read noise dominating
- `seed_expand_in`
  - heap: lower latency
  - sorted_heap: fewer touched blocks / lower expansion footprint
  - but **still slower end-to-end**

This is the most important current result:

> On a graph larger than a warm toy dataset, `sorted_heap` already shows the
> expected locality/pruning behavior for seed expansion, but the current
> SQL + `CustomScan` path is not enough to turn that into a consistent latency
> win over heap+btree.

## Design implications

### What not to build first

1. **Not a new graph storage engine**
   - current evidence does not justify that jump
   - 1-hop retrieval is already good on current storage
2. **Not a giant monolithic `svec_graph_rag_scan()`**
   - it would have to combine:
     - ANN seed retrieval
     - graph expansion
     - rerank
   - this is a large surface area
   - it also risks duplicating planner/index logic from `sorted_hnsw`

### What to build next

The next narrow primitive should be something like:

```sql
sorted_heap_expand_ids(
    rel regclass,
    seed_ids int4[],
    relation_filter int2 DEFAULT NULL,
    limit_rows int4 DEFAULT 0
)
```

Why this shape:

- ANN seed retrieval can stay in SQL:
  - `SELECT target_id FROM facts ORDER BY embedding <=> $query LIMIT K`
- expansion becomes a dedicated low-overhead C primitive
- it avoids:
  - repeated executor/planner setup
  - generic `CustomScan` overhead for this narrow use case
- it keeps the product boundary small:
  - “expand these known entity IDs quickly”

That primitive can later be composed into:

1. SQL-only GraphRAG
2. a higher-level helper
3.
maybe a monolithic API if the narrow primitive proves valuable ## Helper result The narrow helpers now exist: ```sql sorted_heap_expand_ids( rel regclass, seed_ids int4[], relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, embedding svec, payload text ) ``` and: ```sql sorted_heap_expand_rerank( rel regclass, seed_ids int4[], query svec, top_k int4, relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` and: ```sql sorted_heap_expand_twohop_rerank( rel regclass, seed_ids int4[], query svec, top_k int4, hop1_relation_filter int4 DEFAULT NULL, hop2_relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` and: ```sql sorted_heap_graph_rag_scan( rel regclass, query svec, ann_k int4, top_k int4, relation_filter int4 DEFAULT NULL, limit_rows int4 DEFAULT 0 ) RETURNS TABLE ( entity_id int4, relation_id int2, target_id int4, payload text, distance float8 ) ``` Their current contract is intentionally narrow: - relation must be a `sorted_heap` table - relation must expose the columns: - `entity_id int4` - `relation_id int2` - `target_id int4` - `embedding svec` - `payload text` - the function reuses the zone-map range builder directly - it emits fact rows for known source entity IDs On the medium-pressure benchmark (`20K` entities, `16` edges/entity, `320K` rows, `shared_buffers=64MB`, fresh backend, `runs=3`), the helpers produced: - `facts_heap seed_expand_in`: `0.123 ms` - `facts_sh seed_expand_in`: `0.285 ms` - `facts_sh seed_expand_fn`: `0.165 ms` - `facts_sh seed_expand_rerank_in`: `0.369 ms` - `facts_sh seed_expand_rerank_fn`: `0.234 ms` - `facts_sh seed_expand_rerank_topk_fn`: `0.139 ms` - `facts_sh seed_graph_rag_scan_fn`: `0.144 ms` Interpretation: - `sorted_heap_expand_ids()` converts the observed block-pruning/locality advantage into a **real latency win over the current SQL + CustomScan path** - `sorted_heap_expand_rerank()` removes most of the remaining rerank overhead and is now materially faster than the current `sorted_heap` SQL rerank path (`0.139 ms` vs `0.369 ms`) - `sorted_heap_graph_rag_scan()` is only slightly slower than the direct fused helper composition (`0.144 ms` vs `0.139 ms`), so the convenience API does not erase the win - pure heap+btree expansion is still faster on this synthetic workload (`0.123 ms` vs `0.165 ms`) Relation-filtered probes narrow that gap further: - `facts_heap seed_expand_rel_in`: `0.074 ms` - `facts_sh seed_expand_rel_in`: `0.151 ms` - `facts_sh seed_expand_rel_fn`: `0.108 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.087 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.167 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.104 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.120 ms` So the relation-filtered GraphRAG path is materially better than the current SQL + `CustomScan` form, but it still does not clearly beat heap+btree on this synthetic corpus. The filtered helper path is nevertheless close enough that a real fact graph, wider payloads, or colder cache state may flip the comparison. Payload-width sensitivity does matter, but not monotonically. The benchmark harness now supports `--payload-bytes` to widen synthetic fact rows and test the claim that locality should matter more once facts stop being tiny strings. 
On the same medium-pressure setup (`20K` entities, degree `16`, `320K` rows, `shared_buffers=64MB`, fresh backend): - with `payload_bytes=1024` - `facts_heap seed_expand_in`: `0.188 ms` - `facts_sh seed_expand_in`: `0.185 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.120 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.100 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.125 ms` - with `payload_bytes=2048` - `facts_heap seed_expand_in`: `0.113 ms` - `facts_sh seed_expand_in`: `0.208 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.090 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.122 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.127 ms` Interpretation: - a wider inline payload can make `sorted_heap` competitive or slightly better on seed expansion - but the effect is not monotonic, so "wider payload always helps sorted_heap" is false on this synthetic generator - this synthetic text filler is still a weak proxy for real fact payloads because compression/TOAST behavior can change the balance again So the next falsifier should be a real-dataset GraphRAG harness or a more realistic payload model, not another synthetic-only extrapolation. ## Real-text Gutenberg graph A better falsifier now exists in: - [`scripts/bench_graph_rag_gutenberg.py`](../scripts/bench_graph_rag_gutenberg.py) This harness uses real Gutenberg paragraphs instead of synthetic payload text. It builds a small text graph: - relation `1`: `book -> paragraph` (`contains`) - relation `2`: `paragraph -> next_paragraph` (`next`) Embeddings are still deterministic lexical hash vectors, not external model embeddings. That means this harness is good for measuring graph-expansion latency on real text payloads and a real graph topology, but it is not a semantic-quality benchmark. Two useful runs on `shared_buffers=64MB`, fresh backend: `64 books x 128 paragraphs/book` (`14,549` rows): - `facts_heap seed_expand_rerank_rel_in`: `0.071 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.088 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.061 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.084 ms` `128 books x 256 paragraphs/book` (`58,954` rows): - `facts_heap seed_expand_rel_in`: `0.073 ms` - `facts_sh seed_expand_rel_in`: `0.078 ms` - `facts_sh seed_expand_rel_fn`: `0.069 ms` - `facts_heap seed_expand_rerank_rel_in`: `0.079 ms` - `facts_sh seed_expand_rerank_rel_in`: `0.101 ms` - `facts_sh seed_expand_rerank_rel_topk_fn`: `0.063 ms` - `facts_sh seed_graph_rag_rel_scan_fn`: `0.089 ms` This is the first non-synthetic result that materially weakens the earlier "heap+btree simply wins" story: - the plain `sorted_heap` SQL path is still worse than heap+btree - but the fused filtered helper on the real-text Gutenberg graph is already at parity or slightly better than heap+btree on the rerank path - the one-call wrapper is close enough that its overhead is visible but not disqualifying So the narrow-helper direction survives the real-text falsifier better than the short-payload synthetic benchmark suggested. ## pgvector parity on the real-text graph The Gutenberg harness also now supports a comparable `pgvector` path on the same graph: - ANN seeds come from a `facts_pgv` table with `vector(dim)` + HNSW - graph expansion and exact rerank still happen in PostgreSQL over the fact rows, which is the relevant GraphRAG shape This is important because a pure ANN benchmark would miss the real product question: how expensive is "ANN seed + graph expansion + exact rerank" as one workflow? 
On fresh-backend runs with `shared_buffers=64MB`:

`64 books x 128 paragraphs/book` (`14,549` rows):

- heap rerank baseline: `0.064 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.060 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.075 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.180 ms`

`128 books x 256 paragraphs/book` (`58,954` rows):

- heap rerank baseline: `0.085 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.071 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.087 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.295 ms`

The buffer footprint matches the latency story:

- the `sorted_heap` helper path stays around hundreds of shared-buffer hits
- the `pgvector` path needs several thousand shared-buffer hits before the same exact rerank step

This does **not** mean `pgvector` is bad at pure ANN. It means that for this GraphRAG workload shape, once the seed stage is followed by relational graph expansion and exact rerank, the narrow `sorted_heap` helper path is materially better aligned with the whole workflow than an external ANN seed on a separate table.

## zvec parity on the real-text graph

The same Gutenberg harness now also supports a comparable `zvec` path:

- ANN seeds come from a temporary `zvec` HNSW collection built from the same fact rows
- graph expansion and exact rerank still happen in PostgreSQL over `facts_heap`

This produced a mixed but useful result.

On the medium real-text slice (`64 books x 128 paragraphs/book`, `14,549` rows, fresh backend, `shared_buffers=64MB`):

- heap rerank baseline: `0.068 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.066 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.082 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `0.322 ms`

So on the medium slice, the `zvec` path is stable but materially slower than the fused `sorted_heap` helper. The SQL-side buffer footprint is not the bottleneck there; the external ANN seed stage dominates the total latency.

On the larger real-text slice (`128 books x 256 paragraphs/book`, `58,954` rows), the result is currently **not publishable as a clean latency row**:

- the `sorted_heap` helper path remains stable:
  - `sorted_heap_expand_rerank(... relation=2)`: `0.070 ms`
  - `sorted_heap_graph_rag_scan(... relation=2)`: `0.084 ms`
- the `zvec` path fails during ANN seed retrieval at `ann_k=32`

The failure is not coming from PostgreSQL or from the GraphRAG SQL wrapper.
A pure `zvec`-only reproduction on the same `58,954`-row lexical-hash corpus shows the same failure mode: - for one probe query, `topk=8` and `topk=10` return valid document IDs - `topk>=16` returns empty `doc.id` values after: - `Failed to find target chunk for index 58379` The Gutenberg GraphRAG harness now turns that into an explicit benchmark error: - `RuntimeError: zvec returned unmapped doc ids (...)` So the objective conclusion today is narrower than for `pgvector`: - `zvec` does not currently provide a robust large-slice GraphRAG parity row on this real-text workflow at `ann_k=32` - on the medium slice where it does run, it is materially slower than the fused `sorted_heap` helper path - on the larger slice, the current blocker is `zvec` ANN seed instability, not PostgreSQL expansion/rerank overhead That instability is now isolated more sharply by the repo-owned reproducer: - [`scripts/repro_zvec_gutenberg_threshold.py`](../scripts/repro_zvec_gutenberg_threshold.py) Current threshold signature on the lexical-hash Gutenberg corpus: - `topk=16`, `dim=32` - `64x256`, `80x256`, `96x256`, `112x256` slices are stable - `28,661`, `36,064`, `43,684`, `51,166` rows - `128x256` fails - `58,954` rows - first bad probe: `query #10` - returned ids are empty strings after `Failed to find target chunk for index 58379` So the current failure signature is not just "large-ish GraphRAG benchmark". It looks more like a size-thresholded `zvec` retrieval bug on this corpus shape. That theory is now falsified by a second repo-owned reproducer on a plain synthetic FP32 corpus: - [`scripts/repro_zvec_synthetic_threshold.py`](../scripts/repro_zvec_synthetic_threshold.py) Current synthetic signature: - `dim=32`, `ef_search=64` - `topk=7` already reproduces the issue - a compact failing case exists at `4,950` rows - nearby controls: - `4,900` rows: ok - `4,950` rows: bad - `5,000` rows: bad - `topk<=6` is clean on the `4,950`-row case - failures are non-monotonic by row count - bad: `16,000`, `20,000`, `28,000`, `30,000`, `45,000`, `60,000` - ok: `24,000`, `29,000`, `75,000` (`100` probe queries still clean at `75k`) - another local non-monotonic pocket exists around `7k-8k` - `7,000`: ok - `7,500`: bad - `7,800`: ok - `7,900`: bad - representative stderr lines: - `Failed to find target chunk for index 4945` - `Failed to find target chunk for index 14999` - `Failed to find target chunk for index 29999` - `Failed to find target chunk for index 59999` So the stronger objective conclusion is: - the failure is not Gutenberg-specific - it is not a simple monotonic "too many rows" threshold either - the current evidence points to a broader `zvec` retrieval defect around forward-store / chunk lookup, not to PostgreSQL GraphRAG expansion logic For an upstream-ready summary of the current evidence, see: - [`docs/zvec-empty-id-bug.md`](./zvec-empty-id-bug.md) Two more diagnostic observations make that conclusion sharper: - when the synthetic bug triggers, the ANN scores still come back while `doc.id` is empty for the whole result set - `4,950 rows`, `topk=6`: valid ids - `4,950 rows`, `topk=7`: same score bands, but every `doc.id` is `''` - on a larger synthetic case (`16,000` rows), exact cosine inspection shows the best-score bucket spans `1000, 2000, ..., 16000`, and `zvec` already returns empty ids at `topk=5` That does not prove the internal root cause, but it strongly suggests the ANN ranking stage is still producing plausible scores while the forward-store document lookup stage is failing. 
A reasonable working hypothesis is that some tied-score / candidate-materialization paths touch unresolved high indexes and poison metadata resolution for the whole returned batch. ## Qdrant parity on the real-text graph The Gutenberg harness now also supports a comparable `Qdrant` path: - ANN seeds come from a local Qdrant HNSW collection built from the same fact rows - graph expansion and exact rerank still happen in PostgreSQL over `facts_heap` Unlike `zvec`, this path stayed stable on both the medium and larger real-text slices. The result is simpler: `64 books x 128 paragraphs/book` (`14,549` rows): - heap rerank baseline: `0.074 ms` - `sorted_heap_expand_rerank(... relation=2)`: `0.062 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.083 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.535 ms` `128 books x 256 paragraphs/book` (`58,954` rows): - heap rerank baseline: `0.081 ms` - `sorted_heap_expand_rerank(... relation=2)`: `0.083 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.085 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.769 ms` So on this GraphRAG workflow shape: - Qdrant is robust on the real-text benchmark - but its external ANN seed stage dominates end-to-end latency - the fused `sorted_heap` helper remains roughly an order of magnitude faster on the rerank path That again does **not** mean Qdrant is a bad vector engine in isolation. It means that when the workflow is "external ANN seed + relational graph expansion + exact rerank inside PostgreSQL", the narrow in-engine helper path is much better aligned with the total job than a remote vector service. ## Robustness rerun The same real-text Gutenberg harness was then rerun with a larger query set (`query_count=64`, `runs=3`) to check whether the earlier `16`-query results were just small-sample noise. The ranking stayed the same on both slices: - medium slice (`64 x 128`): - `sorted_heap_expand_rerank(... relation=2)`: `0.062 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.081 ms` - `pgvector ANN -> heap expansion -> exact rerank`: `0.219 ms` - `zvec ANN -> heap expansion -> exact rerank`: `0.342 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.567 ms` - larger slice (`128 x 256`): - `sorted_heap_expand_rerank(... relation=2)`: `0.067 ms` - `sorted_heap_graph_rag_scan(... relation=2)`: `0.088 ms` - `pgvector ANN -> heap expansion -> exact rerank`: `0.309 ms` - `Qdrant ANN -> heap expansion -> exact rerank`: `1.911 ms` - `zvec` remains excluded from this large-slice rerun because the previously observed `ann_k=32` instability is still the blocker So the current GraphRAG conclusion is no longer resting on one short probe set. At least on this real-text Gutenberg workflow, the fused `sorted_heap` helper still has the best end-to-end latency profile after the query set is expanded. ## Two-hop Gutenberg composition The next adversarial question was whether the current helper story survives a real **two-hop** workflow, not just the earlier "ANN seeds -> one filtered expansion -> rerank" shape. The initial Gutenberg falsifier first used a composed path from the existing narrow primitives: 1. ANN seeds from the fact table 2. first hop via `sorted_heap_expand_ids(..., relation=2)` 3. second hop via `sorted_heap_expand_rerank(..., relation=2)` That composition benchmark was intentionally a harsher test than the earlier one-hop helper story, because it asked whether the current primitives were already enough to make multi-hop GraphRAG plausible before inventing a dedicated two-hop helper. 
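In sketch form, that composed path looks like this (illustrative only; `$1` is the query embedding, relation `2` is the Gutenberg `next` edge, and the harness's exact SQL may differ):

```sql
WITH seeds AS (
    -- 1. ANN seeds from the fact table
    SELECT target_id
    FROM facts_sh
    ORDER BY embedding <=> $1::svec
    LIMIT 32
),
hop1 AS (
    -- 2. first hop through the narrow expansion helper
    SELECT target_id
    FROM sorted_heap_expand_ids(
        'facts_sh'::regclass,
        ARRAY(SELECT target_id FROM seeds),
        2)                                   -- relation filter: next
)
-- 3. second hop fused with the exact top-K rerank
SELECT *
FROM sorted_heap_expand_rerank(
    'facts_sh'::regclass,
    ARRAY(SELECT target_id FROM hop1),
    $1::svec,
    10,                                      -- top_k
    2);                                      -- relation filter: next
```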
The answer was "yes, barely enough". That justified one narrow extra helper, not a new storage engine: ```sql sorted_heap_expand_twohop_rerank(...) ``` This fused helper keeps the same contract shape as the earlier rerank helper, but removes the intermediate SQL/materialization boundary between hop1 and the second-hop rerank. On the medium real-text slice (`64 books x 128 paragraphs/book`, `14,549` rows, `32D`, `query_count=64`, `runs=3`, fresh backend, `shared_buffers=64MB`): - heap baseline, `seed_expand2_rerank_rel_in`: `0.102 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.136 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.105 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.081 ms` So on the medium slice, the dedicated helper now does what the composed path only hinted at: - it beats heap+btree on latency - it materially beats the composed two-hop helper path - it also cuts shared-buffer hits strongly (`421` vs `1298` for the heap baseline, and `421` vs `662` for the composed helper) On the larger real-text slice (`128 books x 256 paragraphs/book`, `58,954` rows, same settings except the larger corpus): - heap baseline, `seed_expand2_rerank_rel_in`: `0.114 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.153 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.111 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.092 ms` So the larger slice confirms the same shape: the dedicated two-hop helper is not a tiny micro-win on one probe set; it keeps the lead over both heap+btree and the composed helper. The same medium two-hop slice was also benchmarked against the external ANN seed paths: - `pgvector ANN -> heap 2-hop expansion -> exact rerank`: `0.253 ms` - `zvec ANN -> heap 2-hop expansion -> exact rerank`: `0.374 ms` - `Qdrant ANN -> heap 2-hop expansion -> exact rerank`: `1.789 ms` So the product-level conclusion stays consistent in the two-hop case as well: the narrow in-engine `sorted_heap` helper remains the fastest end-to-end GraphRAG path among the tested competitors on this real-text slice. At higher exact-rerank dimension, the advantage narrows again rather than disappearing: `64 books x 128 paragraphs/book`, `384D`, `query_count=64`, `runs=3`: - heap baseline, `seed_expand2_rerank_rel_in`: `0.225 ms` - plain `sorted_heap` SQL, `seed_expand2_rerank_rel_in`: `0.266 ms` - helper-composed `sorted_heap`, `seed_expand2_rerank_rel_topk_fn`: `0.258 ms` - fused `sorted_heap_expand_twohop_rerank(...)`: `0.236 ms` Interpretation: - the dedicated helper makes two-hop GraphRAG clearly viable on the real-text Gutenberg path - the latency win is still not universal; at higher dimensions it narrows toward parity with heap+btree - but the locality signal remains stronger than latency alone suggests (`1264` shared hits for the fused helper vs `3155` for the heap baseline on the `384D` medium run) So the correct next inference is narrower than "we need a graph storage engine" and also narrower than "we need a broad graph query layer": > a dedicated but still narrow two-hop helper is justified; anything broader > should now be treated as product/API design, not as a prerequisite for making > two-hop GraphRAG fast enough to matter. ## Higher-dimension rerun The same medium Gutenberg slice (`64 books x 128 paragraphs/book`) was then rerun at higher lexical-hash embedding dimensions to test whether the earlier result depended too heavily on the cheap `32D` setting. 
At `128D` (`query_count=64`, `runs=3`):

- heap rerank baseline: `0.107 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.090 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.097 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.386 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `0.518 ms`
- `Qdrant ANN -> heap expansion -> exact rerank`: `1.732 ms`

At `384D` on the same slice:

- heap rerank baseline: `0.185 ms`
- `sorted_heap_expand_rerank(... relation=2)`: `0.186 ms`
- `sorted_heap_graph_rag_scan(... relation=2)`: `0.203 ms`
- `pgvector ANN -> heap expansion -> exact rerank`: `0.815 ms`
- `zvec ANN -> heap expansion -> exact rerank`: `1.101 ms`
- `Qdrant ANN -> heap expansion -> exact rerank`: `2.275 ms`

This changes the interpretation in one important way:

- the `sorted_heap` helper remains clearly best-aligned with the full GraphRAG workflow versus the external ANN paths
- but the win over the pure heap rerank baseline is **dimension-sensitive**
- by `384D`, exact rerank cost dominates enough that the fused helper is only at parity with heap+btree rather than clearly ahead

So the current evidence supports a narrower claim than "sorted_heap always wins GraphRAG":

> the fused `sorted_heap` helper is the best end-to-end path among the tested
> in-PG and external ANN competitors on this workflow shape, but its advantage
> over heap+btree narrows substantially as exact rerank dimension grows

One more tuning falsifier was useful here:

- dropping `ann_k` from `32` to `24` on the `384D` medium slice does reduce latency
- but it is **not** a free operating-point improvement
- a direct result-set comparison for `sorted_heap_graph_rag_scan(...)` on the `64`-query probe set showed mismatches on `62/64` queries versus `ann_k=32`

So the current faster-than-`ann_k=32` settings should be treated as a quality/latency tradeoff, not as a no-regression default recommendation.

One important measurement caveat was also discovered and fixed during this work:

- direct filtered `ORDER BY embedding <=> $query LIMIT K` on a base table with a `sorted_hnsw` index is **not** a valid GraphRAG baseline for current Phase 1 semantics
- the automatic `sorted_hnsw` path is now explicitly costed out when extra base-relation quals are present
- GraphRAG rerank baselines must therefore materialize the expanded set first, then rerank it

This is enough to falsify the pessimistic branch:

> the next useful GraphRAG step is not necessarily a new storage engine; a
> carefully scoped C primitive can already recover a substantial part of the
> lost latency

## Recommended roadmap

### Phase 0 — completed

- Build local prototype benchmark
- Falsify naive SQL assumptions

### Phase 1 — current

`sorted_heap_expand_ids()` is implemented and regression-covered.

### Phase 2 — current

`sorted_heap_expand_rerank()` is implemented and regression-covered.

Current success criterion that was met:

- beats the current `sorted_heap` SQL `seed_expand_in` / `seed_expand_rerank_in` patterns at medium scale

Current gap that remains:

- pure heap+btree expansion is still faster on this synthetic benchmark

### Phase 3 — current

The GraphRAG composition query exists and is exercised by the benchmark (`seed_expand_rerank_topk_fn`):

- ANN seed in SQL via `sorted_hnsw`
- expansion via `sorted_heap_expand_ids()`
- rerank via `sorted_heap_expand_rerank()` or SQL over materialized expansion

### Phase 4 — current

`sorted_heap_graph_rag_scan()` is now implemented as the narrow one-call composition wrapper.
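A minimal usage sketch of that wrapper, using the documented signature (`$1` is the query embedding; the relation filter is optional):

```sql
-- ANN seeding, expansion, and fused top-K rerank in one call.
SELECT entity_id, relation_id, target_id, payload, distance
FROM sorted_heap_graph_rag_scan(
    'facts_sh'::regclass,
    $1::svec,
    32,    -- ann_k: ANN seed count
    10,    -- top_k: reranked rows returned
    2);    -- relation_filter (e.g. the Gutenberg "next" relation)
```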
### Phase 5 — current `sorted_heap_expand_twohop_rerank()` is now implemented as the narrow fused two-hop helper. Current success criterion that was met: - beats the previous composed two-hop helper on the real-text Gutenberg graph - beats heap+btree on the medium and larger `32D` two-hop slices Current gap that remains: - at `384D`, the fused two-hop helper narrows to near-parity with heap+btree rather than keeping a clear lead ### Phase 6 — next Only if the current two-hop and one-call wrappers still leave meaningful headroom: - consider a broader wrapper for: - ANN seed IDs - two-hop expansion - rerank - or tune candidate count / rerank workload rather than broadening the API ## Cogniformerus-style multihop facts The real missing falsifier was not another paragraph graph slice. It was a benchmark that matches the current `cogniformerus` multihop question shape: - fact `1`: `person -> parent` - fact `2`: `parent -> city` - query: `Where does the parent of Person_i live?` That now exists in: - [`scripts/bench_graph_rag_multihop.py`](../scripts/bench_graph_rag_multihop.py) - [`scripts/sweep_graph_rag_multihop.py`](../scripts/sweep_graph_rag_multihop.py) The benchmark builds a deterministic fact graph and measures: - latency - `hit@1` - `hit@k` for the expected final `city` fact after two-hop expansion and rerank. ### Important contract discovery This benchmark immediately exposed a semantic limitation in the current convenience wrapper: - `sorted_heap_graph_rag_scan()` seeds expansion from ANN `target_id` - that is a good fit for the Gutenberg `paragraph -> next_paragraph` graph - it is **not** the right seed contract for the fact benchmark above - the fact benchmark needs ANN seeds based on `entity_id`, then: - hop 1 on relation `1` - hop 2 on relation `2` So the current one-call wrapper is still too specialized for this workload shape. The lower-level helper family is fine; the wrapper contract is the narrow part. That gap is now closed by: - `sorted_heap_graph_rag_twohop_scan(...)` This wrapper keeps the fact-shaped contract narrow: - ANN seed on `entity_id` - hop 1 relation filter - hop 2 relation filter - final rerank delegated to `sorted_heap_expand_twohop_rerank(...)` ### Early failure that mattered At `32D`, the fact benchmark initially produced very poor answer retrieval. That was a benchmark-quality failure, not a helper failure: - the first draft seeded on `target_id`, which was the wrong graph contract - after fixing that, the deterministic query embedding was still too weak at low dimension to make the question reliably retrievable So the publishable multihop results start at `384D`, where the question shape becomes stable enough that latency numbers mean something. 
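For reference, the fact-shaped contract described above seeds on `entity_id` and then runs both hops through the documented fused helper; the new wrapper internalizes roughly this composition. An illustrative sketch (`$1` is the query embedding):

```sql
WITH seeds AS (
    -- ANN seeds on entity_id, not target_id
    SELECT entity_id
    FROM facts_sh
    ORDER BY embedding <=> $1::svec
    LIMIT 64                                  -- ann_k
)
SELECT *
FROM sorted_heap_expand_twohop_rerank(
    'facts_sh'::regclass,
    ARRAY(SELECT entity_id FROM seeds),
    $1::svec,
    10,    -- top_k
    1,     -- hop 1: person -> parent
    2);    -- hop 2: parent -> city
```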
### Tuned 384D result On `5K` multihop chains (`10K` rows total), `64` queries, `3` runs, `shared_buffers=64MB`, fresh backend, with: - `ann_k=64` - `sorted_hnsw.ef_search=64` - `ef_construction=200` the current frontier is: - heap composed two-hop SQL - `0.515 ms` - `hit@1 = 71.9%` - `hit@k = 85.9%` - `sorted_heap` composed two-hop helper - `0.471 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `sorted_heap_expand_twohop_rerank()` - `0.442 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `sorted_heap_graph_rag_twohop_scan()` - `0.417 ms` - `hit@1 = 71.9%` - `hit@k = 84.4%` - pgvector - `1.397 ms` - `hit@1 = 70.3%` - `hit@k = 87.5%` - zvec - `1.076 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.921 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` Interpretation: - the fused two-hop helper is now the **fastest PostgreSQL path** on this fact-shaped workload - the new fact-shaped one-call wrapper stays effectively at parity with the fused helper, so this time the convenience API does **not** erase the win - it remains materially faster than pgvector on the same workflow - it is **not** the quality leader at this operating point - zvec and Qdrant still win on answer retrieval quality here, but at much higher latency ### Seed frontier after the wrapper fix The next honest question was not API shape but ANN seed quality. That is now measured directly by: - [`scripts/sweep_graph_rag_multihop.py`](../scripts/sweep_graph_rag_multihop.py) This harness keeps the corpus fixed per `ef_construction` and sweeps: - `m` - `ann_k` - `sorted_hnsw.ef_search` - `ef_construction` without paying a full temp-cluster and schema rewrite for every single probe point. On the same `5K` chains / `10K` rows / `384D` / `64` queries / fresh-backend benchmark, the stable wrapper frontier is now: - `ef_construction=64`, `ann_k=64`, `ef_search=64` - `0.386 ms` - `hit@1 = 70.3%` - `hit@k = 82.8%` - `ef_construction=200`, `ann_k=64`, `ef_search=64` - `0.393 ms` - `hit@1 = 71.9%` - `hit@k = 84.4%` - `ef_construction=400`, `ann_k=64`, `ef_search=64` - `0.421 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - `ef_construction=200`, `ann_k=64`, `ef_search=128` - `0.651 ms` - `hit@1 = 73.4%` - `hit@k = 95.3%` - `ef_construction=400`, `ann_k=64`, `ef_search=128` - `0.663 ms` - `hit@1 = 75.0%` - `hit@k = 95.3%` For a higher-quality but much slower seed tier: - `ann_k=96`, `ef_search=64` lands around `2.2-2.4 ms` with `hit@k = 96.9%` That leads to a narrower, more honest recommendation: - if latency is the hard constraint, keep the fast tier near `ef_construction=200`, `ann_k=64`, `ef_search=64` - if answer quality matters more, the best balanced point we measured is `ef_construction=200`, `ann_k=64`, `ef_search=128` - `ef_construction=400` does improve `hit@1` slightly at the same `95.3%` `hit@k`, but it does not improve `hit@k` over `200`, so it should not be the default recommendation without a separate build-cost justification That build-cost justification now exists too on this exact `10K x 384D` multihop benchmark: - `ef_construction=64`: `43.716 s` to build both ANN indexes - `ef_construction=200`: `80.046 s` - `ef_construction=400`: `91.352 s` So the current recommendation is: - default to `ef_construction=200` - treat `ef_construction=400` as a niche `hit@1` knob, not the new default ### `m` frontier on the same multihop benchmark The next useful falsifier was whether graph degree buys more than another `ef_construction` increase. 
Keeping: - `ef_construction=200` - `ann_k=64` - `64` queries - `3` runs - fresh backend the `m` sweep came out as: - `m=16`, `ef_search=64` - `0.405 ms` - `hit@1 = 71.9%` - `hit@k = 87.5%` - `m=24`, `ef_search=64` - `0.466 ms` - `hit@1 = 75.0%` - `hit@k = 93.8%` - `m=32`, `ef_search=64` - `0.491 ms` - `hit@1 = 78.1%` - `hit@k = 93.8%` - `m=16`, `ef_search=128` - `0.672 ms` - `hit@1 = 73.4%` - `hit@k = 95.3%` - `m=24`, `ef_search=128` - `0.738 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `m=32`, `ef_search=128` - `0.771 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` The one-off build-cost probe for the same `10K x 384D` graph was: - `m=16`, `ef_construction=200`: `79.425 s` - `m=24`, `ef_construction=200`: `86.562 s` - `m=32`, `ef_construction=200`: `75.404 s` That last `m=32` build number should be treated cautiously; it was a single one-off probe and is likely noisy enough that only the query-time frontier is trustworthy here. The stable conclusion is still clear: - `m=24` is the best current quality-per-latency tradeoff we measured - `m=32` buys a little more `hit@1`, but no additional `hit@k` - so for fact-shaped multihop GraphRAG, the best current balanced point is: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` One more ann_k falsifier matters here too: - increasing `ann_k` above `64` at this `m=24 / ef_construction=200 / ef_search=128` point did **not** help - `ann_k=80/96/128` all increased latency and reduced `hit@k` - so `ann_k=64` remains the current sweet spot, not just a legacy default ### Full parity rerun at the balanced point Re-running the full multihop parity benchmark on that exact setting: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` - `64` queries - `3` runs - `384D` produced: - heap two-hop SQL - `0.762 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.726 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_graph_rag_twohop_scan()` - `0.727 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - pgvector - `1.244 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - zvec - `0.927 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.417 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` That is a materially stronger result than the earlier `m=16` baseline: - the fused `sorted_heap` path now matches `zvec` and `Qdrant` on `hit@k` - it stays faster than both external paths - it also beats pgvector on both latency and answer quality on this workload - `zvec` and `Qdrant` still keep a small `hit@1` edge, so the answer-quality story is now about `hit@1`, not `hit@k` ### Full parity rerun at the higher-quality point The next question was whether that remaining `hit@1` gap could be closed without giving back the latency lead. 
Re-running the same full parity benchmark at: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` produced: - heap two-hop SQL - `0.810 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.774 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - `sorted_heap_graph_rag_twohop_scan()` - `0.786 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - pgvector - `1.220 ms` - `hit@1 = 70.3%` - `hit@k = 84.4%` - zvec - `0.874 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `2.487 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` So the current picture is now more precise: - `m=24` is still the better quality-per-latency recommendation - `m=32` is the point where `sorted_heap` reaches full observed parity with `zvec` and Qdrant on both `hit@1` and `hit@k` - even at that higher-quality point, the `sorted_heap` helper remains faster than both external paths - pgvector remains behind on both latency and answer quality on this workload ### AWS ARM64 parity rerun (`5K` chains) The next environment-variance adversary check was to rerun the same `5K`-chain / `10K`-row / `384D` fact benchmark on an AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) using the repo-owned wrapper: - [`scripts/bench_graph_rag_multihop_aws.sh`](../scripts/bench_graph_rag_multihop_aws.sh) At the previously recommended local balanced point: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` - `64` queries - `3` runs - fresh backend the AWS rerun produced: - heap two-hop SQL - `1.087 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - `sorted_heap_expand_twohop_rerank()` - `0.947 ms` - `hit@1 = 76.6%` - `hit@k = 98.4%` - `sorted_heap_graph_rag_twohop_scan()` - `1.004 ms` - `hit@1 = 76.6%` - `hit@k = 98.4%` - pgvector - `1.296 ms` - `hit@1 = 70.3%` - `hit@k = 85.9%` - zvec - `1.646 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` - Qdrant - `3.396 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` That is stronger than the local balanced point in one important way: - on this AWS rerun, `sorted_heap` does not just match `zvec` and Qdrant on `hit@k`; it exceeds them (`98.4%` vs `96.9%`) while staying faster than both But the second half of the adversary check matters too. Re-running the same AWS benchmark at the local higher-quality point: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` produced: - `sorted_heap_graph_rag_twohop_scan()` - `1.066 ms` - `hit@1 = 76.6%` - `hit@k = 96.9%` So the local `m=32` parity story does **not** carry over unchanged to this AWS ARM64 environment. The portable conclusion is therefore narrower: - `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` is the current best verified cross-environment point - local and AWS frontiers are directionally consistent, but not numerically identical - this is exactly why the AWS rerun is worth keeping as a separate falsifier, not merging blindly into the local tuning story ### Larger local scale check (`10K` chains) The next adversary check was whether the `5K`-chain tuning carried forward to a larger local fact graph without retuning. On `10K` chains (`20K` rows total), `64` queries, `384D`, fresh backend: - `m=24`, `ef_construction=200`, `ann_k=64`, `ef_search=128` - `sorted_heap_graph_rag_twohop_scan()` -> `0.885 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `m=32`, `ef_construction=200`, `ann_k=64`, `ef_search=128` - `sorted_heap_graph_rag_twohop_scan()` -> `0.972 ms` - `hit@1 = 73.4%` - `hit@k = 93.8%` So the `5K`-chain operating point does **not** generalize unchanged. 
The next narrow falsifier was whether this larger-graph drop was just a search beam issue. Sweeping `ef_search` upward at `m=32` gave: - `ef_search=192` - `1.310 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `ef_search=256` - `1.734 ms` - `hit@1 = 78.1%` - `hit@k = 95.3%` That is a useful but incomplete recovery: - higher `ef_search` does recover part of the quality loss - it does **not** recover the earlier `96.9% hit@k` local point - so the larger-graph gap is not purely a beam-width problem The next falsifier after that was stronger graph construction. On the same `10K`-chain graph, keeping `m=32`, `ann_k=64`, and comparing `ef_construction=200` vs `400` gave: - at `ef_search=128` - `ef_construction=200` -> `0.976 ms`, `hit@1 = 75.0%`, `hit@k = 93.8%` - `ef_construction=400` -> `1.094 ms`, `hit@1 = 75.0%`, `hit@k = 93.8%` - at `ef_search=192` - `ef_construction=200` -> `1.357 ms`, `hit@1 = 76.6%`, `hit@k = 95.3%` - `ef_construction=400` -> `1.381 ms`, `hit@1 = 76.6%`, `hit@k = 95.3%` So this larger-graph gap is not fixed by a simple `ef_construction=400` bump either. The current best explanation is therefore narrower: - the verified `5K`-chain local frontier is real - the same operating points do not carry forward unchanged to `10K` chains - and the obvious local rescue knobs (`ef_search`, `ef_construction`) only recover part of the drop That is enough to stop local knob-turning for this pass. The next useful step would be a different class of experiment, not more of the same sweep. The next adversary check after that was whether this larger-graph caveat was just a local-machine artifact. Re-running the `10K`-chain benchmark on the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) showed that it is not. At the same balanced portable point: - `m=24` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=128` the AWS rerun produced: - heap two-hop SQL - `1.389 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `sorted_heap_expand_twohop_rerank()` - `1.190 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - `sorted_heap_graph_rag_twohop_scan()` - `1.248 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` That essentially matches the larger local result. So the `10K`-chain drop is cross-environment robust, not just a local Apple/M-series artifact. The one meaningful local rescue point transferred cleanly to AWS too. Re-running the `10K`-chain benchmark at: - `m=32` - `ef_construction=200` - `ann_k=64` - `sorted_hnsw.ef_search=192` produced: - heap two-hop SQL - `1.896 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `sorted_heap_expand_twohop_rerank()` - `1.617 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` - `sorted_heap_graph_rag_twohop_scan()` - `1.687 ms` - `hit@1 = 76.6%` - `hit@k = 95.3%` So the larger-scale picture is now materially stronger: - the `10K`-chain quality drop is cross-environment robust - the best current larger-graph recovery point is also cross-environment robust: `m=32 / ef_search=192` - but even that recovery point does **not** restore the earlier `5K`-chain `98.4% hit@k` AWS frontier - so the remaining gap is unlikely to be solved by another trivial `ef_search` or `m` tweak alone ### Exact-seed upper-bound diagnostic The next root-cause check was to remove ANN approximation from the seed stage entirely. The multihop harness now supports an `--exact-seed-diagnostics` mode, which replaces ANN seed retrieval with exact brute-force top-K seeds on `facts_heap`, then reuses the same graph expansion/rerank path. 
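Conceptually, the exact-seed stage is just a brute-force scan. A sketch, not the harness's literal mechanism; note this is only exact when the planner does not route the `ORDER BY` through an ANN index:

```sql
-- Exact top-K seed upper bound: rank facts_heap rows by true distance.
SELECT entity_id
FROM facts_heap
ORDER BY embedding <=> $1::svec
LIMIT 64;    -- same budget as ann_k
```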
This matters because it separates two very different explanations:

- "the remaining gap is caused by approximate ANN seeds"
- "the remaining gap is already in the benchmark/query/task shape"

On the `5K`-chain balanced local point:

- `m=24`
- `ef_construction=200`
- `ann_k=64`
- `sorted_hnsw.ef_search=128`

the exact-seed diagnostic did **not** improve quality:

- ANN-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.702 ms`
  - `hit@1 = 75.0%`
  - `hit@k = 96.9%`
- exact-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.811 ms`
  - `hit@1 = 75.0%`
  - `hit@k = 96.9%`

And the seed-stage diagnostic showed no hidden ANN loss there either:

- ANN seeds
  - `seed_person_pct = 98.4%`
  - `expanded_city_pct = 98.4%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 6`
  - `city_rank_max = 17`
- exact seeds
  - `seed_person_pct = 98.4%`
  - `expanded_city_pct = 98.4%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 6`
  - `city_rank_max = 17`

So even at `5K`, the final `96.9% hit@k` is already below seed coverage. But the rerank distribution is still concentrated: the correct city stays within the top 6 for 95% of reachable queries, and the miss comes from a small number of sharper outliers.

On the `10K`-chain balanced local point:

- `m=24`
- `ef_construction=200`
- `ann_k=64`
- `sorted_hnsw.ef_search=128`

the exact-seed diagnostic again did **not** improve quality:

- ANN-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.839 ms`
  - `hit@1 = 71.9%`
  - `hit@k = 92.2%`
- exact-seeded `sorted_heap_expand_twohop_rerank()`
  - `0.947 ms`
  - `hit@1 = 71.9%`
  - `hit@k = 92.2%`

The seed-stage diagnostic was even more revealing on `10K`:

- ANN seeds
  - `seed_person_pct = 96.9%`
  - `expanded_city_pct = 96.9%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 3`
  - `city_rank_max = 20`
- exact seeds
  - `seed_person_pct = 96.9%`
  - `expanded_city_pct = 96.9%`
  - `avg_person_rank = 1.00`
  - `city_rank_p95 = 3`
  - `city_rank_max = 19`

So the larger-graph gap is **not** coming from missing the correct seed fact. At `10K`, seed coverage stays at `96.9%`, but final `hit@k` drops to `92.2%`. And it is not a broad rerank collapse either: for 95% of reachable queries the correct city still ranks in the top 3, but a few outliers fall as far as rank `19-20`, which is enough to miss `top_k = 10`.

This is a strong falsifier:

- on this synthetic fact benchmark, the current `5K` and `10K` frontiers are **not** ANN-approximation limited at the tested operating points
- ANN and exact seeds have identical seed coverage on both scales
- the remaining gap is mostly an outlier-ranking problem, not a broad seed or rerank failure
- exact seeds cost extra latency but do not recover answer quality
- so the next meaningful gain is unlikely to come from more seed-ANN tuning alone

The remaining gap now looks more like a property of the task construction, query embedding, or graph benchmark semantics than of `sorted_hnsw` approximation itself. More specifically: the dominant remaining loss now looks downstream of seed retrieval, not inside it, and it is concentrated in a small set of bad cases rather than a general degradation across the query set.

So the honest story on this fact benchmark is a latency/quality frontier:

- `sorted_heap_expand_twohop_rerank()` leads on latency
- `zvec` and Qdrant lead on answer quality at this operating point, at materially higher latency

### Path-aware rerank diagnostic

The next falsifier was to keep the same ANN seeds and the same two-hop expansion, but change only the final scorer. The current multihop helper reranks on the hop-2 city fact embedding alone.
A path-aware SQL baseline was added to the harness that scores each candidate as: - `path_distance = (hop1_embedding <=> query) + (hop2_embedding <=> query)` That simple change materially improved answer quality on the same balanced points: - `5K` chains, `m=24`, `ef_construction=200`, `ann_k=64`, `sorted_hnsw.ef_search=128` - city-only `sorted_heap_graph_rag_twohop_scan()` - `0.762 ms` - `hit@1 = 75.0%` - `hit@k = 96.9%` - path-aware SQL rerank on `facts_sh` - `0.957 ms` - `hit@1 = 98.4%` - `hit@k = 98.4%` - `10K` chains, same knobs - city-only `sorted_heap_graph_rag_twohop_scan()` - `0.937 ms` - `hit@1 = 71.9%` - `hit@k = 92.2%` - path-aware SQL rerank on `facts_sh` - `1.179 ms` - `hit@1 = 95.3%` - `hit@k = 96.9%` This is the strongest current architectural signal on the fact-shaped benchmark: - the remaining quality gap is not well explained by seed recall - it is also not well explained by broad rerank collapse - a simple path-aware scorer recovers most of the lost quality with only a modest latency increase That branch is now implemented locally too: - `sorted_heap_expand_twohop_path_rerank(...)` - `sorted_heap_graph_rag_twohop_path_scan(...)` And the fused helper beats the SQL path-aware baseline on the same balanced points: - `5K` chains - SQL path-aware baseline: `0.847 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - fused helper: `0.726 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - one-call wrapper: `0.739 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - `10K` chains - SQL path-aware baseline: `0.942 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` - fused helper: `0.823 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` - one-call wrapper: `0.834 ms`, `hit@1 = 95.3%`, `hit@k = 96.9%` So for multihop fact retrieval, the next serious question is no longer whether path-aware rerank helps. It does. The next question is whether this new helper/wrapper transfers cleanly to AWS and then to a real `cogniformerus`-like corpus. That AWS transfer is now verified too. On AWS ARM64 (`4 vCPU`, `8 GiB RAM`), at the same balanced `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` point: - `5K` chains - heap two-hop SQL: `1.088 ms`, `hit@1 = 75.0%`, `hit@k = 96.9%` - city-only wrapper: `1.012 ms`, `hit@1 = 75.0%`, `hit@k = 96.9%` - SQL path-aware baseline: `1.204 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - fused helper: `0.955 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - one-call path-aware wrapper: `1.018 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` - pgvector + heap expansion, same path-aware scorer: `1.422 ms`, `hit@1 = 85.9%`, `hit@k = 85.9%` - zvec + heap expansion, same path-aware scorer: `1.720 ms`, `hit@1 = 100.0%`, `hit@k = 100.0%` - Qdrant + heap expansion, same path-aware scorer: `3.435 ms`, `hit@1 = 100.0%`, `hit@k = 100.0%` - `10K` chains, same knobs - heap two-hop SQL: `1.319 ms`, `hit@1 = 71.9%`, `hit@k = 92.2%` - city-only wrapper: `1.197 ms`, `hit@1 = 73.4%`, `hit@k = 93.8%` - SQL path-aware baseline: `1.436 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` - fused helper: `1.185 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` - one-call path-aware wrapper: `1.212 ms`, `hit@1 = 96.9%`, `hit@k = 98.4%` So the answer to the transfer question is now yes: the path-aware helper and wrapper survive the AWS move cleanly, and the old larger-scale caveat narrows substantially once the rerank contract is fixed. This also closes the earlier apples-to-apples gap. 
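For reference, the path-aware contract scores a candidate by summing both hop distances. A minimal SQL sketch over the fact schema above, assuming `$1` is the query embedding and `$2` is a hypothetical array of ANN-seeded entity IDs:

```sql
WITH hop1 AS (
    SELECT *
    FROM facts_sh
    WHERE relation_id = 1                      -- person -> parent
      AND entity_id = ANY ($2::int4[])
),
hop2 AS (
    SELECT f.*, h1.embedding AS hop1_embedding
    FROM facts_sh f
    JOIN hop1 h1 ON f.entity_id = h1.target_id
    WHERE f.relation_id = 2                    -- parent -> city
)
SELECT entity_id, relation_id, target_id, payload,
       (hop1_embedding <=> $1::svec)
     + (embedding <=> $1::svec) AS path_distance
FROM hop2
ORDER BY path_distance
LIMIT 10;                                      -- top_k
```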
Once all engines are scored under the same path-aware contract:

- `sorted_heap` is the latency leader
- `zvec` and Qdrant hold the strongest observed answer quality
- `pgvector` remains behind on both latency and quality at this operating point

One AWS all-engines rerun briefly dropped the `sorted_heap` path-aware rows to `96.9% / 96.9%`, but an immediate `sorted_heap`-only control and a second full rerun both returned `98.4% / 98.4%`. So the portable parity story now has one verified outlier plus two confirming reruns. That was enough to justify the benchmark note, and it directly motivated the repeated-build protocol recorded below.

## Repeated-build local variance

The repeated-build variance protocol is repo-owned:

- [`scripts/repeat_graph_rag_multihop_builds.py`](../scripts/repeat_graph_rag_multihop_builds.py)
- [`scripts/repeat_graph_rag_multihop_builds_aws.sh`](../scripts/repeat_graph_rag_multihop_builds_aws.sh)

The local driver wraps [`scripts/bench_graph_rag_multihop.py`](../scripts/bench_graph_rag_multihop.py) so each repeat gets a fresh temp cluster and a fresh HNSW build, then reports median / min / max for selected rows.

On the balanced local `5K` point (`m=24 / ef_construction=200 / ann_k=64 / ef_search=128`), three independent rebuilds produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `0.798`, range `0.771-0.819`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `0.796`, range `0.778-0.804`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.405`, range `1.318-1.456`
  - `hit@1/hit@k`: `85.9-89.1%`
- `zvec` path-aware parity row
  - `p50_ms`: median `1.076`, range `1.053-1.087`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `2.799`, range `2.792-2.805`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds

So the balanced local path-aware `sorted_heap` point is not just a lucky single build. The answer quality stayed fixed across rebuilds, and the latency spread was narrow.
The remaining variance story now looks more like:

- local balanced `sorted_heap`: stable across rebuilds
- AWS balanced `sorted_heap`: also stable across repeated builds on the `5K` point, with one earlier outlier now downgraded to an anomaly
- `pgvector`: measurable quality drift across local rebuilds
- `zvec` / `Qdrant`: stable on this deterministic local fact graph

The AWS repeated-build protocol on the balanced `5K` point produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `0.962`, range `0.956-0.965`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `1.025`, range `1.018-1.043`
  - `hit@1 = 98.4%`, `hit@k = 98.4%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.434`, range `1.370-1.493`
  - `hit@1/hit@k`: `84.4-89.1%`
- `zvec` path-aware parity row
  - `p50_ms`: median `1.711`, range `1.703-1.768`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `3.355`, range `3.302-3.465`
  - `hit@1 = 100.0%`, `hit@k = 100.0%` on all three builds

So the current confidence picture is stronger than before:

- local balanced `5K`: repeated-build stable
- AWS balanced `5K`: repeated-build stable
- larger `10K` AWS path-aware rows: repeated-build stable too, but at a lower quality frontier than `5K`

The AWS repeated-build protocol on the larger `10K` point produced:

- `sorted_heap_expand_twohop_path_rerank()`
  - `p50_ms`: median `1.177`, range `1.148-1.191`
  - `hit@1 = 95.3%`, `hit@k = 96.9%` on all three builds
- `sorted_heap_graph_rag_twohop_path_scan()`
  - `p50_ms`: median `1.236`, range `1.211-1.240`
  - `hit@1 = 95.3%`, `hit@k = 96.9%` on all three builds
- `pgvector` path-aware parity row
  - `p50_ms`: median `1.667`, range `1.665-1.676`
  - `hit@1/hit@k`: `76.6-82.8%`
- `zvec` path-aware parity row
  - `p50_ms`: median `2.788`, range `2.762-2.789`
  - `hit@1 = 98.4%`, `hit@k = 100.0%` on all three builds
- `Qdrant` path-aware parity row
  - `p50_ms`: median `3.818`, range `3.788-3.846`
  - `hit@1 = 98.4%`, `hit@k = 100.0%` on all three builds

This sharpens the conclusion again:

- the `10K` AWS point is no longer a variance question
- it is a real scale frontier
- `sorted_heap` remains the latency leader there
- `zvec` and Qdrant still lead on answer quality

This also falsifies one tempting but wrong simplification:

> once the helper is fast, the remaining GraphRAG problem is solved

Not quite. On fact-shaped multihop queries, seed ANN quality and graph build quality still matter enough that `ann_k`, `ef_search`, and the HNSW build parameters (`m`, `ef_construction`) remain first-class tuning knobs. But the old hop-2-only rerank contract was a separate, larger problem, and the new path-aware helper fixes most of it on the current local benchmark.

## Current verdict

`sorted_heap` already has a plausible GraphRAG foundation, and the new helper proves that a narrow C primitive can materially improve the GraphRAG path.
What is now true:

- SQL-only GraphRAG composition was not enough
- `sorted_heap_expand_ids()` is enough to recover a large part of that gap
- `sorted_heap_expand_rerank()` recovers most of the rerank overhead on the current `sorted_heap` path
- `sorted_heap_graph_rag_scan()` makes the composition available as a single SQL call without giving back much latency
- `sorted_heap_expand_twohop_rerank()` turns the earlier two-hop composition evidence into a real latency win on the real-text Gutenberg slices we tested
- on the cogniformerus-style `person -> parent -> city` benchmark, the fused two-hop helper is the fastest PostgreSQL path we tested
- `sorted_heap_graph_rag_twohop_scan()` closes the current fact-shaped wrapper gap without materially giving back latency
- `sorted_heap_expand_twohop_path_rerank()` upgrades the fact-shaped rerank contract to use hop-1 and hop-2 evidence together
- `sorted_heap_graph_rag_twohop_path_scan()` makes that path-aware contract available as a single-call primitive
- the path-aware helper and wrapper transfer cleanly from local to AWS ARM64 on the same balanced `m=24 / ef_construction=200 / ann_k=64 / ef_search=128` point
- the narrow-helper direction is a justified building block
- the current helper model already composes into a competitive two-hop real-text GraphRAG path on Gutenberg without requiring a new graph API
- on the real-text GraphRAG shape, `pgvector` parity is already materially worse end-to-end than the fused `sorted_heap` helper path
- on the fact-shaped AWS path-aware benchmark, `sorted_heap` is now the fastest verified end-to-end path, while `zvec` and Qdrant remain the answer quality leaders
- `zvec` is stable on the medium slice but currently not robust on the larger real-text slice at `ann_k=32`
- `Qdrant` is robust on both real-text slices but materially slower than the fused `sorted_heap` helper on the same workflow

What is not yet true:

- `sorted_heap` is not yet clearly better than heap+btree on pure expansion latency for this synthetic workload
- even the relation-filtered GraphRAG path still trails heap+btree slightly on this synthetic benchmark
- two-hop helper composition is not yet a universal latency win; at higher rerank dimensions it narrows to parity with heap+btree rather than staying clearly ahead
- transfer to a larger real `cogniformerus` corpus is still unverified: the current fact-shaped benchmark suite is deterministic and synthetic, so even though it matches the intended multihop query shape, the remaining generalization gap is about workload realism more than about build variance

## Actual Butler gate seed-corpus smoke

The next honest step after the synthetic-chain work was to stop guessing and run the path-aware GraphRAG helpers on the actual tiny multihop corpus that `cogniformerus` already ships in its Butler gate smoke:

- source: `cogniformerus/bin/butler_small_model_eval.cr`
- repo-owned fixture: [`scripts/fixtures/graph_rag_butler_gate_seed.json`](../scripts/fixtures/graph_rag_butler_gate_seed.json)
- harness: [`scripts/bench_graph_rag_butler_gate.py`](../scripts/bench_graph_rag_butler_gate.py)

This fixture is intentionally tiny:

- `7` graph facts loaded into `facts_heap` / `facts_sh`
- `2` positive multihop queries
  - `Project Atlas -> Orion -> Helsinki`
  - `Release 13 -> Aurora -> April`

So it is not a publishable latency frontier.
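Shape-wise, though, each positive query walks exactly the two-hop pattern the helpers fuse; a minimal sketch of that raw path, with placeholder parameters only:

```sql
-- Hedged sketch of the raw two-hop path the fixture exercises: hop 1 from
-- the seed entities, hop 2 from each hop-1 target. :seed_ids and the final
-- ordering are placeholders, not fixture values.
SELECT h1.target_id AS hop1, h2.target_id AS hop2, h2.payload
FROM facts_sh AS h1
JOIN facts_sh AS h2
  ON h2.entity_id = h1.target_id
WHERE h1.entity_id = ANY(:seed_ids)
ORDER BY h2.embedding <=> :query_vec
LIMIT 4;
```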
Its job is narrower: - verify that the current path-aware helper and wrapper work on the real Butler gate fact texts and prompts - replace the previous blanket statement "real cogniformerus still unverified" with a tighter one: the actual gate seed corpus is covered, but larger real corpora are not The first local smoke run on this real gate seed corpus used: - `384D` - `ann_k=4` - `top_k=4` - `m=24` - `ef_construction=200` - `sorted_hnsw.ef_search=64` - `5` timing runs on a fresh temp cluster Result: - heap path-aware SQL baseline: - `p50 0.027 ms` - `hit@1/hit@k = 100/100` - `facts_sh` path-aware SQL baseline: - `p50 0.026 ms` - `hit@1/hit@k = 100/100` - `sorted_heap_expand_twohop_path_rerank()`: - `p50 0.017 ms` - `hit@1/hit@k = 100/100` - `sorted_heap_graph_rag_twohop_path_scan()`: - `p50 0.045 ms` - `hit@1/hit@k = 100/100` This does not prove scale behavior. It proves something narrower and still useful: the current path-aware GraphRAG helper/wrapper contract works on the actual Butler gate seed facts and prompts, not only on the synthetic `person -> parent -> city` generator. One adversary control also mattered here: this was not only a pass at a near-full seed budget. Re-running the same smoke at `ann_k=2`, `top_k=2` still kept both multihop queries at `100/100`. The correct next step is therefore: > **tune the current narrow helper family before considering a bigger > graph-specific subsystem** That remains the smallest change that can still convert the observed block-pruning advantage into an end-to-end query win. ## Real code-corpus prototype The next honest check after the Butler gate fact smoke was not another synthetic graph. It was the actual `cogniformerus` code corpus plus the real cross-file question bank already used by Butler's own code benchmark. - source tree: `cogniformerus/src/cogniformerus` - question source: `cogniformerus/bin/butler_code_test.cr` - harness: [`scripts/bench_graph_rag_code_corpus.py`](../scripts/bench_graph_rag_code_corpus.py) This harness builds a narrow code-GraphRAG shape: - each source file is one entity - each chunk in that file becomes one fact row - `entity_id = file_id` - `relation_id = HAS_CHUNK` - `target_id = chunk_id` - query quality is scored against the real CrossFile benchmark keywords from `butler_code_test.cr` This is not a full code graph. 
It is a bounded falsifier for a simpler claim: > if GraphRAG-style seeded expansion is already useful on a real corpus, it > should show up even on the natural `file -> chunk` expansion shape The first stable local point used: - `40` files - `747` chunk rows - `6` real CrossFile questions - `384D` - `ann_k=16` - `top_k=4` - `m=24` - `ef_construction=200` - `sorted_hnsw.ef_search=64` - `shared_buffers=64MB` - fresh backend - `3` timing runs Result: - direct ANN over raw chunks: - heap: `p50 0.740 ms` - `sorted_heap`: `p50 0.712 ms` - keyword coverage: `63.3%` - full-keyword hits: `33.3%` - file-seeded SQL expansion: - heap: `p50 0.516 ms` - `sorted_heap`: `p50 0.468 ms` - same `63.3%` keyword coverage - same `33.3%` full-keyword hits - `sorted_heap_expand_rerank()` helper: - `p50 0.665 ms` - same `63.3%` keyword coverage - same `33.3%` full-keyword hits The important conclusion is narrow but real: - the real code corpus branch is now reproducible inside this repository - seeded expansion by file preserves answer-support quality on the real CrossFile question set - on this code corpus, the current gain is **latency**, not answer quality - the helper is not yet the latency leader on this tiny real corpus; the simple SQL expansion shape still wins locally This means the next code-corpus GraphRAG step is not "invent a bigger graph API". It is either: - a richer real code-graph relation hypothesis than plain `file -> chunk`, or - a lower-overhead helper path for this very simple expansion contract ### Real `require`-graph falsifier The obvious next hypothesis was that plain `file -> chunk` was too weak, and that the real local code graph should help once actual `require` edges were present. That hypothesis is now tested in the same harness: - `53` local `require` edges derived from the real `cogniformerus` source tree - relation `REQUIRES_FILE` - two new query shapes: - `seed_require_twohop_*` - `seed_file_plus_require_in` Stable local result on the same `40`-file / `800`-row / `6`-question point, `3` runs: - plain file-seeded expansion: - `sorted_heap`: `0.471 ms` - keyword coverage: `63.3%` - full hits: `33.3%` - file plus required files: - `sorted_heap`: `0.605 ms` - same `63.3%` keyword coverage - same `33.3%` full hits - dependency-only two-hop: - `sorted_heap`: `0.391 ms` - keyword coverage: `20.0%` - full hits: `0.0%` So the richer real relation hypothesis is currently **refuted** on this code corpus: - adding dependency files does not improve answer-support quality - dependency-only traversal is actively worse because it drops own-file context - unioning own files with required files only adds cost, not quality This is a useful stopping point. The next likely win for real code-GraphRAG is not "just add more code edges". It is a different retrieval contract or a lower-overhead helper path on the already-good file-seeded shape. ### File-summary seed falsifier The next retrieval-contract hypothesis was also tested locally on the same real code corpus: - add one synthetic-but-data-derived summary row per file - seed on those summary rows - then expand back to the file's chunk rows The goal was to test whether the missing factor was simply that chunk-level ANN was a poor way to choose files. That also failed to improve answer-support quality. 
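The shape under test is easy to sketch; the relation codes below are illustrative stand-ins for the harness's actual constants:

```sql
-- Hedged sketch: seed ANN on the per-file summary rows, then return the
-- selected files' chunk rows. Relation codes (1 = HAS_CHUNK,
-- 2 = FILE_SUMMARY) are illustrative, not the harness's real values.
WITH summary_seeds AS (
    SELECT entity_id
    FROM facts_sh
    WHERE relation_id = 2
    ORDER BY embedding <=> :query_vec
    LIMIT 16                              -- ann_k
)
SELECT f.target_id, f.payload
FROM facts_sh AS f
WHERE f.entity_id IN (SELECT entity_id FROM summary_seeds)
  AND f.relation_id = 1
ORDER BY f.embedding <=> :query_vec
LIMIT 4;                                  -- top_k
```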
Stable smoke result on the same `40`-file / `840`-row / `6`-question point:

- summary-seeded expansion:
  - heap: `0.587 ms`
  - `sorted_heap`: `0.564 ms`
  - keyword coverage: `63.3%`
  - full hits: `33.3%`

So the current real code-corpus plateau is now bounded more tightly:

- plain file-seeded expansion: same quality, lower latency
- file summaries: same quality, higher latency
- require edges: no quality gain
- require-only traversal: quality regression

That strongly suggests the next code-corpus GraphRAG branch should not be "more local graph structure" or "better file seeds" in the same lexical setup. The remaining frontier is more likely one of:

- a different quality metric / question contract,
- better embeddings,
- or a lower-overhead execution path on the already-best file-seeded shape.

### Oracle-seed and oracle-rerank diagnostic

The next adversary question was sharper:

> is the plateau really about bad file seeds, or is it already downstream in
> the rerank / evaluation contract?

The harness now includes two explicit oracle diagnostics, plus one deployable lexical control, on the same real code corpus:

- **oracle file seeds**
  - choose seed files by benchmark-keyword overlap against the full file text
  - this is not a deployable retrieval contract; it is a diagnostic ceiling
- **prompt-derived lexical rerank**
  - keep the same ANN-derived file seeds
  - rerank by lexical overlap with terms extracted from the actual user prompt
  - this is deployable in principle, but much weaker than the oracle signal
- **oracle keyword rerank**
  - keep the same ANN-derived file seeds
  - rerank the expanded chunk rows by direct overlap with the benchmark's gold CrossFile keywords before falling back to embedding distance

Stable local result, `3` runs, same `40`-file / `840`-row / `6`-question point:

- plain file-seeded expansion:
  - `sorted_heap`: `0.443 ms`
  - keyword coverage: `63.3%`
  - full hits: `33.3%`
- oracle file seeds:
  - `sorted_heap`: `0.416 ms`
  - same `63.3%` keyword coverage
  - same `33.3%` full hits
- prompt-derived lexical rerank:
  - `sorted_heap`: `3.005 ms`
  - same `63.3%` keyword coverage
  - worse `16.7%` full hits
- oracle keyword rerank:
  - heap: `2.905 ms`
  - `sorted_heap`: `2.944 ms`
  - keyword coverage: `90.0%`
  - full hits: `66.7%`

This is a useful but narrow falsifier:

- the plateau is **not** explained by weak file seeds alone
- richer local graph structure also did not explain it
- a simple prompt-term rerank at `top_k=4` also did not explain it
- but once the rerank contract is allowed to use the benchmark's own gold keywords, quality jumps sharply

That does **not** justify a product claim, because the oracle rerank is using the same keyword signal that the benchmark later scores. It does justify a more targeted next hypothesis:

> the remaining quality frontier on the real code corpus is more likely in the
> query/rerank contract or embedding space than in local graph topology or seed
> selection

### Result-budget and packing diagnostic

The broad "cheap lexical hybrid does not help" claim turned out to be too strong once the same real code-corpus harness was rerun at larger result budgets.
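For reference, the prompt-derived lexical rerank rows in the sweep below follow a scoring shape roughly like this sketch; the term list and the `expanded_chunks` relation are illustrative placeholders:

```sql
-- Hedged sketch: count prompt-term substring hits per expanded row, then
-- fall back to embedding distance. 'overlap' / 'window' stand in for the
-- terms the harness extracts from the real prompt.
SELECT target_id, payload
FROM expanded_chunks
ORDER BY
    ((payload ILIKE '%overlap%')::int
   + (payload ILIKE '%window%')::int) DESC,
    embedding <=> :query_vec
LIMIT 4;
```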
Bounded local sweep, same `40`-file / `840`-row / `6`-question corpus, `ann_k=16`, `3` runs: - plain file-seeded `sorted_heap` expansion: - `top_k=4`: `0.402 ms`, `63.3%` keyword coverage, `33.3%` full hits - `top_k=8`: `0.460 ms`, `68.1%` keyword coverage, same `33.3%` full hits - `top_k=16`: `0.469 ms`, `84.3%` keyword coverage, same `33.3%` full hits - `top_k=32`: `0.449 ms`, `94.3%` keyword coverage, `66.7%` full hits - prompt-derived lexical rerank: - `top_k=4`: `3.005 ms`, `63.3%`, `16.7%` - `top_k=8`: `3.176 ms`, `86.7%`, `50.0%` - `top_k=12`: `3.149 ms`, `90.0%`, `66.7%` - `top_k=32`: `3.147 ms`, `96.7%`, `83.3%` So the real code-corpus plateau is not just a seed-quality problem. It is also partly a **result-budget / packing** problem: - with more rows, even the plain file-seeded path recovers much more keyword coverage - prompt-derived lexical rerank starts to help only once the row budget is not extremely tight That makes the next bounded hypothesis more specific: > the remaining small-`top_k` gap is likely about how evidence is packed into a > tiny chunk budget, not about choosing better files One more diagnostic supports that narrower claim. On the original `top_k=4` point, a diversity-aware prompt-term rerank was also tested: - `sorted_heap` prompt-diverse rerank: - `3.229 ms` - `76.7%` keyword coverage - still only `33.3%` full hits That is a partial gain in coverage, but still not the qualitative jump needed to make the current small-budget contract compelling. ### Code-aware embedding diagnostic The next bounded hypothesis was exactly what the code corpus suggests: > maybe the remaining gap is not just about rerank logic, but about the fact > that the current harness still uses a Gutenberg-style lexical tokenizer that > does not understand `CamelCase` or `_snake_case` identifiers well The harness now supports two embedding modes: - `generic` - existing lexical hash over generic text tokens - `code_aware` - keeps the full code token, but also splits identifiers on `_` and `CamelCase` before hashing Stable local comparison on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - plain file-seeded `sorted_heap` expansion: - `generic`: `0.450 ms`, `63.3%` keyword coverage, `33.3%` full hits - `code_aware`: `0.427 ms`, `61.4%` keyword coverage, `16.7%` full hits - prompt-diverse rerank: - `generic`: `3.178 ms`, `76.7%`, `33.3%` - `code_aware`: `3.351 ms`, `76.7%`, `50.0%` - oracle keyword rerank: - `generic`: `2.672 ms`, `90.0%`, `66.7%` - `code_aware`: `2.435 ms`, `96.7%`, `83.3%` This is another mixed but useful falsifier: - code-aware tokenization is **not** a free win by itself - plain ANN + file expansion actually got slightly worse - but once combined with a diversity-aware rerank, the same code-aware mode did improve the small-budget `full_pct` So the current code-corpus frontier is now even narrower: > the next likely win is not "better seeds" or "more edges", but a tighter > coupling between code-aware embeddings and a smarter small-budget rerank / > packing contract ### Summary-output packing win The next bounded hypothesis was the most direct one implied by the previous diagnostics: > if the real bottleneck is small-budget packing, then maybe raw chunks are > simply the wrong final output unit for this code benchmark The harness already materializes one summary row per file. The new test keeps the same ANN-derived file seeds, but returns file summaries as the final output rows instead of raw chunks. 
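As a sketch, only the output filter changes relative to the chunk-output query; the relation code is again an illustrative assumption:

```sql
-- Same ANN-derived file seeds; the final rows switch from chunk rows to one
-- summary row per file. Relation code 2 is an illustrative FILE_SUMMARY id.
SELECT target_id, payload
FROM facts_sh
WHERE entity_id = ANY(:seed_file_ids)
  AND relation_id = 2
ORDER BY embedding <=> :query_vec
LIMIT 4;
```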
Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - chunk output (`seed_file_expand_in`, `sorted_heap`): - `0.418 ms` - `63.3%` keyword coverage - `33.3%` full hits - summary output (`seed_file_summary_output_in`, `sorted_heap`): - `0.200 ms` - `71.0%` keyword coverage - `33.3%` full hits - prompt summary rerank (`prompt_summary_rerank_in`, `sorted_heap`): - `0.318 ms` - `73.3%` keyword coverage - `50.0%` full hits - code-aware embedding mode: - chunk output: - `0.418 ms` - `61.4%` - `16.7%` - summary output: - `0.207 ms` - `77.6%` - `33.3%` - prompt summary rerank: - `0.426 ms` - `77.6%` - `33.3%` This is the first clean small-budget win on the real code corpus: - summary rows are a better packing unit than raw chunks at `top_k=4` - they improve coverage while also reducing latency - in the generic mode, prompt-aware reranking over summaries also improves `full_pct` So the current strongest product-facing hypothesis is no longer "better seeds" or "more graph edges". It is: > for real code GraphRAG, file summaries are a stronger final output unit than > raw chunks when the answer budget is tiny ### Summary rows as seed unit The next narrow question was whether summaries are only a better **output** unit, or also a better **seed** unit. That was tested by forcing the ANN seed step to rank only `REL_FILE_SUMMARY` rows and then keeping the final result set on summaries as well. Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - summary output from mixed ANN seeds (`seed_file_summary_output_in`, `sorted_heap`): - `0.199 ms` - `71.0%` keyword coverage - `33.3%` full hits - summary output from summary-only seeds (`summary_seed_summary_output_in`, `sorted_heap`): - `0.116 ms` - `77.6%` keyword coverage - `33.3%` full hits - prompt summary rerank from mixed seeds: - `0.329 ms` - `73.3%` - `50.0%` - prompt summary rerank from summary-only seeds: - `0.541 ms` - `74.3%` - `33.3%` - code-aware embedding mode: - mixed-seed summary output: - `0.193 ms` - `77.6%` - `33.3%` - summary-only seed summary output: - `0.112 ms` - `64.3%` - `33.3%` So the current tiny-budget frontier is now split into two clear points: - **fastest coverage point** on this corpus: - generic embedding mode - summary-only seeds - summary output - **best full-hit point** on this corpus: - generic embedding mode - mixed ANN seeds - prompt-aware summary rerank And one more falsifier is now clear: > summary rows are not universally a better seed unit; the benefit depends on > the embedding mode and the final scoring contract ### Summary-plus-chunk hybrid output The next bounded question was whether the best tiny-budget contract should stay purely on summaries, or whether a hybrid output can do better: > use summaries to choose the right files, but also emit one best chunk from > each selected file so the final answer set contains both compressed context > and one concrete code span That was tested in two variants: - mixed ANN seeds -> summary ranking -> one best chunk per selected file - summary-only seeds -> summary ranking -> one best chunk per selected file Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - best prior full-hit point: mixed-seed prompt summary rerank - `0.363 ms` - `73.3%` - `50.0%` - mixed-seed summary+chunk hybrid: - `1.481 ms` - `84.3%` - `33.3%` 
- summary-seeded summary+chunk hybrid: - `1.627 ms` - `78.1%` - `50.0%` - code-aware embedding mode: - best prior summary-only point: - prompt summary rerank - `0.372 ms` - `77.6%` - `33.3%` - mixed-seed summary+chunk hybrid: - `1.616 ms` - `84.3%` - `50.0%` - summary-seeded summary+chunk hybrid: - `1.688 ms` - `77.6%` - `33.3%` So this branch narrows the frontier again: - hybrid output is **not** a universal improvement - for the generic mode, pure summary rerank remains the better tiny-budget full-hit point - for the code-aware mode, mixed-seed summary+chunk hybrid is the first path that reaches `50.0%` full hits at `top_k=4` That means the current strongest small-budget choices are now split: - **generic mode**: - summaries-only remain the better contract - **code-aware mode**: - hybrid summary+chunk output is now the better contract ### Fixed-ratio hybrid packing The previous hybrid branch still left one obvious ambiguity: > was the hybrid result about having both summaries and chunks at all, or just > about how many summary slots the tiny `top_k=4` budget reserved? That was tested with two fixed-ratio mixed-seed hybrids: - **summary-light**: `1` summary + `3` chunk slots - **summary-heavy**: `3` summary slots + `1` chunk slot Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - prior best full-hit point: - prompt summary rerank - `0.337 ms` - `73.3%` - `50.0%` - prior balanced hybrid: - `1.490 ms` - `84.3%` - `33.3%` - summary-light hybrid: - `1.753 ms` - `80.0%` - `33.3%` - summary-heavy hybrid: - `1.057 ms` - `86.7%` - `50.0%` - code-aware embedding mode: - prior best point: - balanced hybrid - `1.566 ms` - `84.3%` - `50.0%` - summary-light hybrid: - `2.246 ms` - `68.1%` - `33.3%` - summary-heavy hybrid: - `0.879 ms` - `84.3%` - `50.0%` This resolves the remaining hybrid ambiguity: - the hybrid win is **not** about chunks in general - it is specifically about reserving a small number of chunk slots while keeping the budget summary-heavy So the refined tiny-budget frontier is now: - **generic mode**: - best latency/full-hit tradeoff: pure prompt summary rerank - best coverage at the same full-hit level: summary-heavy hybrid - **code-aware mode**: - summary-heavy hybrid is now the strongest point ### Summary-heavy hybrid with summary-only seeds The remaining seed question after the fixed-ratio result was very narrow: > if the winning hybrid is already summary-heavy, should its seed unit also be > switched fully to summaries? That was tested directly against the current summary-heavy mixed-seed hybrid. 
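For reference, the summary-heavy split packs the `top_k=4` budget roughly like this sketch, using the winning `3 + 1` slot ratio; relation codes are illustrative:

```sql
-- Hedged sketch of summary-heavy packing: reserve 3 of the 4 result slots
-- for summary rows and 1 slot for the single best chunk row.
(SELECT target_id, payload
 FROM facts_sh
 WHERE entity_id = ANY(:seed_file_ids) AND relation_id = 2   -- summaries
 ORDER BY embedding <=> :query_vec
 LIMIT 3)
UNION ALL
(SELECT target_id, payload
 FROM facts_sh
 WHERE entity_id = ANY(:seed_file_ids) AND relation_id = 1   -- best chunk
 ORDER BY embedding <=> :query_vec
 LIMIT 1);
```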
Stable local result on the same real `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - generic embedding mode: - prompt summary rerank: - `0.395 ms` - `73.3%` - `50.0%` - mixed-seed summary-heavy hybrid: - `1.062 ms` - `86.7%` - `50.0%` - summary-seeded summary-heavy hybrid: - `1.175 ms` - `87.6%` - `50.0%` - code-aware embedding mode: - prompt summary rerank: - `0.390 ms` - `77.6%` - `33.3%` - mixed-seed summary-heavy hybrid: - `0.965 ms` - `84.3%` - `50.0%` - summary-seeded summary-heavy hybrid: - `0.981 ms` - `77.6%` - `33.3%` This closes the seed-unit branch for the current frontier: - **generic mode**: - summary-only seeds can squeeze out a tiny extra coverage gain, but they do not improve full hits and they cost more latency than the mixed-seed summary-heavy hybrid - **code-aware mode**: - summary-only seeds are clearly worse; the mixed-seed summary-heavy hybrid remains the strongest point ### Per-question failure pattern Aggregate percentages were no longer enough to guide the next branch, so the real code-corpus harness now supports targeted diagnostics: - `--case-filter` - `--report-questions` That was used to inspect the current best generic and code-aware contracts on the exact CrossFile prompts from `butler_code_test.cr`. Stable local diagnostic, same `40`-file / `840`-row / `6`-question point, `ann_k=16`, `top_k=4`, `3` runs: - **generic mode** - best latency/full-hit point: - `prompt_summary_rerank_in` - best coverage/full-hit point: - `prompt_summary_chunk_hybrid_s3_in` - **code-aware mode** - best point: - `prompt_summary_chunk_hybrid_s3_in` The important result is not just the percentages, but **which** questions stay hard: - `Response memory policy` - still misses under all current best contracts - current quality stays around `40.0%` - `Streaming overlap` - still misses under all current best contracts - current quality stays around `80.0%` - `Butler response routing` - generic contracts still miss it - code-aware summary-heavy hybrid fixes it to `100.0%` - `Memory store flow` - generic best contracts already solve it - code-aware summary-heavy hybrid still leaves it at `85.7%` This narrows the remaining frontier again: > the next real improvement is likely query-specific or corpus-specific, not a > broad packing or seed policy that helps every question equally The new payload diagnostics make that even more concrete: - `Response memory policy` - current best contracts already pull the **right file neighborhood**: - `memory/hierarchical.cr` - `memory/pgvector.cr` - `memory/external_store.cr` - `butler/persona.cr` - the summary-heavy hybrid even surfaces the `_micro_only` chunk from `memory/hierarchical.cr` - but the remaining miss is about **policy nuance**, not file choice: the returned rows still do not cover the full combination of `_micro_only`, refusal/pollution behavior, and external-storage policy - `Streaming overlap` - current best contracts already pull the correct file: `streaming/controller.cr` - both summary and chunk rows surface the overlap/chunking topic - the remaining miss is about **exact constants / same-file granularity**: the query still does not close the final `1500` / `100` coverage gap So for these two stubborn real prompts, the problem has narrowed from "retrieval picked the wrong files" to a much smaller statement: > the current system is usually choosing the right file region, but not yet the > exact evidence fragment or policy detail needed to close the benchmark ### Same-file local chunk refinement does not 
rescue the hard prompts The next bounded hypothesis was: > if the right file is already selected, maybe the fix is simply to give the > best file two nearby chunks instead of one That was tested with a new `prompt_summary_chunk_local2_in` case: - keep the summary-heavy contract - keep mixed ANN seeds - replace the single best chunk from the top file with a 2-chunk local window around the best chunk anchor It did **not** help. Targeted hard-prompt rerun (`Response memory policy` + `Streaming overlap`, `ann_k=16`, `top_k=4`, fresh backend): - generic mode: - existing summary-heavy hybrid: - `70.0%` - `0.945-1.050 ms` - local 2-chunk refinement: - `70.0%` - `1.660-1.729 ms` - code-aware mode: - existing summary-heavy hybrid: - `60.0%` - `1.041-1.078 ms` - local 2-chunk refinement: - `60.0%` - `1.689-1.733 ms` Bounded all-question rerun (`40` files, `840` rows, `6` real questions, `3` runs): - generic mode: - existing summary-heavy hybrid: - `0.988 ms` - `86.7%` - `50.0%` - local 2-chunk refinement: - `1.572 ms` - `84.3%` - `33.3%` - code-aware mode: - existing summary-heavy hybrid: - `0.979 ms` - `84.3%` - `50.0%` - local 2-chunk refinement: - `1.523 ms` - `84.3%` - `50.0%` So the next frontier is narrower again: > the missing quality is not solved by a simple "take one more nearby chunk" > policy; the remaining problem is finer-grained evidence choice, not just a > larger same-file window ### Semantic chunk selection is a generic-mode win, but not a universal one The next bounded question was different from the failed local-window branch: > maybe the chunk budget is fine, and the real problem is that the last chunk is > being picked with the wrong scoring rule That was tested with `prompt_summary_chunk_semantic_s3_in`: - keep the current summary-heavy mixed-seed contract - keep the same `3 summaries + 1 chunk` budget - change only the final chunk selection: - old path: lexical-first within the top file - new path: semantic-distance-first within the top file Hard-prompt rerun (`Response memory policy` + `Streaming overlap`, fresh backend, `ann_k=16`, `top_k=4`): - generic mode: - old summary-heavy hybrid: - `70.0%` - `0.992 ms` - semantic chunk selection: - `70.0%` - `0.481 ms` - code-aware mode: - old summary-heavy hybrid: - `60.0%` - `1.035 ms` - semantic chunk selection: - `60.0%` - `0.429 ms` Full `6`-question rerun (`40` files, `840` rows, `3` runs): - generic mode: - old summary-heavy hybrid: - `0.976 ms` - `86.7%` - `50.0%` - semantic chunk selection: - `0.453 ms` - `86.7%` - `50.0%` - code-aware mode: - old summary-heavy hybrid: - `0.975 ms` - `84.3%` - `50.0%` - semantic chunk selection: - `0.474 ms` - `77.6%` - `33.3%` This creates a new mode-specific frontier: - **generic mode** - `prompt_summary_chunk_semantic_s3_in` is now the stronger coverage-preserving hybrid - it keeps the same aggregate quality as the old summary-heavy hybrid while cutting latency by roughly half - **code-aware mode** - the same semantic swap is not acceptable - it buys latency, but loses both coverage and full hits So the next branch should treat the two embedding modes separately instead of assuming one chunk-selection rule can dominate both. ### Prompt-focused file-local snippet extraction The next successful branch stopped changing retrieval at all. Instead of asking the SQL layer to return better rows, it asked a narrower question: > if `prompt_summary_rerank_in` already selects the right files, can we extract > better evidence fragments from those files after retrieval? 
That is now implemented in the real code-corpus harness as: - `prompt_summary_snippet_py` Contract: - keep the existing `prompt_summary_rerank_in` SQL seed/output path - for each returned summary row, resolve the underlying source file - extract a prompt-focused snippet from the full file using: - prompt-term matching against code-aware line tokens - coverage-greedy anchor selection with method-definition tie-breaks - Crystal method-body expansion instead of fixed-radius windows for selected `def` anchors - adjacent helper-method merge for short `?` helpers referenced by the selected method body - nearby config-initializer merge for short ivar-based helpers, so snippets keep concrete defaults like `window_size=1500` / `overlap=100` - append the snippet to the original summary payload instead of replacing the summary row - cache `(file, prompt)` snippets in-process so repeated runs are measured in both cold and warm regimes This is a **downstream evidence-selection layer**, not a new PostgreSQL query primitive. The main value is answer quality on the real `cogniformerus` CrossFile benchmark. Verified local result on the stable real code-corpus point (`40` files, `840` rows, `6` questions, `384D`, `ann_k=16`, `top_k=4`, fresh backend): - generic embedding mode: - `prompt_summary_rerank_in`: - `p50 0.343-0.392 ms` - `73.3%` keyword coverage - `50.0%` full hits - `prompt_summary_snippet_py`: - warm-cache `p50 0.551-0.698 ms` - cold first-pass `p50 15.316 ms`, `avg 15.435 ms` - `100.0%` keyword coverage - `100.0%` full hits - code-aware embedding mode: - `prompt_summary_rerank_in`: - `p50 0.395-0.398 ms` - `77.6%` - `33.3%` - `prompt_summary_snippet_py`: - warm-cache `p50 0.623 ms` - `97.6%` - `83.3%` - `prompt_symbol_summary_snippet_py`: - warm-cache `p50 0.970-0.989 ms` - `100.0%` - `100.0%` Per-question generic rerun on the same corpus shows what the snippet layer actually fixed: - now solved at `100.0%`: - `Butler response routing` - `Memory store flow` - `Two-stage answering` - `NLU hybrid classification` - `Response memory policy` - `Streaming overlap` Per-question code-aware rerun with `prompt_symbol_summary_snippet_py` now also solves the full set at `100.0%`, including the old remaining miss: - `Memory store flow` Interpretation: - the remaining plateau on this real code corpus was **not** primarily file retrieval - it was a file-local evidence selection problem - the last code-aware miss turned out to be a **seed-ranking problem inside the summary path**, not a snippet-window problem - `HierarchicalMemory` was already present in the summary row for `memory/hierarchical.cr` - the fix was a bounded symbol-aware variant, `prompt_symbol_summary_snippet_py`, which: - extracts exact prompt symbols like `HierarchicalMemory`, `TwoStageAnswerer`, `DialogueNLU` - unions a tiny exact-symbol summary seed set with the existing ANN seeds - ranks summary rows by `symbol_hits` before the older prompt-term score - the strongest fix was not "wider windows" - it was preserving summary rows while adding code-structured snippets underneath them - prompt-focused snippet extraction is the first branch that moves the real code-corpus benchmark from `50.0%` to `100.0%` full hits at the same tiny-budget `top_k=4` - the current frontier is now split by embedding mode: - generic: `prompt_summary_snippet_py` remains the better latency point - code-aware: `prompt_symbol_summary_snippet_py` is the quality winner Important caveat: - the warm numbers rely on an in-process `(file, prompt)` snippet cache - the cold 
first-pass cost is still materially higher than pure SQL rerank - so this is a quality-oriented contract, not a free latency win - the symbol-aware variant is not a generic improvement: - in generic mode it gives no quality lift and only adds cost That code-corpus frontier is now also checked under a repeated-build protocol: - [`scripts/repeat_graph_rag_code_corpus_builds.py`](../scripts/repeat_graph_rag_code_corpus_builds.py) - `3` independent fresh temp-cluster builds - local `facts_sh` only, same stable point: - `384D` - `ann_k=16` - `top_k=4` - `ef_search=64` - `ef_construction=200` - `m=24` - fresh backend Verified repeated-build result: - generic: - `prompt_summary_snippet_py` - `p50 median 0.613 ms`, range `0.543-0.632 ms` - stable `100.0% / 100.0%` - `prompt_symbol_summary_snippet_py` - `p50 median 0.986 ms`, range `0.932-1.047 ms` - same `100.0% / 100.0%` - therefore strictly slower on the generic frontier - code-aware: - `prompt_summary_snippet_py` - `p50 median 0.612 ms`, range `0.602-0.629 ms` - stable `97.6% / 83.3%` - `prompt_symbol_summary_snippet_py` - `p50 median 0.963 ms`, range `0.928-1.022 ms` - stable `100.0% / 100.0%` Interpretation: - the new symbol-aware code-aware win is **build-stable**, not a one-off lucky HNSW construction - the generic frontier is also build-stable, and the symbol-aware case remains dominated there That same repeated-build protocol was then rerun on an AWS ARM64 host (`4 vCPU`, `8 GiB RAM`) using: - [`scripts/repeat_graph_rag_code_corpus_builds_aws.sh`](../scripts/repeat_graph_rag_code_corpus_builds_aws.sh) - the same `3` fresh builds - the same minimal synced `cogniformerus` source tree and `butler_code_test.cr` prompt set Verified AWS repeated-build result: - generic: - `prompt_summary_snippet_py` - `p50 median 0.955 ms`, range `0.954-0.960 ms` - stable `100.0% / 100.0%` - `prompt_symbol_summary_snippet_py` - `p50 median 1.485 ms`, range `1.473-1.487 ms` - same `100.0% / 100.0%` - still strictly slower on the generic frontier - code-aware: - `prompt_summary_snippet_py` - `p50 median 1.008 ms`, range `1.008-1.009 ms` - stable `97.6% / 83.3%` - `prompt_symbol_summary_snippet_py` - `p50 median 1.541 ms`, range `1.537-1.557 ms` - stable `100.0% / 100.0%` So the code-aware split is now **cross-environment verified**: - generic keeps the older snippet contract - code-aware keeps the symbol-aware snippet contract - the change in winner is not a local Apple-only artifact ## Larger in-repo `cogniformerus` transfer gate The previous repeated-build result used the smaller synced `cogniformerus/src/cogniformerus` slice (`40` files, `840` rows after summary + chunk expansion). That was a good stable benchmark, but it was still fair to ask whether the contract would survive a materially larger in-repo code corpus. 
The next bounded adversary check therefore reran the same repeated-build protocol on the full `cogniformerus` repository:

- source tree: `~/Projects/Crystal/cogniformerus`
- file count: `183` Crystal files
- prompt set: the same real `butler_code_test.cr` CrossFile prompts
- same ANN knobs:
  - `384D`
  - `ann_k=16`
  - `ef_search=64`
  - `ef_construction=200`
  - `m=24`

The old tiny-budget point (`top_k=4`) did **not** transfer cleanly:

- generic `prompt_summary_snippet_py`
  - `p50 0.770 ms`
  - `87.1%` keyword coverage
  - `66.7%` full hits
  - `avg_rows 3.67`
- code-aware `prompt_symbol_summary_snippet_py`
  - `p50 1.824 ms`
  - `87.6%` keyword coverage
  - `66.7%` full hits
  - `avg_rows 4.00`

That is a real transfer gap, but it is **not** the same kind of failure as the external `folding/src` miss (covered in the external folding corpus check below). The next bounded hypothesis was simply to raise the final result budget while keeping the same seed contract and the same winner cases.

At `top_k=8`, `3` fresh builds gave:

- generic `prompt_summary_snippet_py`
  - `p50 median 0.819 ms`, range `0.794-0.855 ms`
  - stable `100.0% / 100.0%`
  - `avg_rows 6.33`
- code-aware `prompt_symbol_summary_snippet_py`
  - `p50 median 1.814 ms`, range `1.669-2.101 ms`
  - stable `100.0% / 100.0%`
  - `avg_rows 7.50`

So the larger in-repo Crystal-side transfer gate is now verified. The honest correction is:

- the current real code-corpus winners are **not** universal at the old `top_k=4` budget
- on the full in-repo corpus, they need a slightly larger final result budget
- once that budget moves to `top_k=8`, the current winners recover perfectly without needing a new seed or snippet contract

That narrows the remaining `0.13` real-corpus gap further:

- `~/Projects/Crystal` now has both the small stable slice and a larger full-repo transfer gate
- the next unverified generalization work was the mixed-language / archive side (`~/Projects/C`, `~/SrcArchives`)

## Mixed-language `~/Projects/C` adversary gate (`pycdc`)

The next release-hardening branch widened the code-corpus harness itself:

- JSON question fixtures are now supported
- source discovery is no longer hardcoded to `*.cr`
- local dependency edges now also understand quoted C/C++ includes: `#include "..."` -> `REQUIRES_FILE`

That made it possible to run the same narrow code-GraphRAG benchmark shape on a real mixed-language corpus under `~/Projects/C` without inventing a separate harness family. The first such corpus was `pycdc`:

- source tree: `~/Projects/C/pycdc`
- fixture: `scripts/fixtures/graph_rag_pycdc_questions.json`
- source extensions:
  - `.h`
  - `.cpp`
  - `.txt`
  - `.markdown`
- corpus size:
  - `138` files
  - `1281` rows after summary + chunk expansion
  - `72` local dependency edges from quoted includes

The first smoke run already gave the key split:

- generic `prompt_summary_snippet_py`
  - `75.0%` keyword coverage
  - `40.0%` full hits
- generic `prompt_symbol_summary_snippet_py`
  - `90.0%`
  - `60.0%`
- code-aware `prompt_summary_snippet_py`
  - `70.0%`
  - `60.0%`
- code-aware `prompt_compactseed_require_summary_snippet_fn`
  - `100.0%`
  - `100.0%`

That already falsified the lazy story that mixed-language transfer would look just like the Crystal corpora with only file-summary rerank. On `pycdc`, the include-aware rescue path matters much more.
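The include-edge derivation itself is a one-regexp pass; a hedged sketch against a hypothetical `source_lines(file_id, line)` staging table, which is not a harness artifact:

```sql
-- Hedged sketch: derive REQUIRES_FILE edge candidates from quoted includes.
-- Each capture group value is the quoted path that names the required file.
SELECT s.file_id,
       m.captured[1] AS required_path
FROM source_lines AS s,
     LATERAL regexp_matches(s.line, '#include\s+"([^"]+)"') AS m(captured);
```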
Repeated-build verification at `top_k=8`, `3` fresh builds, then gave: - generic `prompt_symbol_summary_snippet_py` - `p50 median 0.850 ms`, range `0.825-1.118 ms` - stable `90.0% / 60.0%` - `avg_rows 6.40` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.006 ms`, range `7.799-8.136 ms` - stable `100.0% / 100.0%` - `avg_rows 5.80` So the first real `~/Projects/C` gate is now covered, but it does **not** produce the same frontier as the Crystal corpora: - there is no equally cheap generic `100.0% / 100.0%` point here - the quality-complete point currently needs the slower helper-backed compact lexical seed + include rescue That still narrows the `0.13` release gap meaningfully: - `~/Projects/Crystal` is covered - `~/Projects/C` is covered - the remaining unverified archive-side gate is now `~/SrcArchives` ## Archive-side `~/SrcArchives` gate (`ninja/src`) The last remaining real-corpus gap named in the `0.13` plan was the archive side under `~/SrcArchives`. The new mixed-language harness path made it possible to cover that without another code change, so the next adversary corpus was: - source tree: `~/SrcArchives/apple/ninja/src` - fixture: `scripts/fixtures/graph_rag_ninja_questions.json` - source extensions: - `.h` - `.cc` - corpus size: - `103` files - `1757` rows after summary + chunk expansion - `282` local dependency edges from quoted includes The first smoke at the current default-ish budget (`top_k=8`) already gave a useful signal: - generic `prompt_summary_snippet_py` - `95.0%` keyword coverage - `80.0%` full hits - code-aware `prompt_summary_snippet_py` - `85.0%` - `80.0%` That differed from `pycdc` in an important way: - the archive corpus was already close on the plain generic path - the code-aware path was not stronger here - there was no immediate evidence that a dependency-rescue branch was needed The cheapest falsifier was therefore not a new query contract, but just a small increase in the final result budget. 
At `top_k=12`: - generic `prompt_summary_snippet_py` - `100.0% / 100.0%` - `p50 0.996 ms` on the first smoke - code-aware `prompt_summary_snippet_py` - stayed at `85.0% / 80.0%` Repeated-build verification (`3` fresh builds) then confirmed the archive-side winner: - generic `prompt_summary_snippet_py` - `p50 median 0.914 ms`, range `0.827-0.921 ms` - stable `100.0% / 100.0%` - `avg_rows 7.80` - code-aware `prompt_summary_snippet_py` - `p50 median 0.871 ms`, range `0.848-0.901 ms` - stable `85.0% / 80.0%` - `avg_rows 7.60` So the archive-side gate is now covered, and the conclusion is pleasantly narrow: - `~/SrcArchives` does not require a new rescue contract for the first verified corpus - the simple generic summary-snippet path closes `ninja/src` - the only change needed versus the smaller code-corpus points was a small result-budget bump from `top_k=8` to `top_k=12` This means the `0.13` larger real-corpus verification matrix is now complete in the scoped sense the plan asked for: - `~/Projects/Crystal` - `~/Projects/C` - `~/SrcArchives` ## External folding corpus check The next adversary check was a second real code corpus outside this repository: - source tree: `folding/src` - prompt set: `butler_folding_test.cr` This surfaced one real harness bug first: - [`scripts/bench_graph_rag_code_corpus.py`](../scripts/bench_graph_rag_code_corpus.py) originally globbed `*.cr` paths without filtering `is_file()` - on the `folding` tree that accidentally picked up `.crystal-cache` directories ending in `.cr` - the harness now filters to real files only Once that was fixed, the external corpus produced a useful repeated-build result. Local `3`-build protocol on `facts_sh`, generic mode, same small-budget point (`384D`, `ann_k=16`, `top_k=4`, `ef_search=64`, `ef_construction=200`, `m=24`, fresh backend): - `prompt_summary_snippet_py` - `p50 median 1.048 ms`, range `0.913-4.141 ms` - quality drifted across fresh builds: `90.5-100.0%` keyword coverage, `83.3-100.0%` full hits - `prompt_lexseed_require_summary_snippet_fn` - the first non-oracle rescue to `100.0% / 100.0%` - but under a colder repeated-build protocol it turned out to be much more expensive than the earlier one-build numbers suggested: `p50 median 28.266 ms`, range `26.887-30.698` - `prompt_compactseed_require_summary_snippet_fn` - `p50 median 5.940 ms`, range `5.914-6.128` - stable `100.0% / 100.0%` - `oracle_prompt_summary_snippet_py` - on a bounded full rerun it also stayed at `100.0% / 100.0%`, but the non-oracle compact-seed rescue already matches that quality, so oracle seeds are no longer the interesting external-generic diagnostic Interpretation: - the old claim that generic external folding was already solved by `prompt_summary_snippet_py` was too strong - the generic baseline is now clearly less robust on this corpus than on the in-repo `cogniformerus` slice - the first full-summary lexical rescue proved that the external gap was solvable, but it was too expensive to be a real frontier - the stronger branch was a **different lexical-seed representation**: a compact per-file seed table built from file path terms, require-target terms, and deduplicated summary tokens - that compact-seed rescue still closes the quality gap to `100.0% / 100.0%`, but cuts the old full-summary lexical rescue by about `4.8x` locally An isolated timing split then narrowed where that penalty actually sits. 
On a fresh local `3`-run sweep of only the old full-summary helper-backed rescue: - generic `prompt_lexseed_require_summary_snippet_fn` - `avg fetch ms/query = 10.674` - `avg postprocess ms/query = 8.033` - `24` snippet-cache misses, `48` hits - `avg build time per miss = 6.010 ms` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `avg fetch ms/query = 11.016` - `avg postprocess ms/query = 7.742` - `24` snippet-cache misses, `48` hits - `avg build time per miss = 5.787 ms` So the external rescue is **not** primarily a snippet-extraction problem. Even on the isolated cold pass, the dominant term is still the lexical-seed + `REQUIRES_FILE` fetch path. Snippet generation is a real secondary tax on the first pass, but it is not where the largest win now sits. A kept-temp-cluster component probe narrowed that one step further. On the same external `folding/src` corpus: - `ann` alone was cheap: about `0.51 ms` median across the 6 real prompts - `lexical_seed` alone was the real dominant stage: about `9.34 ms` median - `rescue_require` landed at about `9.28 ms` median because it inherits the same lexical-seed cost - `rescue_lexical_require_summaries` was about `9.86 ms` median The summary rows explain why this stage is expensive: `REL_FILE_SUMMARY` payload length was `80 / 2078 / 5441` bytes at `min / median / max` on the external corpus. So the rescue is paying to run prompt-term substring scoring against multi-kilobyte summary payloads even before snippet extraction starts. The same external `folding/src` corpus also answered the code-aware question. At the same repeated-build point: - code-aware `prompt_summary_snippet_py` - `p50 median 1.080 ms`, range `1.048-1.146 ms` - stable `79.8% / 66.7%` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `p50 median 36.676 ms`, range `29.806-40.705` - stable `100.0% / 100.0%` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 5.804 ms`, range `5.776-6.510` - stable `100.0% / 100.0%` - code-aware `oracle_prompt_summary_snippet_py` - `p50 median 1.217 ms`, range `1.149-1.303 ms` - stable `100.0% / 100.0%` So the external folding split is now sharper: - both **generic** and **code-aware** external folding now have a verified non-oracle rescue to `100.0% / 100.0%` - the external problem really was a **seed-representation problem**, not a snippet extraction problem - the current external default is the compact-seed rescue, not the old full-summary lexical rescue - the old full-summary rescue is now useful mainly as a diagnostic anchor for why the compact representation matters - the honest conclusion is narrower: - external folding is no longer blocked by an unsolved quality gap - it still pays a quality/latency tax relative to the primary `cogniformerus` code corpus, but that tax is now much smaller than before That local result also transferred to AWS ARM64 (`4 vCPU`, `8 GiB RAM`) under a fresh `3`-build repeated-build protocol: - generic `prompt_summary_snippet_py` - `p50 median 1.540 ms`, range `1.535-1.604 ms` - stable `90.5% / 83.3%` - generic `prompt_lexseed_require_summary_snippet_fn` - `p50 median 41.960 ms`, range `41.747-42.081` - stable `100.0% / 100.0%` - generic `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.839 ms`, range `8.732-8.846` - stable `100.0% / 100.0%` - code-aware `prompt_summary_snippet_py` - `p50 median 1.775 ms`, range `1.729-1.836 ms` - stable `79.8% / 66.7%` - code-aware `prompt_lexseed_require_summary_snippet_fn` - `p50 median 60.413 ms`, range `60.298-60.660` - stable 
`100.0% / 100.0%` - code-aware `prompt_compactseed_require_summary_snippet_fn` - `p50 median 8.392 ms`, range `8.329-8.413` - stable `100.0% / 100.0%` So the compact-seed external rescue is now **cross-environment verified**, not a local artifact. The speedup over the old full-summary lexical rescue also survives the environment change: - generic: `41.960 ms -> 8.839 ms` - code-aware: `60.413 ms -> 8.392 ms` The external rescue is still slower than the primary in-repo winners, but it is no longer "full-quality only at tens of milliseconds". The next honest optimization target therefore changed. Cheap seed-budget cuts were already falsified (`ann_k < 16` and lexical-seed `LIMIT 1` both got worse), and the timing split shows that further work should focus on reducing the old full-summary lexical-seed cost. The compact lexical seed table already eliminated most of that cost, so the next branch is no longer "make lexical seeding viable at all"; it is whether the compact representation can be pushed closer to the primary in-repo code-corpus frontier. One obvious branch was also falsified directly: truncating lexical scoring to a summary prefix. On the external corpus: - `left(payload, 512)` dropped the rescue query to about `7.9 ms`, but quality fell back to `96.7% / 83.3%` - `left(payload, 1024)` restored `100.0% / 100.0%`, but it no longer sped the query up - the narrower threshold sweep (`640..992`) confirmed there was no useful middle ground: `992` bytes recovered `100.0% / 100.0%`, but was still slower than the full-payload rescue So a naive prefix cut is now a documented dead end. The remaining work is not "look at less text in the same way"; it needs a different lexical-seed representation or a different seed-selection contract altogether. ## March 26, 2026: `sorted_hnsw.shared_cache` GraphRAG branch A new bounded speed branch looked promising for fact-shaped GraphRAG: turning `sorted_hnsw.shared_cache` on for the ANN seed step. A direct local probe on a `2K x 384D` multihop graph reduced the path-aware wrapper from roughly `0.911 ms` total to `0.623 ms`, with most of the gain in the ANN stage. That did **not** survive the reliability gate. On the full local `5K`-pair, `64`-query multihop harness, keeping the same quality knobs (`ann_k=64`, `ef_search=128`, `ef_construction=200`, `m=24`) but switching only `sorted_hnsw.shared_cache` from `off` to `on` caused all `facts_sh` ANN-seeded rows to collapse to `0.0% / 0.0%`, while the `facts_heap` baseline stayed correct in the same run. The strongest evidence from this branch is: - the simple direct ANN seed query on `facts_sh` still returned the expected top rows with `shared_cache=on` - single-query GraphRAG probes could still look correct - the failure only showed up on the **full** same-session multihop harness, which points to a cache lifecycle / reuse bug rather than a general GraphRAG scoring bug So the current honest conclusion is narrow: - `sorted_hnsw.shared_cache = on` remains a **promising** performance branch for GraphRAG seed scans - it is **not** currently safe as the default GraphRAG benchmark or release operating point - the benchmark harnesses now expose a `--shared-cache on|off` switch, but the default stays `off` until this correctness issue is debugged and fixed
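Until then, the safe A/B probe shape is just a GUC flip around the standard ANN seed query; `:query_vec` is a placeholder for the bound query embedding:

```sql
-- Hedged sketch of the shared-cache A/B probe. The GUC names come from the
-- extension; keep shared_cache = off as the release operating point until
-- the cache lifecycle bug is fixed.
SET sorted_hnsw.shared_cache = off;  -- flip to on only for isolated probes
SET sorted_hnsw.ef_search = 128;

SELECT target_id
FROM facts_sh
ORDER BY embedding <=> :query_vec
LIMIT 64;                            -- ann_k
```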