---
layout: default
title: Benchmarks
nav_order: 6
---

# Benchmarks

Unless noted otherwise, benchmarks in this section run on PostgreSQL 18, Apple M-series (12 CPU, 64 GB RAM) with `shared_buffers=4GB`, `work_mem=256MB`, `maintenance_work_mem=2GB`. Warm cache, average of 5 runs.

---

## EXPLAIN ANALYZE -- query latency and buffer reads

### 1M rows (71 MB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.035 ms / 1 buf | 0.046 ms / 7 bufs | 15.2 ms / 6,370 bufs |
| Narrow (100 rows) | 0.043 ms / 2 bufs | 0.067 ms / 8 bufs | 16.2 ms / 6,370 bufs |
| Medium (5K rows) | 0.434 ms / 33 bufs | 0.492 ms / 52 bufs | 16.1 ms / 6,370 bufs |
| Wide (100K rows) | 7.5 ms / 638 bufs | 8.9 ms / 917 bufs | 17.4 ms / 6,370 bufs |

### 10M rows (714 MB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.034 ms / 1 buf | 0.047 ms / 7 bufs | 117.9 ms / 63,695 bufs |
| Narrow (100 rows) | 0.037 ms / 1 buf | 0.062 ms / 7 bufs | 130.9 ms / 63,695 bufs |
| Medium (5K rows) | 0.435 ms / 32 bufs | 0.549 ms / 51 bufs | 131.0 ms / 63,695 bufs |
| Wide (100K rows) | 7.6 ms / 638 bufs | 8.8 ms / 917 bufs | 131.4 ms / 63,695 bufs |

### 100M rows (7.8 GB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.045 ms / **1 buf** | 0.506 ms / 8 bufs | 1,190 ms / 519,906 bufs |
| Narrow (100 rows) | 0.166 ms / 2 bufs | 0.144 ms / 9 bufs | 1,325 ms / 520,782 bufs |
| Medium (5K rows) | 0.479 ms / 38 bufs | 0.812 ms / 58 bufs | 1,326 ms / 519,857 bufs |
| Wide (100K rows) | 7.9 ms / 737 bufs | 10.1 ms / 1,017 bufs | 1,405 ms / 518,896 bufs |

At 100M rows, a point query reads **1 buffer** (vs 8 for btree, 519,906 for sequential scan).

---

## pgbench throughput (TPS)

### Prepared mode (`-M prepared`)

Query planned once, re-executed with parameters. 10 s, 1 client.

| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|-------|----------------:|------------------:|-------------------:|
| Point | 46.9K / 59.4K | 46.5K / 58.0K | 32.6K / 43.6K |
| Narrow | 22.3K / 29.1K | 22.5K / 28.8K | 17.9K / 18.1K |
| Medium | 3.4K / 5.1K | 3.4K / 4.8K | 2.4K / 2.4K |
| Wide | 295 / 289 | 293 / 286 | 168 / 157 |

### Simple mode (`-M simple`)

Each query parsed, planned, and executed separately. 10 s, 1 client.

| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|-------|----------------:|------------------:|-------------------:|
| Point | 28.4K / 38.0K | 29.1K / 41.4K | **18.7K / 4.6K** |
| Narrow | 19.6K / 24.4K | 21.8K / 27.6K | **7.1K / 5.5K** |
| Medium | 3.1K / 3.7K | 3.4K / 4.8K | **2.1K / 1.6K** |
| Wide | 198 / 290 | 200 / 286 | **163 / 144** |

At 100M rows in simple mode, sorted_heap wins all query types. Point queries reach 18.7K TPS vs 4.6K for btree (4x).

---

## INSERT and compaction throughput

| Scale | sorted_heap | heap + btree | heap (no index) | compact time |
|------:|------------:|-----------:|--------------:|-----------:|
| 1M | 923K rows/s | 961K | 1.91M | 0.3 s |
| 10M | 908K rows/s | 901K | 1.65M | 3.1 s |
| 100M | 840K rows/s | 1.22M | 2.22M | 41.3 s |

### Storage

| Scale | sorted_heap | heap + btree | heap (no index) |
|------:|------------:|------------:|--------------:|
| 1M | 71 MB | 71 MB (50 + 21) | 50 MB |
| 10M | 714 MB | 712 MB (498 + 214) | 498 MB |
| 100M | 7.8 GB | 7.8 GB (5.7 + 2.1) | 5.7 GB |

sorted_heap keeps its data and zone map in a single relation, with no separate index to maintain.
The zone map replaces the btree index, and at the measured scales it costs about as much space as the btree it replaces, so total storage is comparable to heap + btree rather than heap-without-index.

---

## Vector search

### Current local synthetic benchmark (`sorted_hnsw` Index AM)

Repo-owned harnesses:

- `python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473`
- `REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh /path/to/repo /path/to/dump 65485`
- `scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96`
- `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64`
- `python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`
- `python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`

### Current AWS restored-corpus benchmark (`~104K x 2880D`, Gutenberg dump)

AWS ARM64 host (4 vCPU, 8 GiB RAM), top-10, restored PostgreSQL custom dump. Ground truth is recomputed by exact heap search on the restored `svec` table. In the current rerun the stored `bench_hnsw_gt` table matched the exact heap GT on 100% of the 50 benchmark queries, so the fresh exact heap GT and the historical GT table agree. This rerun uses `sorted_hnsw` `ef_construction=200` and `ef_search=32`, and the harness reconnects after build before timing ordered scans.

| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 458.762 ms | 100.0% | brute-force GT on restored corpus |
| **`sorted_hnsw` (`svec`)** | **1.287 ms** | **100.0%** | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1902 MB |
| `sorted_hnsw` (`hsvec`) | 1.404 ms | 100.0% | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1032 MB |
| pgvector HNSW (`halfvec`) | 2.031 ms | 99.8% | `ef_search=64`, index 804 MB, total 1615 MB |
| zvec HNSW | 50.499 ms | 100.0% | in-process collection, `ef=64`, ~1.12 GiB on disk |
| Qdrant HNSW | 6.028 ms | 99.2% | local Docker on same AWS host, `hnsw_ef=64`, 103,260 points |

The precision-matched PostgreSQL comparison on Gutenberg is now `sorted_hnsw (hsvec)` vs pgvector `halfvec`: `1.404 ms @ 100.0%` versus `2.031 ms @ 99.8%`, with total footprint `1032 MB` versus `1615 MB`. The raw fastest PostgreSQL row on this corpus is still `sorted_hnsw (svec)` at `1.287 ms`, but that uses float32 source storage. `sorted_hnsw` keeps the same 404 MB index in both cases because the AM stores SQ8 graph state; the storage gain from `hsvec` appears in the base table and TOAST footprint instead.

### Current local synthetic benchmark results (`10K x 384D`)

Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median `p50` / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.
| Method | p50 latency | Recall@10 | Notes | |--------|:-----------:|:---------:|-------| | Exact heap (`svec`) | 2.03 ms | 100% | Brute-force ground truth | | **sorted_hnsw** | **0.158 ms** | **100%** | `shared_cache=on`, `ef_search=96`, index ~5.4 MB | | pgvector HNSW (`vector`) | 0.446 ms | 90% median (90-95 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~2.0 MB | | zvec HNSW | 0.611 ms | 100% | local in-process collection, `ef=64` | | Qdrant HNSW | 1.94 ms | 100% | local Docker, `hnsw_ef=64` | ### Current local real-dataset sample (`nytimes-256-angular`) Repo-owned harness: - `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64` ANN-Benchmarks `nytimes-256-angular`, sampled to 10K base vectors and 20 queries, top-10. The table below reports medians across 3 full harness runs. Ground truth comes from exact PostgreSQL heap search on the sampled `svec` corpus. | Method | p50 latency | Recall@10 | Notes | |--------|:-----------:|:---------:|-------| | Exact heap (`svec`) | 1.557 ms | 100% | brute-force ground truth | | **sorted_hnsw** | **0.327 ms** | **85.0% median** (83.5-85.5 range) | `shared_cache=on`, `ef_search=96`, index ~4.1 MB | | pgvector HNSW (`vector`) | 0.751 ms | 79.0% median (78.5-79.0 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~13 MB | | zvec HNSW | 0.403 ms | 99.5% | local in-process collection, `ef=64`, ~14.1 MB on disk | | Qdrant HNSW | 1.704 ms | 99.5% | local Docker, `hnsw_ef=64` | This corpus is materially harder than the deterministic synthetic one. It is a better signal for default-parameter recall, while the synthetic table remains useful for controlled same-host engine comparisons and regression tracking. ### Current local GraphRAG benchmark (`person -> parent -> city`, stable fact contract) Repo-owned harness: - `python3 scripts/bench_graph_rag_multihop.py --num-pairs 5000 --query-count 64 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --pgv-ef-search 64 --zvec-ef 64 --qdrant-ef 64 --shared-buffers-mb 64 --backend-mode fresh` Deterministic fact graph, `5K` chains / `10K` rows total, `384D`, top-10. This is the current balanced stable GraphRAG point for the narrow fact-shaped fact-retrieval workload. The stable-facing SQL entry point is `sorted_heap_graph_rag(...)`; the table below also shows the underlying helper and wrapper paths because the harness measures the dispatched execution path directly. 
| Method | p50 latency | hit@1 | hit@k | Notes | |--------|:-----------:|:-----:|:-----:|-------| | Heap two-hop SQL | 0.762 ms | 75.0% | 96.9% | exact rerank over expanded heap set | | sorted_heap_graph_rag_twohop_scan() | 0.720 ms | 75.0% | 96.9% | old city-only rerank contract | | sorted_heap_expand_twohop_rerank() | 0.712 ms | 75.0% | 96.9% | same city-only seed point | | sorted_heap SQL `pathsum` baseline | 0.847 ms | 98.4% | 98.4% | `hop1_distance + hop2_distance` in SQL | | sorted_heap_graph_rag_twohop_path_scan() | 0.739 ms | 98.4% | 98.4% | fused path-aware wrapper | | **sorted_heap_expand_twohop_path_rerank()** | **0.726 ms** | **98.4%** | **98.4%** | same knobs, fused path-aware helper | | pgvector HNSW + heap expansion | 2.588 ms | 90.6% | 90.6% | path-aware rerank, `ef_search=64` | | zvec HNSW + heap expansion | 2.507 ms | 100.0% | 100.0% | path-aware rerank, `ef=64` | | Qdrant HNSW + heap expansion | 4.947 ms | 100.0% | 100.0% | path-aware rerank, `hnsw_ef=64` | The path-aware helper changes the local conclusion materially: the dominant quality issue on this fact-shaped workload was the old hop-2-only rerank contract, not seed ANN quality. The fused path-aware helper now gives the best local latency/quality point at the same `m=24`, `ef_construction=200`, `ann_k=64`, `ef_search=128` operating point. Under the same path-aware scorer contract, the current local conclusion gets sharper: `sorted_heap` keeps the latency lead, while `zvec` and Qdrant reach the strongest observed answer quality on this deterministic fact graph. A repeated-build local protocol then quantified how much of this is just single-run luck. Using three independent fresh builds of the same `5K`/`384D` balanced point: - `sorted_heap_expand_twohop_path_rerank()` median `0.798 ms`, range `0.771-0.819 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` on every build - `sorted_heap_graph_rag_twohop_path_scan()` median `0.796 ms`, range `0.778-0.804 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` on every build - `pgvector` path-aware parity row median `1.405 ms`, `hit@1/hit@k` `85.9-89.1%` - `zvec` path-aware parity row median `1.076 ms`, `100.0% / 100.0%` - `Qdrant` path-aware parity row median `2.799 ms`, `100.0% / 100.0%` So on the local balanced point, the current `sorted_heap` path-aware rows are not a fragile one-off. The latency band is tight, and the answer quality did not drift across the three rebuilds. ### `shared_cache=on` vs `off` for multihop (post-fix, 0490cc4) A multi-index shared-cache corruption bug was fixed in commit `0490cc4`. Previously, when two or more `sorted_hnsw` indexes existed in the same database and `shared_cache=on`, a publish for the second index could overwrite shared memory that an attached cache for the first index was still reading through bare pointers. This caused silent data corruption and 0% retrieval quality on the affected index. The fix deep-copies all data (L0 neighbors, SQ8 vectors, upper-level neighbor slabs) from shared memory into local palloc'd buffers on attach. The exact failure mode is regression-guarded by the multi-index overwrite phase (B5) in `scripts/test_hnsw_chunked_cache.sh`. 
Post-fix verification on the `5K x 384D` multihop benchmark (fresh backends, 3 runs, `ann_k=64`, `ef_search=128`, `m=24`): ``` python3 scripts/bench_graph_rag_multihop.py \ --num-pairs 5000 --dim 384 --query-count 64 --runs 3 \ --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \ --shared-cache {on,off} --backend-mode fresh \ --skip-zvec --skip-qdrant --skip-pgvector ``` | Method | shared_cache | p50 | hit@1 | hit@k | |--------|:----------:|:------:|:-----:|:-----:| | sorted_heap_expand_twohop_path_rerank() | off | 0.780 ms | 98.4% | 98.4% | | sorted_heap_expand_twohop_path_rerank() | on | 0.766 ms | 98.4% | 98.4% | | sorted_heap_graph_rag_twohop_path_scan() | off | 0.804 ms | 98.4% | 98.4% | | sorted_heap_graph_rag_twohop_path_scan() | on | 0.786 ms | 98.4% | 98.4% | Also verified at `10K x 384D`: ``` python3 scripts/bench_graph_rag_multihop.py \ --num-pairs 10000 --dim 384 --query-count 64 --runs 3 \ --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \ --shared-cache {on,off} --backend-mode fresh \ --skip-zvec --skip-qdrant --skip-pgvector ``` | Method | shared_cache | p50 | hit@1 | hit@k | |--------|:----------:|:------:|:-----:|:-----:| | sorted_heap_graph_rag_twohop_path_scan() | off | 0.904 ms | 95.3% | 96.9% | | sorted_heap_graph_rag_twohop_path_scan() | on | 0.972 ms | 95.3% | 96.9% | Quality is identical between `on` and `off` at both measured scales. Latency at these measured scales is mixed rather than a clear universal win. `shared_cache=off` is no longer needed as a correctness workaround. **Larger-scale cold-start measurement (`100K x 32D`, 200K rows, 47 MB index):** The deep-copy fix means attach now copies all L0 neighbors (~38 MB), SQ8 data (~6 MB), and node metadata (~6 MB) from shared memory into local buffers on each fresh backend's first query. At 200K nodes, this upfront cost exceeds the lazy page-decode path used by `shared_cache=off`: ``` # 100K pairs, 32D, m=24, ef_construction=200, ef_search=128, 20 fresh backends per mode # PG buffer cache warm (shared_buffers=256MB): shared_cache=off: cold KNN p50=6.31ms shared_cache=on: cold KNN p50=8.68ms # PG buffer cache cold (PG restarted between modes, shared_buffers=64MB): shared_cache=off: cold KNN first=5.3ms p50=5.7ms shared_cache=on: cold KNN first=8.6ms p50=9.5ms ``` The `off` path loads L0 pages lazily (only pages visited by beam search). The `on` path deep-copies all bulk data upfront under LWLock. At 200K nodes, the upfront copy dominates. Quality remains identical. Current conclusion: `shared_cache=on` is a correctness-safe default but not a performance feature at the measured scales. The deep-copy overhead from the multi-index fix neutralized the original latency benefit. ### Synthetic multi-hop depth scaling (`relation_path` depth 1..5) Repo-owned harness: - `python3 scripts/bench_graph_rag_multidepth.py --num-pairs 5000 --max-depth 5 --query-count 32 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --shared-buffers-mb 64` Deterministic chain-shaped fact graph, `25K` rows total, `384D`, top-10, fresh backend, path-aware scorer. 
This is a narrow scaling check for the generic unified syntax: - `relation_path := ARRAY[1]` - `relation_path := ARRAY[1,2]` - `relation_path := ARRAY[1,2,3]` - `relation_path := ARRAY[1,2,3,4]` - `relation_path := ARRAY[1,2,3,4,5]` | Method | depth 1 | depth 2 | depth 3 | depth 4 | depth 5 | Quality | |--------|:-------:|:-------:|:-------:|:-------:|:-------:|---------| | Heap SQL path baseline | 0.573 ms | 0.589 ms | 0.613 ms | 0.622 ms | 0.622 ms | `100.0% / 100.0%` | | sorted_heap SQL path baseline | 0.774 ms | 0.732 ms | 0.776 ms | 0.712 ms | 0.772 ms | `100.0% / 100.0%` | | **sorted_heap_graph_rag(..., score_mode := 'path')** | **0.674 ms** | **0.651 ms** | **0.643 ms** | **0.633 ms** | **0.649 ms** | **`100.0% / 100.0%`** | On this synthetic chain benchmark, the unified path-aware function does not show a latency cliff through depth `5`; it stays in the same `~0.63-0.67 ms` band while preserving `100.0% / 100.0%` quality. This is a controlled scaling signal, not a claim about arbitrary deep graph workloads. ### Larger synthetic GraphRAG scale envelope (`1M` and `10M`) Repo-owned harnesses: - local: `python3 scripts/bench_graph_rag_multidepth.py --num-pairs 200000 --max-depth 5 --query-count 8 --runs 1 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 64 --m 24 --shared-buffers-mb 64 --max-wal-size-gb 32 --maintenance-work-mem-mb 4096 --table-scope sorted_heap_only` - AWS: `NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=1 RUNS=1 DIM=32 ANN_K=16 TOP_K=10 EF_SEARCH=32 EF_CONSTRUCTION=8 M=8 SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 MAINTENANCE_WORK_MEM_MB=2048 TABLE_SCOPE=sorted_heap_only ./scripts/bench_graph_rag_multidepth_aws.sh /path/to/repo 65494` Current verified scale picture: - local `1M` rows (`200K` pairs x `5` hops, `384D`, cheap-build point): - `generate_csv`: `162.667 s` - `load_data`: `45.694 s` - `build_indexes`: `377.968 s` - unified path-aware `sorted_heap_graph_rag(...)` p50: - depth `1`: `1.250 ms` - depth `2`: `1.429 ms` - depth `3`: `1.777 ms` - depth `4`: `2.181 ms` - depth `5`: `2.604 ms` - quality stayed `100.0% / 100.0%` at every depth - local `10M x 384D` no longer dies in Python generation after the streaming CSV change; the process stayed near `31 MiB RSS` on the early run where the old generator previously blew up. - local `10M x 64D` and AWS `10M x 32D` both survive generation and load, but the first practical frontier is still the single `sorted_hnsw CREATE INDEX` build, not query execution. 
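For reference, the unified path-aware call exercised at each depth above, and again in the per-depth `sorted_heap_graph_rag(...)` p50 bullets just listed, has roughly this shape. Only the function name, `relation_path`, and `score_mode := 'path'` come from these runs; the remaining parameter names and the stand-in query embedding are illustrative assumptions, not the documented signature:

```sql
-- Hypothetical call shape only. relation_path and score_mode := 'path' are taken
-- from the benchmark text; every other parameter name here is an assumption.
SELECT *
FROM sorted_heap_graph_rag(
       'facts_sh',                                   -- concrete sorted_heap fact table
       (SELECT embedding FROM facts_sh LIMIT 1),     -- stand-in query embedding
       relation_path := ARRAY[1, 2, 3],              -- follow relation 1, then 2, then 3
       score_mode    := 'path',                      -- rank by summed per-hop distance
       top_k         := 10);
```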
The strongest bounded AWS signal so far is: - `10M x 32D` (`2M` pairs x `5` hops, ultralight build point) - `generate_csv`: `223.114 s` - CSV size: `3.63 GiB` - `load_data`: `485.177 s` - on the old build path, it then entered a long `CREATE INDEX` that was still active after `~11-12` minutes in the index-build phase - temp-cluster footprint reached about `17-18 GiB` with `11-13 GiB` disk still free on the AWS host After removing the per-search `visited[]` allocation/zeroing from the hot HNSW build loop, the same local diagnostic point that previously established the bottleneck changed materially: - local `500K x 32D` (`100K` pairs x `5` hops, `m=8`, `ef_construction=8`) - old total build: `18.1-18.7 s` - old graph-construction phase: about `18.27 s` - new total build: `2.777-2.996 s` - new graph-construction phase: about `2.42-2.59 s` - page-writing tail stayed small (`~0.20-0.24 s`) That moved the AWS `10M x 32D` scale branch from "still stuck in CREATE INDEX" to the first real `10M` query pass on the same cheap-build point: - `generate_csv`: `222.814 s` - `load_data`: `468.923 s` - `build_indexes`: `212.553 s` - `analyze`: `14.562 s` - unified path-aware `sorted_heap_graph_rag(...)` p50: - depth `1`: `2.124 ms` - depth `2`: `3.433 ms` - depth `3`: `4.611 ms` - depth `4`: `5.846 ms` - depth `5`: `5.139 ms` This is **not** a release-quality GraphRAG point. The build/query knobs here were intentionally ultra-light (`ann_k=16`, `ef_search=32`, `ef_construction=8`, `m=8`) to get the first `10M` latency envelope. At those settings, answer quality on the single-query AWS probe was poor (`0.0% / 0.0%`) and should be treated as a scale smoke, not a publishable quality frontier. I then kept that AWS temp cluster alive and ran query-only sweeps on the same cheap-built `10M x 32D` graph to test whether query-time tuning alone could recover depth-`5` quality without paying for another `CREATE INDEX`. It did not: - `ef_search=128`, `ann_k=64` -> depth-`5` `2.626 ms`, `0.0% / 0.0%` - `ef_search=128`, `ann_k=128` -> depth-`5` `3.040 ms`, `0.0% / 0.0%` - `ef_search=256`, `ann_k=128` -> depth-`5` `3.284 ms`, `0.0% / 0.0%` - `ef_search=256`, `ann_k=256` -> depth-`5` `4.076 ms`, returned `10` rows, but still `0.0% / 0.0%` - one pathological point, `ef_search=32`, `ann_k=64`, spiked to `1833.809 ms` and `259709.5` shared reads while still returning `0.0% / 0.0%` I then ran the same question on a smaller but shape-matched local `1M x 32D` graph (`200K` pairs x `5` hops) to separate build quality from query contract: - cheap local build points (`m=8`, `ef_construction=8/32`, `m=16`, `ef_construction=32/64`) stayed weak at the old narrow query contract (`ann_k=64`, `top_k=10`), topping out around `25.0% hit@k` - but with a wider query contract on the same `1M x 32D` graph (`ann_k=256`, `top_k=32`, `ef_search=128`), both - the strong build (`m=24`, `ef_construction=200`), and - the much cheaper build (`m=16`, `ef_construction=64`) reached the same `96.9% hit@k` over `32` queries - exact heap seeds on that `1M x 32D` graph matched the ANN result exactly at the widened point, so once the query budget is large enough, heavier builds were no longer the decisive lever there That suggested a second AWS `10M x 32D` probe on the same cheap build but with the widened local winner (`ann_k=256`, `top_k=32`, `ef_search=128`). 
The result still failed: - ANN path-aware `sorted_heap_graph_rag(...)`: - `top_k=10` -> `1832.345 ms`, `0.0% / 0.0%` - `top_k=32` -> `1832.755 ms`, `0.0% / 0.0%` - exact heap seeds + the same path-aware expansion/rerank contract: - `ann_k=256`, `top_k=32` -> `0.0% / 0.0%` So the cheap-build `10M x 32D` graph can produce latency numbers, but its quality is not recoverable by either: - larger query-time ANN budgets, or - exact heap seeds on the same low-dimensional corpus The `10M x 32D` frontier is therefore no longer "weak HNSW build quality." It is the low-dimensional scale contract itself. The next meaningful scale branch is a higher-dimensional `10M` point or a different retrieval contract, not another `m/ef_construction` tweak on the same `32D` setup. The first higher-dimensional calibration point was `64D`, and it changed the local picture immediately. On a local `1M x 64D` graph (`200K` pairs x `5` hops), a relatively cheap build (`m=16`, `ef_construction=64`) plus the wider query contract (`ann_k=256`, `top_k=32`, `ef_search=128`) gave: - ANN path-aware `sorted_heap_graph_rag(...)`: `65.6% hit@1`, `96.9% hit@k` - exact heap seeds + the same path-aware expansion/rerank contract: `65.6% hit@1`, `96.9% hit@k` At `top_k=10`, the same `1M x 64D` point already held `96.9% hit@k`, so unlike `32D`, the result-budget cliff largely disappeared there. The first file-backed `10M x 64D` AWS attempt on the current `ubuntu@dev.rigelstar.com` host (`4 vCPU`, `8 GiB RAM`) failed for operational reasons before query timing: - `generate_csv`: `414.397 s` - CSV size: `6.67 GiB` - temp dir grew to about `28 GiB` - `/tmp` fell to `1.3 GiB` free (`99%` used) while the run was still active That led to a footprint reduction pass in the harness: - stream fact rows directly into `COPY` instead of materializing a giant CSV - drop `facts_heap` after loading `facts_sh` in `sorted_heap_only` mode - allow query-only reuse from a kept temp cluster - when `facts_heap` is not needed, copy rows straight into `facts_sh` before `sorted_heap_compact(...)` instead of staging through `facts_heap` and `INSERT .. ORDER BY` That direct `facts_sh` load path held up on bounded local checks: - `200K` rows (`40K` pairs x `5` hops, `64D`, `hop_weight=0.05`) - old staged path: `6321.453 ms` load, depth-5 unified GraphRAG `24.504 ms`, `100.0% / 100.0%` - direct `COPY facts_sh`: `5638.134 ms` load, depth-5 unified GraphRAG `24.312 ms`, `100.0% / 100.0%` - `1M` rows (`200K` pairs x `5` hops, `64D`, load only) - old staged path: `31.392 s` - direct `COPY facts_sh`: `28.231 s` So the current conclusion is narrow but useful: for synthetic `sorted_heap_only` multidepth runs, direct `COPY facts_sh` is a real `~10%` ingest win without giving up the compacted query-time locality that the earlier "skip compaction" falsifier showed we still need. I also bounded the obvious follow-up on the same direct ordered load: parameterizing the post-load maintenance step as `none`, `merge`, or `compact`. - `200K` rows (`40K` pairs x `5` hops, `64D`, `hop_weight=0.05`) - `none`: `5159.888 ms` load, depth-5 unified GraphRAG `62.148 ms` - `merge`: `5749.190 ms` load, depth-5 unified GraphRAG `23.277 ms` - `compact`: `5626.887 ms` load, depth-5 unified GraphRAG `24.591 ms` - `1M` rows (`200K` pairs x `5` hops, `64D`, load only) - `none`: `25.820 s` - `merge`: `28.142 s` - `compact`: `28.108 s` So `merge` is now exposed as an experiment knob in the multidepth harness, but it is not a proven new default. 
The current evidence says: - `none` is too expensive at query time - `merge` is viable on ordered synthetic loads - `merge` does not materially beat `compact` on the larger `1M` load point That leaves `compact` as the stable default for large-scale multidepth runs. With that lower-footprint path, the same `10M x 64D` AWS point advanced materially further on the same host: - streamed `generate_csv`: `0.000 s` (`csv_bytes=0`, no materialized CSV) - `load_data`: `916.030 s` - temp dir plateaued around `19 GiB` - filesystem still had about `11 GiB` free at the start of `CREATE INDEX` The next frontier was no longer disk headroom. On the same `10M x 64D` cheap-build point (`m=16`, `ef_construction=64`), the old local scan-cache seeding path then failed inside PostgreSQL: - `CREATE INDEX facts_sh_ann_idx ON facts_sh USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64)` - `ERROR: invalid memory alloc request size 1280000000` That request size matched the old contiguous local L0 neighbor cache layout (`n_nodes * 2 * M * sizeof(int32)` at `10M x M=16`), so the next fix was to replace local `l0_neighbors + sq8_data` slabs with page-backed storage. Shared immutable scan caches remain contiguous; local build seeding and local `shnsw_load_cache()` no longer require a single giant L0 allocation. With the chunked local-cache fix in place, the next retained branch was the constrained-memory rerun via: - `sorted_hnsw.build_sq8 = on` - lower-hop synthetic contract (`hop_weight = 0.05`) On the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), the monolithic `10M x 64D` point then completed cleanly: - `load_data`: `787.809 s` - `build_indexes`: `846.795 s` - kept temp cluster: `/home/ubuntu/graphrag_tmp/graph_rag_y8cuntf9` That is a real constrained-memory result: the same host that previously hit early large-scale frontiers now built the full monolithic `sorted_hnsw` index on `10,000,000` rows without being OOM-killed. The first query-only reuse pass on that exact built graph (`query_count=4`, `runs=1`, `ann_k=256`, `top_k=32`, `ef_search=128`, `sorted_heap_only`) gave: - depth 1 - SQL path: `1072.525 ms`, `100.0% / 100.0%` - unified GraphRAG path: `840.607 ms`, `100.0% / 100.0%` - depth 2 - SQL path: `1070.437 ms`, `75.0% / 100.0%` - unified GraphRAG path: `2055.479 ms`, `75.0% / 100.0%` - depth 3 - SQL path: `1050.793 ms`, `50.0% / 100.0%` - unified GraphRAG path: `2070.894 ms`, `50.0% / 100.0%` - depth 4 - SQL path: `1080.123 ms`, `50.0% / 100.0%` - unified GraphRAG path: `2066.717 ms`, `50.0% / 100.0%` - depth 5 - SQL path: `1064.405 ms`, `75.0% / 100.0%` - unified GraphRAG path: `2084.155 ms`, `75.0% / 100.0%` So the `10M x 64D` monolithic branch is now narrowed sharply: - constrained-memory monolithic build is **viable** - quality stayed aligned between the SQL baseline and the unified GraphRAG path - the remaining issue is not build survival and not quality drift - the remaining issue is latency: at depth `2+`, the current generic unified GraphRAG path is still about `2x` slower than the SQL path baseline on this host and graph That is exactly why the next scale story should move toward segmentation + pruning rather than trying to turn one monolithic `10M` graph into the final operating model for small hosts. I then added an opt-in stage breakdown path to the multidepth harness via `--report-stage-stats`, using `sorted_heap_graph_rag_stats()` after each unified path-aware call. 
On the local `1M x 64D` lower-hop point (`hop_weight=0.05`, `ann_k=256`, `top_k=32`, `ef_search=128`), that narrowed the runtime picture sharply: - depth 2: - end-to-end unified GraphRAG: `109.222 ms` - internal stage stats: - `ann_ms = 107.590` - `expand_ms = 0.369` - `rerank_ms = 0.004` - depth 5: - end-to-end unified GraphRAG: `110.507 ms` - internal stage stats: - `ann_ms = 109.178` - `expand_ms = 0.691` - `rerank_ms = 0.011` So the current `1M x 64D` depth cost is not expansion-bound. On the widened contract, almost all measured time is already in the ANN seed stage; the multi-hop expansion and rerank work remain sub-millisecond. I then added an opt-in low-memory `sorted_hnsw` build path via `SET sorted_hnsw.build_sq8 = on`. This keeps the final on-disk/index-query contract the same, but the graph is built from SQ8-compressed build vectors instead of a full float32 build slab. The tradeoff is one extra heap scan during `CREATE INDEX`. Bounded local A/B on the same `1M x 64D` lower-hop point (`m=16`, `ef_construction=64`, `ann_k=256`, `top_k=32`, `ef_search=128`, `sorted_heap_only`, streamed load): - `build_sq8=off` - `load_data`: `28.044 s` - `build_indexes`: `48.606 s` - unified depth-5 GraphRAG: `111.578 ms`, `87.5% / 100.0%` - `build_sq8=on` - `load_data`: `27.935 s` - `build_indexes`: `46.541 s` - unified depth-5 GraphRAG: `110.541 ms`, `87.5% / 100.0%` So the first retained result is narrow but real: - the low-memory build path did **not** regress the measured `1M x 64D` GraphRAG quality point - it slightly improved build time on that point instead of paying a visible penalty for the extra heap scan - and by construction it cuts the build-vector slab from `4 * N * D` bytes to `1 * N * D` - `10M x 64D`: about `2.56 GiB -> 0.64 GiB` - `10M x 384D`: about `15.36 GiB -> 3.84 GiB` So the narrow conclusion is: - the multi-hop GraphRAG path itself survives to at least `1M` rows and gives measured query latencies there - the old `10M` frontier was **build-bound**, not Python-memory-bound, not CSV-generation-bound, and not an early PostgreSQL failure - the current build optimization moved that frontier enough to obtain the first real `10M` query numbers on a cheap-build point - the remaining `10M x 32D` problem is no longer "can it finish?" and no longer just "is the HNSW build too weak?" - the stronger falsifier is that even exact seeds fail at that scale and dimensionality, so the next branch is a different scale contract, not a narrower HNSW tuning loop - `64D` is the first scale contract that looks healthy locally at `1M` - on the current AWS host, the `10M x 64D` branch now clears ingest, build, and the first query pass; the remaining cheap-build frontier is deep-path quality, not allocator failure I then added a segmented multidepth harness in `scripts/bench_graph_rag_multidepth_segmented.py`. This is the first partitioning benchmark, but it is intentionally **harness-side**, not an extension API: current GraphRAG helpers require a concrete `sorted_heap` table, so the benchmark fans out across multiple `sorted_heap` shards and merges the shard-local top-k rows in Python. 
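Conceptually, that per-query fanout is a shard-local top-k followed by a global merge of the shard rows. A minimal SQL sketch of the shape, assuming two illustrative shard tables and pgvector-style `<=>` distance purely for readability; the harness itself performs this merge in Python over the GraphRAG helper output:

```sql
-- Illustrative only: shard table names, the reuse of a stored embedding as the
-- stand-in query vector, and the pgvector-style <=> operator are assumptions.
WITH q AS (SELECT embedding AS v FROM facts_sh_0 LIMIT 1)   -- stand-in query vector
(SELECT id, embedding <=> (SELECT v FROM q) AS dist
   FROM facts_sh_0 ORDER BY dist LIMIT 32)                  -- shard-local top-k
UNION ALL
(SELECT id, embedding <=> (SELECT v FROM q) AS dist
   FROM facts_sh_1 ORDER BY dist LIMIT 32)
ORDER BY dist                                               -- global merge of shard results
LIMIT 32;                                                   -- final top_k
```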
The local `1M x 64D` lower-hop point (`200K` pairs, `5` hops, `32` queries, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`) gives a clean first answer: - monolithic unified GraphRAG: - depth 1: `50.104 ms`, `100.0% / 100.0%` - depth 5: `121.524 ms`, `81.2% / 100.0%` - segmented into `8` shards, **all-shard fanout**: - build_indexes: `57.885 s` versus monolithic `45.714 s` - depth 1: `87.677 ms`, `100.0% / 100.0%` - depth 5: `142.472 ms`, `81.2% / 100.0%` - segmented into `8` shards, **exact routing** to the owning shard: - depth 1: `10.574 ms`, `100.0% / 100.0%` - depth 5: `16.822 ms`, `100.0% / 100.0%` That makes the partitioning lesson explicit: - segmentation by itself is **not** a free latency win - all-shard fanout preserves quality on this benchmark, but pays a clear fanout tax - the real win appears only when routing/pruning can avoid most shards - so the next scale story should be "partitioning + pruning contract", not just "more shards" This is the correct shape for large knowledge bases: - on constrained hosts, sharding bounds per-index build memory - on real workloads, the query path must also prune by tenant / knowledge-base / relation family / time window (or a future segment router), otherwise the system only trades one monolith for a broad fanout query That local result transferred cleanly to the constrained AWS host on the first full `10M x 64D` streamed segmented run (`2M` pairs, `5` hops, `8` shards, `hop_weight=0.05`, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`): - streamed segmented build-only: - `generate_csv`: `0.000 s` - `load_data`: `500.474 s` - `build_indexes`: `784.778 s` - segmented, `route=exact` query-only reuse on the same built graph: - depth 1: `126.057 ms`, `100.0% / 100.0%` - depth 2: `261.986 ms`, `75.0% / 100.0%` - depth 3: `259.794 ms`, `75.0% / 100.0%` - depth 4: `258.879 ms`, `75.0% / 100.0%` - depth 5: `258.766 ms`, `100.0% / 100.0%` - segmented, `route=all` query-only reuse on the same built graph: - depth 1: `898.440 ms`, `100.0% / 100.0%` - depth 2: `2090.866 ms`, `75.0% / 100.0%` - depth 3: `2089.650 ms`, `50.0% / 100.0%` - depth 4: `2088.114 ms`, `50.0% / 100.0%` - depth 5: `2093.652 ms`, `75.0% / 100.0%` That gives the first full large-scale constrained-memory comparison on the same AWS box: - monolithic `10M x 64D`, low-memory build: - depth 1 unified GraphRAG: `840.607 ms`, `100.0% / 100.0%` - depth 5 unified GraphRAG: `2084.155 ms`, `75.0% / 100.0%` - segmented `10M x 64D`, `route=all`: - effectively the same latency/quality envelope as the monolith - segmented `10M x 64D`, `route=exact`: - about `6.7x` faster than the monolith at depth 1 - about `8.1x` faster than the monolith at depth 5 - and still stable at `100.0% / 100.0%` for depth 5 on this benchmark So the large-scale result now matches the local `1M` lesson: - segmentation without pruning is not a performance story - streamed segmented ingest is operationally better than front-loaded shard CSV materialization - segmented routing is the first constrained-memory scale path that improves both build viability and query latency on the same host The next local step was to move routing out of benchmark-only Python and into a real SQL reference path. 
On a kept local `5K`-row segmented smoke cluster (`4` shards, `64D`, `ann_k=32`, `top_k=8`, `ef_search=64`), the new range-routed SQL path matched the segmented SQL merge baseline on quality and row counts: - segmented SQL merge, `route=exact`, depth 5: - `0.215 ms`, `100.0% / 100.0%` - metadata-routed SQL wrapper, same exact-route point: - `0.245 ms`, `100.0% / 100.0%` So the current reference stack is now: - `sorted_heap_graph_rag_segmented(...)` for explicit candidate shard arrays - `sorted_heap_graph_rag_routed(...)` for simple metadata-driven range routing The next local routing pass added an exact-key companion for tenant / KB style selection. On another kept local `5K`-row segmented smoke cluster (`4` shards, `64D`, `ann_k=32`, `top_k=8`, `ef_search=64`), the exact-key routed path stayed aligned with the exact-route segmented SQL merge path: - exact-route segmented SQL merge, depth 5: - `0.183 ms`, `100.0% / 100.0%` - exact-key routed SQL wrapper, same point: - `0.202 ms`, `100.0% / 100.0%` That is still not the final routing story for large knowledge bases, but it is the first usable SQL-level bridge from harness-side segmentation to productized segmented GraphRAG, with both: - range-based routing - exact-key routing ### Monolithic vs segmented exact-routed: constrained-memory `10M x 64D` comparison This table consolidates the AWS `10M x 64D` results already reported above (monolithic in "Larger synthetic GraphRAG scale envelope", segmented in "Segmented GraphRAG at scale") into a direct side-by-side. Both runs used the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), the same workload (`2M` pairs x `5` hops, `hop_weight=0.05`), and the same build/query knobs (`m=16`, `ef_construction=64`, `build_sq8=on`, `ann_k=256`, `top_k=32`, `ef_search=128`). Segmented mode used `8` shards. | | Monolithic | Segmented exact | Segmented all | |---|:---:|:---:|:---:| | load_data | 787.8 s | 500.5 s | — | | build_indexes | 846.8 s | 784.8 s | — | | depth 1 p50 | 840.6 ms | **126.1 ms** | 898.4 ms | | depth 5 p50 | 2084.2 ms | **258.8 ms** | 2093.7 ms | | depth 1 hit@1/hit@k | 100%/100% | 100%/100% | 100%/100% | | depth 5 hit@1/hit@k | 75%/100% | **100%/100%** | 75%/100% | | speedup at depth 5 | 1x | **8.1x** | ~1x | Exact commands (monolithic build + query-only reuse): ``` NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on TABLE_SCOPE=sorted_heap_only \ ./scripts/bench_graph_rag_multidepth_aws.sh 65493 ``` Exact commands (segmented build + query-only reuse): ``` NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on SHARDS=8 ROUTE=both \ ./scripts/bench_graph_rag_multidepth_segmented_aws.sh 65494 ``` Conclusion: on the measured `10M x 64D` constrained-memory point, segmented exact routing is **8.1x faster** at depth 5 with **better quality** (100%/100% vs 75%/100%), while build time is comparable. All-shard fanout offers no latency benefit — the win comes entirely from shard pruning. This is not a universal claim. Exact routing requires knowing which shard owns the query entity, which maps to tenant-id / knowledge-base-id style workloads. Queries that cannot be pruned to a single shard will see all-shard-fanout latency (~1x monolithic). 
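For reference, the two SQL entry points from the reference stack above map onto this routing contract roughly as follows. These are hypothetical call shapes: only the function names come from the text, and every parameter name and value below is an illustrative assumption, not the actual API:

```sql
-- Hypothetical call shapes only; parameter names and values are assumptions.
-- Explicit candidate shards (the caller already knows which shards may own the answer):
SELECT * FROM sorted_heap_graph_rag_segmented(
    ARRAY['facts_sh_0', 'facts_sh_3'],                 -- candidate shard tables
    (SELECT embedding FROM facts_sh_0 LIMIT 1),        -- stand-in query embedding
    relation_path := ARRAY[1, 2, 3, 4, 5],
    score_mode    := 'path',
    top_k         := 32);

-- Metadata-driven routing: the wrapper picks candidate shards from a routing key
-- (range-based or exact-key, e.g. a tenant or knowledge-base identifier):
SELECT * FROM sorted_heap_graph_rag_routed(
    'tenant_42',                                       -- routing key (assumed shape)
    (SELECT embedding FROM facts_sh_0 LIMIT 1),
    relation_path := ARRAY[1, 2, 3, 4, 5],
    score_mode    := 'path',
    top_k         := 32);
```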
### Bounded fanout: routing robustness on `1M x 64D` This measures how the segmented win degrades when routing is imperfect. Instead of routing to exactly one shard (exact) or all shards (all), bounded fanout routes to K adjacent shards (always including the correct one). This simulates imperfect routing where the router knows roughly which shard, but hedges by including neighbors. Local, `200K` pairs x `5` hops, `64D`, `8` shards, `32` queries, `3` runs, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`, `hop_weight=0.05`: ``` python3 scripts/bench_graph_rag_multidepth_segmented.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --shards 8 --route {exact,bounded,all} --fanout {2,4} --hop-weight 0.05 ``` Monolithic baseline (`sorted_heap_graph_rag(...)` on same workload): ``` python3 scripts/bench_graph_rag_multidepth.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --hop-weight 0.05 --table-scope sorted_heap_only ``` Depth-5 comparison (the hardest hop): | Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | Speedup vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 120.7 ms | 81.2% | 100.0% | 1x | | segmented all (8/8) | 8 | 149.5 ms | 81.2% | 100.0% | 0.8x | | segmented bounded (4/8) | 4 | 74.1 ms | 93.8% | 100.0% | 1.6x | | segmented bounded (2/8) | 2 | 35.9 ms | 96.9% | 100.0% | 3.4x | | segmented exact (1/8) | 1 | 17.2 ms | 100.0% | 100.0% | **7.0x** | Depth-1 comparison: | Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) | |---|:---:|:---:|:---:|:---:| | monolithic | 1 table | 49.4 ms | 100.0% | 100.0% | | segmented all (8/8) | 8 | 94.3 ms | 100.0% | 100.0% | | segmented bounded (4/8) | 4 | 46.3 ms | 100.0% | 100.0% | | segmented bounded (2/8) | 2 | 22.8 ms | 100.0% | 100.0% | | segmented exact (1/8) | 1 | 11.3 ms | 100.0% | 100.0% | Observations: - Latency scales roughly linearly with the number of shards hit. - Quality degrades gracefully: bounded(2/8) still reaches 96.9% hit@1 at depth 5 vs 100.0% for exact, compared to 81.2% for monolithic/all. - Even bounded(4/8) at half the shards is 1.6x faster than monolithic and has better quality (93.8% vs 81.2% hit@1). - All-shard fanout is worse than monolithic (fanout overhead dominates). - The win is not exact-or-nothing — bounded fanout preserves most of the benefit even with imperfect routing. ### Bounded fanout at AWS `10M x 64D` scale Same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), same workload as the `10M x 64D` comparison above (`2M` pairs x `5` hops, `hop_weight=0.05`, `8` shards, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`). Build once, query-only reuse for each route mode. `4` queries, `1` run. Build: `load_data=500.7s`, `build_indexes=764.2s`. 
Query commands: ``` # Build once with --keep-temp --stop-after build_indexes NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on SHARDS=8 ROUTE=exact \ EXTRA_ARGS="--stream-copy --keep-temp --stop-after build_indexes" \ bash scripts/bench_graph_rag_multidepth_segmented_aws.sh \ 65494 # Query-only reuse for each route mode: ssh "cd && python3 scripts/bench_graph_rag_multidepth_segmented.py \ --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \ --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \ --shards 8 --route {exact,bounded,all} [--fanout {2,4}] \ --backend-mode fresh --reuse-temp " ``` Depth-5 comparison (monolithic baseline from prior section): | Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 2084 ms | 75% | 100% | 1x | | segmented all (8/8) | 8 | 2107 ms | 75% | 100% | ~1x | | segmented bounded (4/8) | 4 | 1042 ms | 75% | 100% | **2.0x** | | segmented bounded (2/8) | 2 | 520 ms | 100% | 100% | **4.0x** | | segmented exact (1/8) | 1 | 259 ms | 100% | 100% | **8.0x** | Depth-1 comparison: | Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) | vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 841 ms | 100% | 100% | 1x | | segmented all (8/8) | 8 | 920 ms | 100% | 100% | ~1x | | segmented bounded (4/8) | 4 | 473 ms | 100% | 100% | **1.8x** | | segmented bounded (2/8) | 2 | 233 ms | 100% | 100% | **3.6x** | | segmented exact (1/8) | 1 | 137 ms | 100% | 100% | **6.1x** | The bounded-fanout gradient from the local `1M` point transfers cleanly to `10M`: - Latency scales linearly with the number of shards hit at both scales. - bounded(2/8) at `10M` is 4.0x faster than monolithic at depth 5 — close to the local 3.4x ratio. - Even bounded(4/8) at half the shards gives a 2.0x win. - Quality: hit@k stays at 100% across all modes. hit@1 varies with fanout width but stays ≥75%. - All-shard fanout remains ~1x monolithic (fanout overhead cancels the per-shard size reduction). Conclusion: bounded fanout remains materially useful on this measured `10M` synthetic point. A router that can narrow to 2 out of 8 candidate shards captures ~50% of the exact-routing speedup. The performance gradient is smooth on this benchmark, not a cliff. ### Routing-miss tolerance on `1M x 64D` This measures what happens when the router sometimes picks the wrong shards — the correct shard is not always included in the candidate set. Uses `--route bounded_recall --fanout 2 --recall-pct N` where N% of queries get the correct shard and the rest get 2 adjacent wrong shards. Miss schedule is deterministic per-query via SHA-256 hash (seed=42). 
Local, `200K` pairs x `5` hops, `64D`, `8` shards, `32` queries, `3` runs, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`, `hop_weight=0.05`: ``` python3 scripts/bench_graph_rag_multidepth_segmented.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --shards 8 --route bounded_recall --fanout 2 \ --recall-pct {100,90,75,50,25,0} --hop-weight 0.05 ``` Depth-5 routing-miss tolerance (fanout=2, 8 shards): | Router recall | Queries hitting correct shard | p50 (d5) | hit@1 (d5) | hit@k (d5) | |:---:|:---:|:---:|:---:|:---:| | 100% | 32/32 | 54.5 ms | 96.9% | 100.0% | | 90% | 29/32 | 64.7 ms | 87.5% | 90.6% | | 75% | 23/32 | 57.0 ms | 68.8% | 71.9% | | 50% | 18/32 | 45.4 ms | 53.1% | 56.2% | | 25% | 10/32 | 79.1 ms | 31.2% | 31.2% | | 0% | 0/32 | 44.8 ms | 0.0% | 0.0% | | (ref: monolithic) | — | 120.7 ms | 81.2% | 100.0% | Observations: - Quality tracks router recall nearly linearly. There is no sharp cliff. - A router with 90% recall keeps 87.5% hit@1 — close to the monolithic baseline's 81.2%, while remaining 2-3x faster. - A router with 75% recall gives 68.8% hit@1 — below monolithic quality, so the latency win comes at a quality cost. - Latency stays roughly constant across recall levels because the fanout (2 shards) is fixed. The router-miss doesn't cost extra I/O — it just returns wrong answers. - This means: **routing quality determines answer quality, not latency**. The segmented latency win is structural (fewer rows per shard), but the quality win depends entirely on the router including the correct shard. ### Routing-miss tolerance at AWS `10M x 64D` Same AWS ARM64 host and workload as the bounded-fanout section above. Build once (`load=500.5s`, `build=803.5s`), query-only reuse. `4` queries, `1` run, `fanout=2`, `8` shards. ``` # Build: same as bounded-fanout section # Query-only reuse with bounded_recall: ssh "cd && python3 scripts/bench_graph_rag_multidepth_segmented.py \ --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \ --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \ --shards 8 --route bounded_recall --fanout 2 --recall-pct {90,75} \ --backend-mode fresh --reuse-temp " ``` Depth-5 routing-miss tolerance at `10M`: | Route mode | Router recall | p50 (d5) | hit@1 (d5) | hit@k (d5) | |---|:---:|:---:|:---:|:---:| | bounded optimistic | 100% (4/4 hit) | 519 ms | 100% | 100% | | bounded_recall | 90% (3/4 hit) | 521 ms | 75% | 75% | | bounded_recall | 75% (3/4 hit) | 522 ms | 75% | 75% | | (ref: monolithic) | — | 2084 ms | 75% | 100% | | (ref: exact) | 100% | 259 ms | 100% | 100% | The 90% and 75% recall points produce the same result because with only `4` queries, SHA-256 hash gives `3/4` correct-shard inclusion at both settings. The `1` missed query returns wrong results, giving `75%` hit@1. 
Despite limited query-count resolution, the directional signal is clear:

- Bounded(2/8) latency stays at ~520 ms regardless of recall (~4x faster than monolithic)
- One missed query out of four drops hit@1 from 100% to 75% — matching the monolithic baseline's own 75% hit@1
- hit@k drops from 100% to 75% (monolithic gets 100% hit@k because its broader candidate set is more likely to include the target)

This confirms the local finding: **routing quality determines answer quality, not latency.** The latency win is structural and survives routing errors. On the measured 4-query point the evidence is narrow: at 90% router recall, bounded routing matches the monolithic hit@1 (75%) while staying 4x faster, and the true crossover is not shown to be above that point. A higher `query_count` run would give finer crossover resolution.

### Current AWS GraphRAG benchmark (`person -> parent -> city`, stable fact contract)

Repo-owned harness:

- `REMOTE_PYTHON=/path/to/python QUERY_COUNT=64 RUNS=3 NUM_PAIRS=5000 DIM=384 ANN_K=64 TOP_K=10 EF_SEARCH=128 EF_CONSTRUCTION=200 M=24 PGV_EF_SEARCH=64 ZVEC_EF=64 QDRANT_EF=64 SHARED_BUFFERS_MB=64 BACKEND_MODE=fresh ./scripts/bench_graph_rag_multihop_aws.sh /path/to/repo 65492`

AWS ARM64 host (`4 vCPU`, `8 GiB RAM`), deterministic fact graph, `5K` chains / `10K` rows total, `384D`, top-10, `64` queries, `3` runs. This is the current portable stable multihop GraphRAG point for the narrow fact-shaped contract.

| Method | p50 latency | hit@1 | hit@k | Notes |
|--------|:-----------:|:-----:|:-----:|-------|
| Heap two-hop SQL | 1.088 ms | 75.0% | 96.9% | exact rerank over expanded heap set |
| sorted_heap_expand_twohop_rerank() | 0.952 ms | 75.0% | 96.9% | older city-only helper |
| sorted_heap_graph_rag_twohop_scan() | 1.012 ms | 75.0% | 96.9% | older city-only wrapper |
| sorted_heap SQL `pathsum` baseline | 1.204 ms | 98.4% | 98.4% | same ANN seeds, `hop1_distance + hop2_distance` |
| **sorted_heap_expand_twohop_path_rerank()** | **0.955 ms** | **98.4%** | **98.4%** | fused path-aware helper |
| sorted_heap_graph_rag_twohop_path_scan() | 1.018 ms | 98.4% | 98.4% | fused path-aware wrapper |
| pgvector HNSW + heap expansion | 1.422 ms | 85.9% | 85.9% | path-aware rerank, `ef_search=64` |
| zvec HNSW + heap expansion | 1.720 ms | 100.0% | 100.0% | path-aware rerank, `ef=64` |
| Qdrant HNSW + heap expansion | 3.435 ms | 100.0% | 100.0% | path-aware rerank, `hnsw_ef=64` |

The new AWS result matches the local diagnostic cleanly: the dominant quality loss on this workload was the old hop-2-only rerank contract, not the seed ANN frontier. The path-aware helper preserves the quality gain on ARM64 with only trivial latency cost versus the older helper. On the apples-to-apples path-aware contract, the portable frontier is now:

- `sorted_heap` fastest
- `zvec` and Qdrant strongest on answer quality
- `pgvector` still behind on both latency and quality at this operating point

One intermediate AWS all-engines rerun temporarily dropped the `sorted_heap` path-aware rows to `96.9%` / `96.9%`. An immediate `sorted_heap`-only control and a second full all-engines rerun both returned the stable `98.4% / 98.4%` point above, so the published table uses the confirmed rerun rather than the single outlier. An AWS repeated-build protocol then tightened that confidence band on the same balanced `5K` point.
Using three independent fresh builds: - `sorted_heap_expand_twohop_path_rerank()` median `0.962 ms`, range `0.956-0.965 ms`, `hit@1/hit@k = 98.4/98.4` on all three builds - `sorted_heap_graph_rag_twohop_path_scan()` median `1.025 ms`, range `1.018-1.043 ms`, `hit@1/hit@k = 98.4/98.4` on all three builds - `pgvector` path-aware parity row median `1.434 ms`, `hit@1/hit@k` `84.4-89.1` - `zvec` path-aware parity row median `1.711 ms`, `100.0/100.0` - `Qdrant` path-aware parity row median `3.355 ms`, `100.0/100.0` So on the current portable `5K` point, the earlier AWS outlier now looks like an anomaly rather than a broad instability. The balanced `sorted_heap` path-aware rows stayed fixed across all three rebuilds. The larger `10K`-chain AWS rerun now tells a different story than the older city-only benchmark. At the same portable point: - heap two-hop SQL: `1.319 ms`, `hit@1 71.9%`, `hit@k 92.2%` - city-only `sorted_heap_graph_rag_twohop_scan()`: `1.197 ms`, `hit@1 73.4%`, `hit@k 93.8%` - SQL `pathsum` baseline: `1.436 ms`, `hit@1 96.9%`, `hit@k 98.4%` - `sorted_heap_expand_twohop_path_rerank()`: `1.185 ms`, `hit@1 96.9%`, `hit@k 98.4%` - `sorted_heap_graph_rag_twohop_path_scan()`: `1.212 ms`, `hit@1 96.9%`, `hit@k 98.4%` So the old larger-scale caveat now narrows materially: the main `10K` loss was also the city-only rerank contract, not a fundamental collapse of the seed frontier at that scale. An AWS repeated-build protocol then checked whether the remaining `10K` difference was really a build-variance problem. Using three independent fresh builds on the same `10K` path-aware point: - `sorted_heap_expand_twohop_path_rerank()` median `1.177 ms`, range `1.148-1.191 ms`, `hit@1/hit@k = 95.3/96.9` on all three builds - `sorted_heap_graph_rag_twohop_path_scan()` median `1.236 ms`, range `1.211-1.240 ms`, `hit@1/hit@k = 95.3/96.9` on all three builds - `pgvector` path-aware parity row median `1.667 ms`, `hit@1/hit@k` `76.6-82.8` - `zvec` path-aware parity row median `2.788 ms`, `98.4/100.0` - `Qdrant` path-aware parity row median `3.818 ms`, `98.4/100.0` So the larger `10K` AWS point is now repeated-build stable too. The remaining issue is scale frontier, not build instability: the `10K` quality band is lower than `5K`, but it stayed fixed across fresh builds. An exact-seed diagnostic on the local `5K` and `10K` points did not improve `hit@1` or `hit@k` versus the ANN-seeded `sorted_heap` helper. So on this benchmark shape, the remaining gap is not explained by ANN approximation alone. The stronger result is that seed coverage itself was identical for ANN and exact seeds: `98.4%` at `5K` and `96.9%` at `10K`. So the remaining loss is downstream of seeding. The new rerank-rank diagnostic narrows that further: at `5K`, the correct city is still within the top 6 for 95% of reachable queries, and at `10K` it is still within the top 3 for 95% of reachable queries. The quality drop is therefore driven by a few severe outliers (`max rank 17` at `5K`, `20` at `10K`), not by a broad collapse. A path-aware SQL rerank baseline then changed the picture materially. Keeping the same ANN seeds and the same two-hop expansion, but scoring candidates as `hop1_distance + hop2_distance`, moved the local balanced points to: - `5K`: `0.957 ms`, `hit@1 98.4%`, `hit@k 98.4%` - `10K`: `1.179 ms`, `hit@1 95.3%`, `hit@k 96.9%` That branch is now implemented in the extension and verified on both local and AWS ARM64 runs. 
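A minimal sketch of what that path-aware (pathsum) scoring contract looks like in plain SQL, assuming an illustrative `facts(subject, relation, object, embedding)` schema and pgvector-style distance syntax; the extension's fused helper implements this contract internally rather than via this literal query:

```sql
-- Illustrative schema and operators only: the facts table layout and the <=>
-- distance are stand-ins. The point is the scoring rule: hop1 + hop2, not hop2 alone.
WITH q AS (
  SELECT embedding AS v FROM facts LIMIT 1            -- stand-in query embedding
), seeds AS (
  SELECT f.subject, f.object,
         f.embedding <=> (SELECT v FROM q) AS hop1_distance
  FROM facts f
  ORDER BY hop1_distance
  LIMIT 64                                            -- ann_k seed budget
)
SELECT h2.object AS answer,
       s.hop1_distance + (h2.embedding <=> (SELECT v FROM q)) AS path_score
FROM seeds s
JOIN facts h2 ON h2.subject = s.object                -- two-hop expansion
                                                      -- (a real path also constrains h2.relation)
ORDER BY path_score                                   -- rank by summed per-hop distance
LIMIT 10;                                             -- top_k
```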
The fused path-aware helper measured: - local `5K`: `0.726 ms`, `hit@1 98.4%`, `hit@k 98.4%` - local `10K`: `0.823 ms`, `hit@1 95.3%`, `hit@k 96.9%` - AWS `5K`: `0.955 ms`, `hit@1 98.4%`, `hit@k 98.4%` - AWS `10K`: `1.185 ms`, `hit@1 96.9%`, `hit@k 98.4%` So the current strongest portable GraphRAG result is no longer the SQL baseline or the old city-only helper. It is the fused path-aware helper. ### Current real code-corpus GraphRAG reference benchmark (`cogniformerus` CrossFile) Repo-owned harnesses: - `python3 scripts/bench_graph_rag_code_corpus.py --runs 3 --backend-mode fresh --ann-k 16 --top-k 4` - `python3 scripts/repeat_graph_rag_code_corpus_builds.py --repeats 3 --runs 3 --backend-mode fresh` - `REMOTE_PYTHON=/path/to/python REPEATS=3 RUNS=3 BACKEND_MODE=fresh bash scripts/repeat_graph_rag_code_corpus_builds_aws.sh /path/to/repo 65320` This benchmark uses the actual `cogniformerus` source tree (`40` files, `840` rows after chunk + summary expansion) and the real CrossFile prompts from `butler_code_test.cr`. The current real-corpus conclusion is **not** a single universal winner. The frontier splits by embedding mode: | Mode | Best case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes | |------|-----------|:------------------------:|:----------------------:|:----------------:|:---------:|-------| | generic | `prompt_summary_snippet_py` | `0.613 ms` | `0.955 ms` | `100.0%` | `100.0%` | symbol-aware variant is strictly slower with no quality gain | | code-aware | `prompt_symbol_summary_snippet_py` | `0.963 ms` | `1.541 ms` | `100.0%` | `100.0%` | exact prompt-symbol rescue is required in summary seeding | The most important diagnostic result was the old code-aware miss: - `prompt_summary_snippet_py` on `facts_sh` - local repeated-build: `97.6%` keyword coverage, `83.3%` full hits - AWS repeated-build: `97.6%`, `83.3%` - `prompt_symbol_summary_snippet_py` on `facts_sh` - local repeated-build: `100.0%`, `100.0%` - AWS repeated-build: `100.0%`, `100.0%` So the code-aware quality win is now both repeated-build stable and cross-environment stable. The change in winner is not a local-only artifact. ### Larger in-repo `cogniformerus` transfer gate The smaller `40`-file code-corpus slice above is a useful stable benchmark, but it is not the only in-repo transfer check anymore. The same repeated-build protocol was rerun on the full `cogniformerus` repository (`183` Crystal files), still using the real CrossFile prompts from `butler_code_test.cr`. 
Control point at the old tiny-budget contract (`top_k=4`, `1` fresh build): | Mode | Best case | Local p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:---------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_summary_snippet_py` | `0.770 ms` | `87.1%` | `66.7%` | `3.67` | larger corpus exposes a result-budget cliff | | code-aware | `prompt_symbol_summary_snippet_py` | `1.824 ms` | `87.6%` | `66.7%` | `4.00` | same cliff under code-aware embeddings | Bounded recovery point (`top_k=8`, `3` fresh builds): | Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_summary_snippet_py` | `0.819 ms` | `100.0%` | `100.0%` | `6.33` | larger in-repo Crystal transfer now verified | | code-aware | `prompt_symbol_summary_snippet_py` | `1.814 ms` | `100.0%` | `100.0%` | `7.50` | same winner, but needs the larger final budget | Interpretation: - the current real code-corpus contracts do transfer beyond the tiny `40`-file slice - the dominant larger-corpus issue on the in-repo Crystal side is result budget, not a new retrieval failure - this larger-corpus Crystal gate is now covered and serves as a transfer check for the benchmark-side code-retrieval logic, not as the stable release contract for GraphRAG ### Mixed-language external code-corpus GraphRAG reference benchmark (`pycdc`) The code-corpus harness now also supports: - JSON question fixtures - configurable source extensions - quoted local `#include "..."` dependency edges for C/C++ corpora The first mixed-language adversary corpus was `pycdc`, using a repo-owned fixture in `scripts/fixtures/graph_rag_pycdc_questions.json`. This run used the real `pycdc` source tree (`138` files, `1281` rows after chunk + summary expansion, `72` local dependency edges) and `3` fresh builds at `top_k=8`. | Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_symbol_summary_snippet_py` | `0.850 ms` | `90.0%` | `60.0%` | `6.40` | fastest mixed-language point, but it does not close the corpus | | code-aware | `prompt_compactseed_require_summary_snippet_fn` | `8.006 ms` | `100.0%` | `100.0%` | `5.80` | helper-backed compact lexical seed + one-hop include rescue closes the corpus | Interpretation: - the mixed-language gate is now covered for a real `~/Projects/C` corpus - the result split is sharper than on the Crystal corpora: - the fast generic path plateaus below perfect quality - the code-aware include-rescue path closes the corpus, but at a much higher latency - so `~/Projects/C` is now covered as part of the broader code-corpus reference matrix, even though the fastest generic point remains partial ### Archive-side code-corpus GraphRAG reference benchmark (`ninja/src`) The same widened harness was then pointed at an archive-side corpus under `~/SrcArchives`: `apple/ninja/src`. This run used a second repo-owned fixture in `scripts/fixtures/graph_rag_ninja_questions.json` and the local include graph inside `ninja/src` (`103` files, `1757` rows after chunk + summary expansion, `282` dependency edges). 
Initial smoke at `top_k=8`:

| Mode | Best fast case | Local p50 | Keyword coverage | Full hits | Notes |
|------|----------------|:---------:|:----------------:|:---------:|-------|
| generic | `prompt_summary_snippet_py` | `0.898 ms` | `95.0%` | `80.0%` | already close without rescue |
| code-aware | `prompt_summary_snippet_py` | `0.928 ms` | `85.0%` | `80.0%` | code-aware mode is weaker on this corpus |

Bounded budget probe (`top_k=12`, `3` fresh builds):

| Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------|
| generic | `prompt_summary_snippet_py` | `0.914 ms` | `100.0%` | `100.0%` | `7.80` | archive-side gate closes with a small result-budget bump |
| code-aware | `prompt_summary_snippet_py` | `0.871 ms` | `85.0%` | `80.0%` | `7.60` | still not the winner on this corpus |

Interpretation:

- the `~/SrcArchives` side is now covered by a real repeated-build gate
- unlike `pycdc`, this archive corpus does **not** need a dependency-rescue contract to close
- the winner is the simple generic summary-snippet path at a slightly larger final budget (`top_k=12`)
- the larger real-corpus verification matrix for the narrow `0.13` fact-graph release now spans:
  - `~/Projects/Crystal`
  - `~/Projects/C`
  - `~/SrcArchives`

### External folding stress corpus for GraphRAG reference logic (`folding/src`)

The same harness was then pointed at a second real code corpus outside this repository: `folding/src`, with prompts from `butler_folding_test.cr`. This is not the primary publishable frontier for the repository, but it is a strong adversary corpus because it falsifies overfit retrieval contracts quickly.

Current repeated-build results:

| Mode | Case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes |
|------|------|:------------------------:|:----------------------:|:----------------:|:---------:|-------|
| generic | `prompt_summary_snippet_py` | `1.048 ms` | `1.540 ms` | `90.5%` | `83.3%` | fast baseline drifts below perfect quality on this corpus |
| generic | `prompt_compactseed_require_summary_snippet_fn` | `5.940 ms` | `8.839 ms` | `100.0%` | `100.0%` | compact lexical seed table + helper-backed one-hop `REQUIRES_FILE` rescue |
| generic | `prompt_lexseed_require_summary_snippet_fn` | `28.266 ms` | `41.960 ms` | `100.0%` | `100.0%` | historical full-summary lexical rescue, now dominated |
| code-aware | `prompt_summary_snippet_py` | `1.080 ms` | `1.775 ms` | `79.8%` | `66.7%` | worse baseline than the primary `cogniformerus` corpus |
| code-aware | `prompt_compactseed_require_summary_snippet_fn` | `5.804 ms` | `8.392 ms` | `100.0%` | `100.0%` | compact lexical seed table + helper-backed one-hop `REQUIRES_FILE` rescue |
| code-aware | `prompt_lexseed_require_summary_snippet_fn` | `36.676 ms` | `60.457 ms` | `100.0%` | `100.0%` | historical full-summary lexical rescue, now dominated |

Interpretation:

- the external folding miss was a real seed-selection problem, not a snippet extraction bug
- the rescue is now verified on both local Apple Silicon and AWS ARM64
- the current documented external rescue is no longer the old full-summary lexical path; it is the compact lexical-seed table variant
- compact lexical seeding keeps `100.0% / 100.0%` while cutting the old rescue cost by about `4.8x` locally and `4.7-7.2x` on AWS, depending on mode
- an isolated local timing split shows the helper-backed rescue cost is still dominated by lexical-seed + `REQUIRES_FILE` fetch work (`~10.7-11.0 ms/query`), with snippet postprocessing as a secondary cold-start cost (`~7.7-8.0 ms/query`)
- the old full-summary lexical rescue remains useful as a diagnostic, but it is no longer the external default frontier
- even the compact rescue is still slower than the primary in-repo winners, so it does not replace them as the default GraphRAG contract

### Legacy/manual IVF-PQ benchmark

The sections below are still useful for the explicit IVF-PQ API (`svec_ann_scan`), but they are no longer the default ANN baseline for the repository. These measurements target the legacy/manual vector path, not the planner-integrated `sorted_hnsw` Index AM.

All IVF-PQ benchmarks below use `svec_ann_scan` (C-level) with residual PQ. 1 Gi k8s pod, PostgreSQL 18.

### 103K vectors, 2880-dim (Gutenberg corpus)

Residual PQ (M=720, dsub=4), 256 IVF partitions. 100 cross-queries (self-match excluded):

| Config | R@1 | Recall@10 | Avg latency |
|---|---|---|---|
| nprobe=1, PQ-only | 54% | 48% | 5.5 ms |
| nprobe=3, PQ-only | 79% | 71% | 8 ms |
| nprobe=3, rerank=96 | 82% | 74% | 10 ms |
| nprobe=5, rerank=96 | 89% | 86% | 12 ms |
| nprobe=10, rerank=200 | 97% | 94% | 22 ms |

Self-query (vector in dataset): R@1 = 100% at nprobe=3 / 8 ms.

### 10K vectors, 2880-dim (float32 precision test)

Same corpus, pure svec (float32), nlist=64, M=720 residual PQ. 100 cross-queries:

| Config | R@1 | Recall@10 |
|---|---|---|
| nprobe=1, PQ-only | 56% | 56% |
| nprobe=3, PQ-only | 72% | 82% |
| nprobe=5, rerank=96 | 93% | 93% |
| nprobe=10, rerank=200 | 99% | 99% |

### float32 vs float16 precision impact

Tested the same 10K Gutenberg vectors in two configurations:

- **float32 (svec):** native 32-bit storage, independently trained codebooks
- **float16-degraded (hsvec):** svec → hsvec → svec roundtrip, independently trained

Result: **no measurable recall difference**. Float16 precision loss (~1e-7) is 1000× smaller than typical distance gaps between consecutive neighbors (~1e-4). The recall bottleneck is PQ quantization and IVF routing, not input precision. This confirms hsvec is a safe storage choice for ANN workloads.

### CRUD performance (500K rows, svec(128), prepared mode)

Throughput is reported relative to a plain heap table (100% = heap parity).

| Operation | eager (% of heap) | lazy (% of heap) | Notes |
|-----------|:-----------------:|:----------------:|-------|
| SELECT PK | 85% | 85% | Index Scan via btree |
| SELECT range 1K | 97% | -- | Custom Scan pruning (eager only) |
| Bulk INSERT | 100% | 100% | Always eager |
| DELETE + INSERT | 63% | 63% | INSERT always eager |
| UPDATE non-vec | 46% | **100%** | Lazy skips zone map flush |
| UPDATE vec col | 102% | **100%** | Parity both modes |
| Mixed OLTP | 83% | **97%** | Near-parity with lazy |

Eager mode (default) maintains zone maps on every UPDATE for scan pruning. Lazy mode (`sorted_heap.lazy_update = on`) trades scan pruning for UPDATE parity with heap. Compact/merge restores pruning.

### Self-query vs cross-query

**Self-query**: the query vector is in the dataset (the typical RAG case -- you embedded documents, now you search them). The vector is always found as its own closest neighbor, so R@1 = 100%.

**Cross-query**: the query vector is NOT in the dataset (e.g., a user question embedded at search time). R@1 depends on nprobe and PQ fidelity.

When comparing benchmarks, verify whether self-match is included or excluded. The tables above use cross-query (self-match excluded) for honest comparison.
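A minimal numpy sketch of the distinction under brute-force cosine ranking; the array sizes and seed are hypothetical, not the benchmark corpus. When the query row is drawn from the dataset, dropping the self-match from the ranking (the `lim := 11`, positions 2–11 convention described in the methodology notes below) is what turns a trivially perfect R@1 into a meaningful cross-query number.

```python
# Sketch of self-query vs cross-query ranking with brute-force cosine.
# `vectors` is a hypothetical (n, d) array of unit-normalised embeddings;
# queries are drawn from the dataset itself, as in the tables above.
import numpy as np

def top_k(vectors: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest rows by cosine similarity (rows pre-normalised)."""
    sims = vectors @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

qi = 42                                        # query drawn from the dataset itself
ranked = top_k(vectors, vectors[qi], k=11)     # ask for k+1, analogous to lim := 11

self_query_top1 = ranked[0]                    # == qi: self-query R@1 is trivially 100%
cross_query_top10 = ranked[ranked != qi][:10]  # drop the self-match -> positions 2-11
print(self_query_top1 == qi, cross_query_top10)
```

The same ranking produces both numbers; the only difference is whether the query's own row is allowed to count as a hit.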
---

## Methodology notes

- **EXPLAIN ANALYZE:** warm cache (pg_prewarm), average of 5 runs, actual execution time + buffer reads reported
- **pgbench:** 10 s runtime, 1 client, includes pgbench overhead (connection management, query dispatch); useful for relative throughput comparison
- **INSERT:** COPY path via `INSERT ... SELECT generate_series()`
- **Compact time:** wall-clock time for `sorted_heap_compact()` on warm data
- **Vector search:** 100 random queries from the dataset, self-match excluded by requesting `lim := 11` and taking positions 2–11. Ground truth via exact brute-force cosine (`<=>` operator). Latency measured via `clock_timestamp()` per query in a PL/pgSQL loop (20 queries, warm cache)
- **Current local `sorted_hnsw` comparison:** deterministic synthetic 10K x 384D corpus via `scripts/bench_sorted_hnsw_vs_pgvector.sh`, 3 fresh builds for PostgreSQL methods, median p50 / median recall reported (aggregation sketched below); Qdrant via `scripts/bench_qdrant_synthetic.py`, 3 warm measurement passes on one local Docker collection; zvec via `scripts/bench_zvec_synthetic.py`, 3 warm measurement passes on one local in-process collection
- **Current local real-dataset sample:** `scripts/bench_ann_real_dataset.py` on ANN-Benchmarks `nytimes-256-angular`, sampled to 10K base vectors and 20 queries. Ground truth is exact PostgreSQL `svec` heap search on the sampled corpus. Numbers above are medians across 3 full harness runs.
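As a small illustration of the median-across-builds aggregation mentioned in the last two bullets (hypothetical data and helper names; the real logic lives in the `scripts/` harnesses): per-build p50 latency and mean recall are computed first, then the reported figure is the median across the fresh builds.

```python
# Illustrative aggregation behind "median p50 / median recall across 3 fresh builds".
# `builds` is hypothetical data: per-build per-query latencies (ms) and recall@k values.
from statistics import median

def report(builds: list[dict]) -> tuple[float, float]:
    """Return (median of per-build p50 latency, median of per-build mean recall)."""
    build_p50s = [median(b["latencies_ms"]) for b in builds]
    build_recalls = [sum(b["recalls"]) / len(b["recalls"]) for b in builds]
    return median(build_p50s), median(build_recalls)

builds = [
    {"latencies_ms": [1.2, 1.3, 1.4], "recalls": [1.0, 0.9, 1.0]},
    {"latencies_ms": [1.1, 1.5, 1.3], "recalls": [1.0, 1.0, 1.0]},
    {"latencies_ms": [1.3, 1.2, 1.6], "recalls": [0.9, 1.0, 1.0]},
]
print(report(builds))  # -> (1.3, 0.9666...)
```

Taking the median across builds rather than the best build keeps a single lucky graph construction from flattering either method.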