---
layout: default
title: Benchmarks
nav_order: 6
---

# Benchmarks

Unless noted otherwise, benchmarks in this section run on PostgreSQL 18, Apple M-series (12 CPU, 64 GB RAM) with `shared_buffers=4GB`, `work_mem=256MB`, `maintenance_work_mem=2GB`. Warm cache, average of 5 runs.

---

## EXPLAIN ANALYZE -- query latency and buffer reads

### 1M rows (71 MB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.035 ms / 1 buf | 0.046 ms / 7 bufs | 15.2 ms / 6,370 bufs |
| Narrow (100 rows) | 0.043 ms / 2 bufs | 0.067 ms / 8 bufs | 16.2 ms / 6,370 bufs |
| Medium (5K rows) | 0.434 ms / 33 bufs | 0.492 ms / 52 bufs | 16.1 ms / 6,370 bufs |
| Wide (100K rows) | 7.5 ms / 638 bufs | 8.9 ms / 917 bufs | 17.4 ms / 6,370 bufs |

### 10M rows (714 MB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.034 ms / 1 buf | 0.047 ms / 7 bufs | 117.9 ms / 63,695 bufs |
| Narrow (100 rows) | 0.037 ms / 1 buf | 0.062 ms / 7 bufs | 130.9 ms / 63,695 bufs |
| Medium (5K rows) | 0.435 ms / 32 bufs | 0.549 ms / 51 bufs | 131.0 ms / 63,695 bufs |
| Wide (100K rows) | 7.6 ms / 638 bufs | 8.8 ms / 917 bufs | 131.4 ms / 63,695 bufs |

### 100M rows (7.8 GB)

| Query | sorted_heap | heap + btree | heap seqscan |
|-------|------------|-------------|-------------|
| Point (1 row) | 0.045 ms / **1 buf** | 0.506 ms / 8 bufs | 1,190 ms / 519,906 bufs |
| Narrow (100 rows) | 0.166 ms / 2 bufs | 0.144 ms / 9 bufs | 1,325 ms / 520,782 bufs |
| Medium (5K rows) | 0.479 ms / 38 bufs | 0.812 ms / 58 bufs | 1,326 ms / 519,857 bufs |
| Wide (100K rows) | 7.9 ms / 737 bufs | 10.1 ms / 1,017 bufs | 1,405 ms / 518,896 bufs |

At 100M rows, a point query reads **1 buffer** (vs 8 for btree, 519,906 for sequential scan).

---

## pgbench throughput (TPS)

### Prepared mode (`-M prepared`)

Query planned once, re-executed with parameters. 10 s, 1 client.

| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|-------|----------------:|------------------:|-------------------:|
| Point | 46.9K / 59.4K | 46.5K / 58.0K | 32.6K / 43.6K |
| Narrow | 22.3K / 29.1K | 22.5K / 28.8K | 17.9K / 18.1K |
| Medium | 3.4K / 5.1K | 3.4K / 4.8K | 2.4K / 2.4K |
| Wide | 295 / 289 | 293 / 286 | 168 / 157 |

### Simple mode (`-M simple`)

Each query parsed, planned, and executed separately. 10 s, 1 client.

| Query | 1M (sh / btree) | 10M (sh / btree) | 100M (sh / btree) |
|-------|----------------:|------------------:|-------------------:|
| Point | 28.4K / 38.0K | 29.1K / 41.4K | **18.7K / 4.6K** |
| Narrow | 19.6K / 24.4K | 21.8K / 27.6K | **7.1K / 5.5K** |
| Medium | 3.1K / 3.7K | 3.4K / 4.8K | **2.1K / 1.6K** |
| Wide | 198 / 290 | 200 / 286 | **163 / 144** |

At 100M rows in simple mode, sorted_heap wins all query types. Point queries reach 18.7K TPS vs 4.6K for btree (4x).

---

## INSERT and compaction throughput

| Scale | sorted_heap | heap + btree | heap (no index) | compact time |
|------:|------------:|-----------:|--------------:|-----------:|
| 1M | 923K rows/s | 961K | 1.91M | 0.3 s |
| 10M | 908K rows/s | 901K | 1.65M | 3.1 s |
| 100M | 840K rows/s | 1.22M | 2.22M | 41.3 s |

### Storage

| Scale | sorted_heap | heap + btree | heap (no index) |
|------:|------------:|------------:|--------------:|
| 1M | 71 MB | 71 MB (50 + 21) | 50 MB |
| 10M | 714 MB | 712 MB (498 + 214) | 498 MB |
| 100M | 7.8 GB | 7.8 GB (5.7 + 2.1) | 5.7 GB |

sorted_heap keeps its data and zone map in a single relation, with no separate index to maintain.
The zone map replaces the btree index, and at the measured scales it costs about as much space as the btree it replaces, so total storage is comparable to heap + btree rather than heap-without-index.

---

## Vector search

### Current local synthetic benchmark (`sorted_hnsw` Index AM)

Repo-owned harnesses:

- `python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473`
- `REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh /path/to/repo /path/to/dump 65485`
- `scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96`
- `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64`
- `python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`
- `python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`

### Current AWS restored-corpus benchmark (`~104K x 2880D`, Gutenberg dump)

AWS ARM64 host (4 vCPU, 8 GiB RAM), top-10, restored PostgreSQL custom dump. Ground truth is recomputed by exact heap search on the restored `svec` table. In the current rerun the stored `bench_hnsw_gt` table matched the exact heap GT on 100% of the 50 benchmark queries, so the fresh exact heap GT and the historical GT table agree. This rerun uses `sorted_hnsw` `ef_construction=200` and `ef_search=32`, and the harness reconnects after build before timing ordered scans.

| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 458.762 ms | 100.0% | brute-force GT on restored corpus |
| **`sorted_hnsw` (`svec`)** | **1.287 ms** | **100.0%** | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1902 MB |
| `sorted_hnsw` (`hsvec`) | 1.404 ms | 100.0% | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1032 MB |
| pgvector HNSW (`halfvec`) | 2.031 ms | 99.8% | `ef_search=64`, index 804 MB, total 1615 MB |
| zvec HNSW | 50.499 ms | 100.0% | in-process collection, `ef=64`, ~1.12 GiB on disk |
| Qdrant HNSW | 6.028 ms | 99.2% | local Docker on same AWS host, `hnsw_ef=64`, 103,260 points |

The precision-matched PostgreSQL comparison on Gutenberg is now `sorted_hnsw (hsvec)` vs pgvector `halfvec`: `1.404 ms @ 100.0%` versus `2.031 ms @ 99.8%`, with total footprint `1032 MB` versus `1615 MB`. The raw fastest PostgreSQL row on this corpus is still `sorted_hnsw (svec)` at `1.287 ms`, but that uses float32 source storage. `sorted_hnsw` keeps the same 404 MB index in both cases because the AM stores SQ8 graph state; the storage gain from `hsvec` appears in the base table and TOAST footprint instead.

### Current local synthetic benchmark results (`10K x 384D`)

Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median `p50` / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.
| Method | p50 latency | Recall@10 | Notes | |--------|:-----------:|:---------:|-------| | Exact heap (`svec`) | 2.03 ms | 100% | Brute-force ground truth | | **sorted_hnsw** | **0.158 ms** | **100%** | `shared_cache=on`, `ef_search=96`, index ~5.4 MB | | pgvector HNSW (`vector`) | 0.446 ms | 90% median (90-95 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~2.0 MB | | zvec HNSW | 0.611 ms | 100% | local in-process collection, `ef=64` | | Qdrant HNSW | 1.94 ms | 100% | local Docker, `hnsw_ef=64` | ### Current local real-dataset sample (`nytimes-256-angular`) Repo-owned harness: - `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64` ANN-Benchmarks `nytimes-256-angular`, sampled to 10K base vectors and 20 queries, top-10. The table below reports medians across 3 full harness runs. Ground truth comes from exact PostgreSQL heap search on the sampled `svec` corpus. | Method | p50 latency | Recall@10 | Notes | |--------|:-----------:|:---------:|-------| | Exact heap (`svec`) | 1.557 ms | 100% | brute-force ground truth | | **sorted_hnsw** | **0.327 ms** | **85.0% median** (83.5-85.5 range) | `shared_cache=on`, `ef_search=96`, index ~4.1 MB | | pgvector HNSW (`vector`) | 0.751 ms | 79.0% median (78.5-79.0 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~13 MB | | zvec HNSW | 0.403 ms | 99.5% | local in-process collection, `ef=64`, ~14.1 MB on disk | | Qdrant HNSW | 1.704 ms | 99.5% | local Docker, `hnsw_ef=64` | This corpus is materially harder than the deterministic synthetic one. It is a better signal for default-parameter recall, while the synthetic table remains useful for controlled same-host engine comparisons and regression tracking. ### Current local GraphRAG benchmark (`person -> parent -> city`, stable fact contract) Repo-owned harness: - `python3 scripts/bench_graph_rag_multihop.py --num-pairs 5000 --query-count 64 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --pgv-ef-search 64 --zvec-ef 64 --qdrant-ef 64 --shared-buffers-mb 64 --backend-mode fresh` Deterministic fact graph, `5K` chains / `10K` rows total, `384D`, top-10. This is the current balanced stable GraphRAG point for the narrow fact-shaped fact-retrieval workload. The stable-facing SQL entry point is `sorted_heap_graph_rag(...)`; the table below also shows the underlying helper and wrapper paths because the harness measures the dispatched execution path directly. 
| Method | p50 latency | hit@1 | hit@k | Notes | |--------|:-----------:|:-----:|:-----:|-------| | Heap two-hop SQL | 0.762 ms | 75.0% | 96.9% | exact rerank over expanded heap set | | sorted_heap_graph_rag_twohop_scan() | 0.720 ms | 75.0% | 96.9% | old city-only rerank contract | | sorted_heap_expand_twohop_rerank() | 0.712 ms | 75.0% | 96.9% | same city-only seed point | | sorted_heap SQL `pathsum` baseline | 0.847 ms | 98.4% | 98.4% | `hop1_distance + hop2_distance` in SQL | | sorted_heap_graph_rag_twohop_path_scan() | 0.739 ms | 98.4% | 98.4% | fused path-aware wrapper | | **sorted_heap_expand_twohop_path_rerank()** | **0.726 ms** | **98.4%** | **98.4%** | same knobs, fused path-aware helper | | pgvector HNSW + heap expansion | 2.588 ms | 90.6% | 90.6% | path-aware rerank, `ef_search=64` | | zvec HNSW + heap expansion | 2.507 ms | 100.0% | 100.0% | path-aware rerank, `ef=64` | | Qdrant HNSW + heap expansion | 4.947 ms | 100.0% | 100.0% | path-aware rerank, `hnsw_ef=64` | The path-aware helper changes the local conclusion materially: the dominant quality issue on this fact-shaped workload was the old hop-2-only rerank contract, not seed ANN quality. The fused path-aware helper now gives the best local latency/quality point at the same `m=24`, `ef_construction=200`, `ann_k=64`, `ef_search=128` operating point. Under the same path-aware scorer contract, the current local conclusion gets sharper: `sorted_heap` keeps the latency lead, while `zvec` and Qdrant reach the strongest observed answer quality on this deterministic fact graph. A repeated-build local protocol then quantified how much of this is just single-run luck. Using three independent fresh builds of the same `5K`/`384D` balanced point: - `sorted_heap_expand_twohop_path_rerank()` median `0.798 ms`, range `0.771-0.819 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` on every build - `sorted_heap_graph_rag_twohop_path_scan()` median `0.796 ms`, range `0.778-0.804 ms`, `hit@1 = 98.4%`, `hit@k = 98.4%` on every build - `pgvector` path-aware parity row median `1.405 ms`, `hit@1/hit@k` `85.9-89.1%` - `zvec` path-aware parity row median `1.076 ms`, `100.0% / 100.0%` - `Qdrant` path-aware parity row median `2.799 ms`, `100.0% / 100.0%` So on the local balanced point, the current `sorted_heap` path-aware rows are not a fragile one-off. The latency band is tight, and the answer quality did not drift across the three rebuilds. ### `shared_cache=on` vs `off` for multihop (post-fix, 0490cc4) A multi-index shared-cache corruption bug was fixed in commit `0490cc4`. Previously, when two or more `sorted_hnsw` indexes existed in the same database and `shared_cache=on`, a publish for the second index could overwrite shared memory that an attached cache for the first index was still reading through bare pointers. This caused silent data corruption and 0% retrieval quality on the affected index. The fix deep-copies all data (L0 neighbors, SQ8 vectors, upper-level neighbor slabs) from shared memory into local palloc'd buffers on attach. The exact failure mode is regression-guarded by the multi-index overwrite phase (B5) in `scripts/test_hnsw_chunked_cache.sh`. 
Post-fix verification on the `5K x 384D` multihop benchmark (fresh backends, 3 runs, `ann_k=64`, `ef_search=128`, `m=24`): ``` python3 scripts/bench_graph_rag_multihop.py \ --num-pairs 5000 --dim 384 --query-count 64 --runs 3 \ --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \ --shared-cache {on,off} --backend-mode fresh \ --skip-zvec --skip-qdrant --skip-pgvector ``` | Method | shared_cache | p50 | hit@1 | hit@k | |--------|:----------:|:------:|:-----:|:-----:| | sorted_heap_expand_twohop_path_rerank() | off | 0.780 ms | 98.4% | 98.4% | | sorted_heap_expand_twohop_path_rerank() | on | 0.766 ms | 98.4% | 98.4% | | sorted_heap_graph_rag_twohop_path_scan() | off | 0.804 ms | 98.4% | 98.4% | | sorted_heap_graph_rag_twohop_path_scan() | on | 0.786 ms | 98.4% | 98.4% | Also verified at `10K x 384D`: ``` python3 scripts/bench_graph_rag_multihop.py \ --num-pairs 10000 --dim 384 --query-count 64 --runs 3 \ --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 \ --shared-cache {on,off} --backend-mode fresh \ --skip-zvec --skip-qdrant --skip-pgvector ``` | Method | shared_cache | p50 | hit@1 | hit@k | |--------|:----------:|:------:|:-----:|:-----:| | sorted_heap_graph_rag_twohop_path_scan() | off | 0.904 ms | 95.3% | 96.9% | | sorted_heap_graph_rag_twohop_path_scan() | on | 0.972 ms | 95.3% | 96.9% | Quality is identical between `on` and `off` at both measured scales. Latency at these measured scales is mixed rather than a clear universal win. `shared_cache=off` is no longer needed as a correctness workaround. **Larger-scale cold-start measurement (`100K x 32D`, 200K rows, 47 MB index):** The deep-copy fix means attach now copies all L0 neighbors (~38 MB), SQ8 data (~6 MB), and node metadata (~6 MB) from shared memory into local buffers on each fresh backend's first query. At 200K nodes, this upfront cost exceeds the lazy page-decode path used by `shared_cache=off`: ``` # 100K pairs, 32D, m=24, ef_construction=200, ef_search=128, 20 fresh backends per mode # PG buffer cache warm (shared_buffers=256MB): shared_cache=off: cold KNN p50=6.31ms shared_cache=on: cold KNN p50=8.68ms # PG buffer cache cold (PG restarted between modes, shared_buffers=64MB): shared_cache=off: cold KNN first=5.3ms p50=5.7ms shared_cache=on: cold KNN first=8.6ms p50=9.5ms ``` The `off` path loads L0 pages lazily (only pages visited by beam search). The `on` path deep-copies all bulk data upfront under LWLock. At 200K nodes, the upfront copy dominates. Quality remains identical. Current conclusion: `shared_cache=on` is a correctness-safe default but not a performance feature at the measured scales. The deep-copy overhead from the multi-index fix neutralized the original latency benefit. ### Synthetic multi-hop depth scaling (`relation_path` depth 1..5) Repo-owned harness: - `python3 scripts/bench_graph_rag_multidepth.py --num-pairs 5000 --max-depth 5 --query-count 32 --runs 3 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 200 --m 24 --shared-buffers-mb 64` Deterministic chain-shaped fact graph, `25K` rows total, `384D`, top-10, fresh backend, path-aware scorer. 
This is a narrow scaling check for the generic unified syntax: - `relation_path := ARRAY[1]` - `relation_path := ARRAY[1,2]` - `relation_path := ARRAY[1,2,3]` - `relation_path := ARRAY[1,2,3,4]` - `relation_path := ARRAY[1,2,3,4,5]` | Method | depth 1 | depth 2 | depth 3 | depth 4 | depth 5 | Quality | |--------|:-------:|:-------:|:-------:|:-------:|:-------:|---------| | Heap SQL path baseline | 0.573 ms | 0.589 ms | 0.613 ms | 0.622 ms | 0.622 ms | `100.0% / 100.0%` | | sorted_heap SQL path baseline | 0.774 ms | 0.732 ms | 0.776 ms | 0.712 ms | 0.772 ms | `100.0% / 100.0%` | | **sorted_heap_graph_rag(..., score_mode := 'path')** | **0.674 ms** | **0.651 ms** | **0.643 ms** | **0.633 ms** | **0.649 ms** | **`100.0% / 100.0%`** | On this synthetic chain benchmark, the unified path-aware function does not show a latency cliff through depth `5`; it stays in the same `~0.63-0.67 ms` band while preserving `100.0% / 100.0%` quality. This is a controlled scaling signal, not a claim about arbitrary deep graph workloads. ### Larger synthetic GraphRAG scale envelope (`1M` and `10M`) Repo-owned harnesses: - local: `python3 scripts/bench_graph_rag_multidepth.py --num-pairs 200000 --max-depth 5 --query-count 8 --runs 1 --dim 384 --ann-k 64 --top-k 10 --ef-search 128 --ef-construction 64 --m 24 --shared-buffers-mb 64 --max-wal-size-gb 32 --maintenance-work-mem-mb 4096 --table-scope sorted_heap_only` - AWS: `NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=1 RUNS=1 DIM=32 ANN_K=16 TOP_K=10 EF_SEARCH=32 EF_CONSTRUCTION=8 M=8 SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 MAINTENANCE_WORK_MEM_MB=2048 TABLE_SCOPE=sorted_heap_only ./scripts/bench_graph_rag_multidepth_aws.sh /path/to/repo 65494` Current verified scale picture: - local `1M` rows (`200K` pairs x `5` hops, `384D`, cheap-build point): - `generate_csv`: `162.667 s` - `load_data`: `45.694 s` - `build_indexes`: `377.968 s` - unified path-aware `sorted_heap_graph_rag(...)` p50: - depth `1`: `1.250 ms` - depth `2`: `1.429 ms` - depth `3`: `1.777 ms` - depth `4`: `2.181 ms` - depth `5`: `2.604 ms` - quality stayed `100.0% / 100.0%` at every depth - local `10M x 384D` no longer dies in Python generation after the streaming CSV change; the process stayed near `31 MiB RSS` on the early run where the old generator previously blew up. - local `10M x 64D` and AWS `10M x 32D` both survive generation and load, but the first practical frontier is still the single `sorted_hnsw CREATE INDEX` build, not query execution. 
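For reference, the unified path-aware call exercised at each depth above, and again in the per-depth `sorted_heap_graph_rag(...)` p50 bullets just listed, has roughly this shape. Only the function name, `relation_path`, and `score_mode := 'path'` come from these runs; the remaining parameter names and the stand-in query embedding are illustrative assumptions, not the documented signature:

```sql
-- Hypothetical call shape only. relation_path and score_mode := 'path' are taken
-- from the benchmark text; every other parameter name here is an assumption.
SELECT *
FROM sorted_heap_graph_rag(
       'facts_sh',                                   -- concrete sorted_heap fact table
       (SELECT embedding FROM facts_sh LIMIT 1),     -- stand-in query embedding
       relation_path := ARRAY[1, 2, 3],              -- follow relation 1, then 2, then 3
       score_mode    := 'path',                      -- rank by summed per-hop distance
       top_k         := 10);
```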
The strongest bounded AWS signal so far is: - `10M x 32D` (`2M` pairs x `5` hops, ultralight build point) - `generate_csv`: `223.114 s` - CSV size: `3.63 GiB` - `load_data`: `485.177 s` - on the old build path, it then entered a long `CREATE INDEX` that was still active after `~11-12` minutes in the index-build phase - temp-cluster footprint reached about `17-18 GiB` with `11-13 GiB` disk still free on the AWS host After removing the per-search `visited[]` allocation/zeroing from the hot HNSW build loop, the same local diagnostic point that previously established the bottleneck changed materially: - local `500K x 32D` (`100K` pairs x `5` hops, `m=8`, `ef_construction=8`) - old total build: `18.1-18.7 s` - old graph-construction phase: about `18.27 s` - new total build: `2.777-2.996 s` - new graph-construction phase: about `2.42-2.59 s` - page-writing tail stayed small (`~0.20-0.24 s`) That moved the AWS `10M x 32D` scale branch from "still stuck in CREATE INDEX" to the first real `10M` query pass on the same cheap-build point: - `generate_csv`: `222.814 s` - `load_data`: `468.923 s` - `build_indexes`: `212.553 s` - `analyze`: `14.562 s` - unified path-aware `sorted_heap_graph_rag(...)` p50: - depth `1`: `2.124 ms` - depth `2`: `3.433 ms` - depth `3`: `4.611 ms` - depth `4`: `5.846 ms` - depth `5`: `5.139 ms` This is **not** a release-quality GraphRAG point. The build/query knobs here were intentionally ultra-light (`ann_k=16`, `ef_search=32`, `ef_construction=8`, `m=8`) to get the first `10M` latency envelope. At those settings, answer quality on the single-query AWS probe was poor (`0.0% / 0.0%`) and should be treated as a scale smoke, not a publishable quality frontier. I then kept that AWS temp cluster alive and ran query-only sweeps on the same cheap-built `10M x 32D` graph to test whether query-time tuning alone could recover depth-`5` quality without paying for another `CREATE INDEX`. It did not: - `ef_search=128`, `ann_k=64` -> depth-`5` `2.626 ms`, `0.0% / 0.0%` - `ef_search=128`, `ann_k=128` -> depth-`5` `3.040 ms`, `0.0% / 0.0%` - `ef_search=256`, `ann_k=128` -> depth-`5` `3.284 ms`, `0.0% / 0.0%` - `ef_search=256`, `ann_k=256` -> depth-`5` `4.076 ms`, returned `10` rows, but still `0.0% / 0.0%` - one pathological point, `ef_search=32`, `ann_k=64`, spiked to `1833.809 ms` and `259709.5` shared reads while still returning `0.0% / 0.0%` I then ran the same question on a smaller but shape-matched local `1M x 32D` graph (`200K` pairs x `5` hops) to separate build quality from query contract: - cheap local build points (`m=8`, `ef_construction=8/32`, `m=16`, `ef_construction=32/64`) stayed weak at the old narrow query contract (`ann_k=64`, `top_k=10`), topping out around `25.0% hit@k` - but with a wider query contract on the same `1M x 32D` graph (`ann_k=256`, `top_k=32`, `ef_search=128`), both - the strong build (`m=24`, `ef_construction=200`), and - the much cheaper build (`m=16`, `ef_construction=64`) reached the same `96.9% hit@k` over `32` queries - exact heap seeds on that `1M x 32D` graph matched the ANN result exactly at the widened point, so once the query budget is large enough, heavier builds were no longer the decisive lever there That suggested a second AWS `10M x 32D` probe on the same cheap build but with the widened local winner (`ann_k=256`, `top_k=32`, `ef_search=128`). 
The result still failed: - ANN path-aware `sorted_heap_graph_rag(...)`: - `top_k=10` -> `1832.345 ms`, `0.0% / 0.0%` - `top_k=32` -> `1832.755 ms`, `0.0% / 0.0%` - exact heap seeds + the same path-aware expansion/rerank contract: - `ann_k=256`, `top_k=32` -> `0.0% / 0.0%` So the cheap-build `10M x 32D` graph can produce latency numbers, but its quality is not recoverable by either: - larger query-time ANN budgets, or - exact heap seeds on the same low-dimensional corpus The `10M x 32D` frontier is therefore no longer "weak HNSW build quality." It is the low-dimensional scale contract itself. The next meaningful scale branch is a higher-dimensional `10M` point or a different retrieval contract, not another `m/ef_construction` tweak on the same `32D` setup. The first higher-dimensional calibration point was `64D`, and it changed the local picture immediately. On a local `1M x 64D` graph (`200K` pairs x `5` hops), a relatively cheap build (`m=16`, `ef_construction=64`) plus the wider query contract (`ann_k=256`, `top_k=32`, `ef_search=128`) gave: - ANN path-aware `sorted_heap_graph_rag(...)`: `65.6% hit@1`, `96.9% hit@k` - exact heap seeds + the same path-aware expansion/rerank contract: `65.6% hit@1`, `96.9% hit@k` At `top_k=10`, the same `1M x 64D` point already held `96.9% hit@k`, so unlike `32D`, the result-budget cliff largely disappeared there. The first file-backed `10M x 64D` AWS attempt on the current `ubuntu@dev.rigelstar.com` host (`4 vCPU`, `8 GiB RAM`) failed for operational reasons before query timing: - `generate_csv`: `414.397 s` - CSV size: `6.67 GiB` - temp dir grew to about `28 GiB` - `/tmp` fell to `1.3 GiB` free (`99%` used) while the run was still active That led to a footprint reduction pass in the harness: - stream fact rows directly into `COPY` instead of materializing a giant CSV - drop `facts_heap` after loading `facts_sh` in `sorted_heap_only` mode - allow query-only reuse from a kept temp cluster - when `facts_heap` is not needed, copy rows straight into `facts_sh` before `sorted_heap_compact(...)` instead of staging through `facts_heap` and `INSERT .. ORDER BY` That direct `facts_sh` load path held up on bounded local checks: - `200K` rows (`40K` pairs x `5` hops, `64D`, `hop_weight=0.05`) - old staged path: `6321.453 ms` load, depth-5 unified GraphRAG `24.504 ms`, `100.0% / 100.0%` - direct `COPY facts_sh`: `5638.134 ms` load, depth-5 unified GraphRAG `24.312 ms`, `100.0% / 100.0%` - `1M` rows (`200K` pairs x `5` hops, `64D`, load only) - old staged path: `31.392 s` - direct `COPY facts_sh`: `28.231 s` So the current conclusion is narrow but useful: for synthetic `sorted_heap_only` multidepth runs, direct `COPY facts_sh` is a real `~10%` ingest win without giving up the compacted query-time locality that the earlier "skip compaction" falsifier showed we still need. I also bounded the obvious follow-up on the same direct ordered load: parameterizing the post-load maintenance step as `none`, `merge`, or `compact`. - `200K` rows (`40K` pairs x `5` hops, `64D`, `hop_weight=0.05`) - `none`: `5159.888 ms` load, depth-5 unified GraphRAG `62.148 ms` - `merge`: `5749.190 ms` load, depth-5 unified GraphRAG `23.277 ms` - `compact`: `5626.887 ms` load, depth-5 unified GraphRAG `24.591 ms` - `1M` rows (`200K` pairs x `5` hops, `64D`, load only) - `none`: `25.820 s` - `merge`: `28.142 s` - `compact`: `28.108 s` So `merge` is now exposed as an experiment knob in the multidepth harness, but it is not a proven new default. 
The current evidence says: - `none` is too expensive at query time - `merge` is viable on ordered synthetic loads - `merge` does not materially beat `compact` on the larger `1M` load point That leaves `compact` as the stable default for large-scale multidepth runs. With that lower-footprint path, the same `10M x 64D` AWS point advanced materially further on the same host: - streamed `generate_csv`: `0.000 s` (`csv_bytes=0`, no materialized CSV) - `load_data`: `916.030 s` - temp dir plateaued around `19 GiB` - filesystem still had about `11 GiB` free at the start of `CREATE INDEX` The next frontier was no longer disk headroom. On the same `10M x 64D` cheap-build point (`m=16`, `ef_construction=64`), the old local scan-cache seeding path then failed inside PostgreSQL: - `CREATE INDEX facts_sh_ann_idx ON facts_sh USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 64)` - `ERROR: invalid memory alloc request size 1280000000` That request size matched the old contiguous local L0 neighbor cache layout (`n_nodes * 2 * M * sizeof(int32)` at `10M x M=16`), so the next fix was to replace local `l0_neighbors + sq8_data` slabs with page-backed storage. Shared immutable scan caches remain contiguous; local build seeding and local `shnsw_load_cache()` no longer require a single giant L0 allocation. With the chunked local-cache fix in place, the next retained branch was the constrained-memory rerun via: - `sorted_hnsw.build_sq8 = on` - lower-hop synthetic contract (`hop_weight = 0.05`) On the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), the monolithic `10M x 64D` point then completed cleanly: - `load_data`: `787.809 s` - `build_indexes`: `846.795 s` - kept temp cluster: `/home/ubuntu/graphrag_tmp/graph_rag_y8cuntf9` That is a real constrained-memory result: the same host that previously hit early large-scale frontiers now built the full monolithic `sorted_hnsw` index on `10,000,000` rows without being OOM-killed. The first query-only reuse pass on that exact built graph (`query_count=4`, `runs=1`, `ann_k=256`, `top_k=32`, `ef_search=128`, `sorted_heap_only`) gave: - depth 1 - SQL path: `1072.525 ms`, `100.0% / 100.0%` - unified GraphRAG path: `840.607 ms`, `100.0% / 100.0%` - depth 2 - SQL path: `1070.437 ms`, `75.0% / 100.0%` - unified GraphRAG path: `2055.479 ms`, `75.0% / 100.0%` - depth 3 - SQL path: `1050.793 ms`, `50.0% / 100.0%` - unified GraphRAG path: `2070.894 ms`, `50.0% / 100.0%` - depth 4 - SQL path: `1080.123 ms`, `50.0% / 100.0%` - unified GraphRAG path: `2066.717 ms`, `50.0% / 100.0%` - depth 5 - SQL path: `1064.405 ms`, `75.0% / 100.0%` - unified GraphRAG path: `2084.155 ms`, `75.0% / 100.0%` So the `10M x 64D` monolithic branch is now narrowed sharply: - constrained-memory monolithic build is **viable** - quality stayed aligned between the SQL baseline and the unified GraphRAG path - the remaining issue is not build survival and not quality drift - the remaining issue is latency: at depth `2+`, the current generic unified GraphRAG path is still about `2x` slower than the SQL path baseline on this host and graph That is exactly why the next scale story should move toward segmentation + pruning rather than trying to turn one monolithic `10M` graph into the final operating model for small hosts. I then added an opt-in stage breakdown path to the multidepth harness via `--report-stage-stats`, using `sorted_heap_graph_rag_stats()` after each unified path-aware call. 
On the local `1M x 64D` lower-hop point (`hop_weight=0.05`, `ann_k=256`, `top_k=32`, `ef_search=128`), that narrowed the runtime picture sharply: - depth 2: - end-to-end unified GraphRAG: `109.222 ms` - internal stage stats: - `ann_ms = 107.590` - `expand_ms = 0.369` - `rerank_ms = 0.004` - depth 5: - end-to-end unified GraphRAG: `110.507 ms` - internal stage stats: - `ann_ms = 109.178` - `expand_ms = 0.691` - `rerank_ms = 0.011` So the current `1M x 64D` depth cost is not expansion-bound. On the widened contract, almost all measured time is already in the ANN seed stage; the multi-hop expansion and rerank work remain sub-millisecond. I then added an opt-in low-memory `sorted_hnsw` build path via `SET sorted_hnsw.build_sq8 = on`. This keeps the final on-disk/index-query contract the same, but the graph is built from SQ8-compressed build vectors instead of a full float32 build slab. The tradeoff is one extra heap scan during `CREATE INDEX`. Bounded local A/B on the same `1M x 64D` lower-hop point (`m=16`, `ef_construction=64`, `ann_k=256`, `top_k=32`, `ef_search=128`, `sorted_heap_only`, streamed load): - `build_sq8=off` - `load_data`: `28.044 s` - `build_indexes`: `48.606 s` - unified depth-5 GraphRAG: `111.578 ms`, `87.5% / 100.0%` - `build_sq8=on` - `load_data`: `27.935 s` - `build_indexes`: `46.541 s` - unified depth-5 GraphRAG: `110.541 ms`, `87.5% / 100.0%` So the first retained result is narrow but real: - the low-memory build path did **not** regress the measured `1M x 64D` GraphRAG quality point - it slightly improved build time on that point instead of paying a visible penalty for the extra heap scan - and by construction it cuts the build-vector slab from `4 * N * D` bytes to `1 * N * D` - `10M x 64D`: about `2.56 GiB -> 0.64 GiB` - `10M x 384D`: about `15.36 GiB -> 3.84 GiB` So the narrow conclusion is: - the multi-hop GraphRAG path itself survives to at least `1M` rows and gives measured query latencies there - the old `10M` frontier was **build-bound**, not Python-memory-bound, not CSV-generation-bound, and not an early PostgreSQL failure - the current build optimization moved that frontier enough to obtain the first real `10M` query numbers on a cheap-build point - the remaining `10M x 32D` problem is no longer "can it finish?" and no longer just "is the HNSW build too weak?" - the stronger falsifier is that even exact seeds fail at that scale and dimensionality, so the next branch is a different scale contract, not a narrower HNSW tuning loop - `64D` is the first scale contract that looks healthy locally at `1M` - on the current AWS host, the `10M x 64D` branch now clears ingest, build, and the first query pass; the remaining cheap-build frontier is deep-path quality, not allocator failure I then added a segmented multidepth harness in `scripts/bench_graph_rag_multidepth_segmented.py`. This is the first partitioning benchmark, but it is intentionally **harness-side**, not an extension API: current GraphRAG helpers require a concrete `sorted_heap` table, so the benchmark fans out across multiple `sorted_heap` shards and merges the shard-local top-k rows in Python. 
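Conceptually, that per-query fanout is a shard-local top-k followed by a global merge of the shard rows. A minimal SQL sketch of the shape, assuming two illustrative shard tables and pgvector-style `<=>` distance purely for readability; the harness itself performs this merge in Python over the GraphRAG helper output:

```sql
-- Illustrative only: shard table names, the reuse of a stored embedding as the
-- stand-in query vector, and the pgvector-style <=> operator are assumptions.
WITH q AS (SELECT embedding AS v FROM facts_sh_0 LIMIT 1)   -- stand-in query vector
(SELECT id, embedding <=> (SELECT v FROM q) AS dist
   FROM facts_sh_0 ORDER BY dist LIMIT 32)                  -- shard-local top-k
UNION ALL
(SELECT id, embedding <=> (SELECT v FROM q) AS dist
   FROM facts_sh_1 ORDER BY dist LIMIT 32)
ORDER BY dist                                               -- global merge of shard results
LIMIT 32;                                                   -- final top_k
```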
The local `1M x 64D` lower-hop point (`200K` pairs, `5` hops, `32` queries, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`) gives a clean first answer: - monolithic unified GraphRAG: - depth 1: `50.104 ms`, `100.0% / 100.0%` - depth 5: `121.524 ms`, `81.2% / 100.0%` - segmented into `8` shards, **all-shard fanout**: - build_indexes: `57.885 s` versus monolithic `45.714 s` - depth 1: `87.677 ms`, `100.0% / 100.0%` - depth 5: `142.472 ms`, `81.2% / 100.0%` - segmented into `8` shards, **exact routing** to the owning shard: - depth 1: `10.574 ms`, `100.0% / 100.0%` - depth 5: `16.822 ms`, `100.0% / 100.0%` That makes the partitioning lesson explicit: - segmentation by itself is **not** a free latency win - all-shard fanout preserves quality on this benchmark, but pays a clear fanout tax - the real win appears only when routing/pruning can avoid most shards - so the next scale story should be "partitioning + pruning contract", not just "more shards" This is the correct shape for large knowledge bases: - on constrained hosts, sharding bounds per-index build memory - on real workloads, the query path must also prune by tenant / knowledge-base / relation family / time window (or a future segment router), otherwise the system only trades one monolith for a broad fanout query That local result transferred cleanly to the constrained AWS host on the first full `10M x 64D` streamed segmented run (`2M` pairs, `5` hops, `8` shards, `hop_weight=0.05`, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`): - streamed segmented build-only: - `generate_csv`: `0.000 s` - `load_data`: `500.474 s` - `build_indexes`: `784.778 s` - segmented, `route=exact` query-only reuse on the same built graph: - depth 1: `126.057 ms`, `100.0% / 100.0%` - depth 2: `261.986 ms`, `75.0% / 100.0%` - depth 3: `259.794 ms`, `75.0% / 100.0%` - depth 4: `258.879 ms`, `75.0% / 100.0%` - depth 5: `258.766 ms`, `100.0% / 100.0%` - segmented, `route=all` query-only reuse on the same built graph: - depth 1: `898.440 ms`, `100.0% / 100.0%` - depth 2: `2090.866 ms`, `75.0% / 100.0%` - depth 3: `2089.650 ms`, `50.0% / 100.0%` - depth 4: `2088.114 ms`, `50.0% / 100.0%` - depth 5: `2093.652 ms`, `75.0% / 100.0%` That gives the first full large-scale constrained-memory comparison on the same AWS box: - monolithic `10M x 64D`, low-memory build: - depth 1 unified GraphRAG: `840.607 ms`, `100.0% / 100.0%` - depth 5 unified GraphRAG: `2084.155 ms`, `75.0% / 100.0%` - segmented `10M x 64D`, `route=all`: - effectively the same latency/quality envelope as the monolith - segmented `10M x 64D`, `route=exact`: - about `6.7x` faster than the monolith at depth 1 - about `8.1x` faster than the monolith at depth 5 - and still stable at `100.0% / 100.0%` for depth 5 on this benchmark So the large-scale result now matches the local `1M` lesson: - segmentation without pruning is not a performance story - streamed segmented ingest is operationally better than front-loaded shard CSV materialization - segmented routing is the first constrained-memory scale path that improves both build viability and query latency on the same host The next local step was to move routing out of benchmark-only Python and into a real SQL reference path. 
On a kept local `5K`-row segmented smoke cluster (`4` shards, `64D`, `ann_k=32`, `top_k=8`, `ef_search=64`), the new range-routed SQL path matched the segmented SQL merge baseline on quality and row counts: - segmented SQL merge, `route=exact`, depth 5: - `0.215 ms`, `100.0% / 100.0%` - metadata-routed SQL wrapper, same exact-route point: - `0.245 ms`, `100.0% / 100.0%` So the current reference stack is now: - `sorted_heap_graph_rag_segmented(...)` for explicit candidate shard arrays - `sorted_heap_graph_rag_routed(...)` for simple metadata-driven range routing The next local routing pass added an exact-key companion for tenant / KB style selection. On another kept local `5K`-row segmented smoke cluster (`4` shards, `64D`, `ann_k=32`, `top_k=8`, `ef_search=64`), the exact-key routed path stayed aligned with the exact-route segmented SQL merge path: - exact-route segmented SQL merge, depth 5: - `0.183 ms`, `100.0% / 100.0%` - exact-key routed SQL wrapper, same point: - `0.202 ms`, `100.0% / 100.0%` That is still not the final routing story for large knowledge bases, but it is the first usable SQL-level bridge from harness-side segmentation to productized segmented GraphRAG, with both: - range-based routing - exact-key routing ### Monolithic vs segmented exact-routed: constrained-memory `10M x 64D` comparison This table consolidates the AWS `10M x 64D` results already reported above (monolithic in "Larger synthetic GraphRAG scale envelope", segmented in "Segmented GraphRAG at scale") into a direct side-by-side. Both runs used the same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), the same workload (`2M` pairs x `5` hops, `hop_weight=0.05`), and the same build/query knobs (`m=16`, `ef_construction=64`, `build_sq8=on`, `ann_k=256`, `top_k=32`, `ef_search=128`). Segmented mode used `8` shards. | | Monolithic | Segmented exact | Segmented all | |---|:---:|:---:|:---:| | load_data | 787.8 s | 500.5 s | — | | build_indexes | 846.8 s | 784.8 s | — | | depth 1 p50 | 840.6 ms | **126.1 ms** | 898.4 ms | | depth 5 p50 | 2084.2 ms | **258.8 ms** | 2093.7 ms | | depth 1 hit@1/hit@k | 100%/100% | 100%/100% | 100%/100% | | depth 5 hit@1/hit@k | 75%/100% | **100%/100%** | 75%/100% | | speedup at depth 5 | 1x | **8.1x** | ~1x | Exact commands (monolithic build + query-only reuse): ``` NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on TABLE_SCOPE=sorted_heap_only \ ./scripts/bench_graph_rag_multidepth_aws.sh 65493 ``` Exact commands (segmented build + query-only reuse): ``` NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on SHARDS=8 ROUTE=both \ ./scripts/bench_graph_rag_multidepth_segmented_aws.sh 65494 ``` Conclusion: on the measured `10M x 64D` constrained-memory point, segmented exact routing is **8.1x faster** at depth 5 with **better quality** (100%/100% vs 75%/100%), while build time is comparable. All-shard fanout offers no latency benefit — the win comes entirely from shard pruning. This is not a universal claim. Exact routing requires knowing which shard owns the query entity, which maps to tenant-id / knowledge-base-id style workloads. Queries that cannot be pruned to a single shard will see all-shard-fanout latency (~1x monolithic). 
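For reference, the two SQL entry points from the reference stack above map onto this routing contract roughly as follows. These are hypothetical call shapes: only the function names come from the text, and every parameter name and value below is an illustrative assumption, not the actual API:

```sql
-- Hypothetical call shapes only; parameter names and values are assumptions.
-- Explicit candidate shards (the caller already knows which shards may own the answer):
SELECT * FROM sorted_heap_graph_rag_segmented(
    ARRAY['facts_sh_0', 'facts_sh_3'],                 -- candidate shard tables
    (SELECT embedding FROM facts_sh_0 LIMIT 1),        -- stand-in query embedding
    relation_path := ARRAY[1, 2, 3, 4, 5],
    score_mode    := 'path',
    top_k         := 32);

-- Metadata-driven routing: the wrapper picks candidate shards from a routing key
-- (range-based or exact-key, e.g. a tenant or knowledge-base identifier):
SELECT * FROM sorted_heap_graph_rag_routed(
    'tenant_42',                                       -- routing key (assumed shape)
    (SELECT embedding FROM facts_sh_0 LIMIT 1),
    relation_path := ARRAY[1, 2, 3, 4, 5],
    score_mode    := 'path',
    top_k         := 32);
```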
### Bounded fanout: routing robustness on `1M x 64D` This measures how the segmented win degrades when routing is imperfect. Instead of routing to exactly one shard (exact) or all shards (all), bounded fanout routes to K adjacent shards (always including the correct one). This simulates imperfect routing where the router knows roughly which shard, but hedges by including neighbors. Local, `200K` pairs x `5` hops, `64D`, `8` shards, `32` queries, `3` runs, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`, `hop_weight=0.05`: ``` python3 scripts/bench_graph_rag_multidepth_segmented.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --shards 8 --route {exact,bounded,all} --fanout {2,4} --hop-weight 0.05 ``` Monolithic baseline (`sorted_heap_graph_rag(...)` on same workload): ``` python3 scripts/bench_graph_rag_multidepth.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --hop-weight 0.05 --table-scope sorted_heap_only ``` Depth-5 comparison (the hardest hop): | Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | Speedup vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 120.7 ms | 81.2% | 100.0% | 1x | | segmented all (8/8) | 8 | 149.5 ms | 81.2% | 100.0% | 0.8x | | segmented bounded (4/8) | 4 | 74.1 ms | 93.8% | 100.0% | 1.6x | | segmented bounded (2/8) | 2 | 35.9 ms | 96.9% | 100.0% | 3.4x | | segmented exact (1/8) | 1 | 17.2 ms | 100.0% | 100.0% | **7.0x** | Depth-1 comparison: | Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) | |---|:---:|:---:|:---:|:---:| | monolithic | 1 table | 49.4 ms | 100.0% | 100.0% | | segmented all (8/8) | 8 | 94.3 ms | 100.0% | 100.0% | | segmented bounded (4/8) | 4 | 46.3 ms | 100.0% | 100.0% | | segmented bounded (2/8) | 2 | 22.8 ms | 100.0% | 100.0% | | segmented exact (1/8) | 1 | 11.3 ms | 100.0% | 100.0% | Observations: - Latency scales roughly linearly with the number of shards hit. - Quality degrades gracefully: bounded(2/8) still reaches 96.9% hit@1 at depth 5 vs 100.0% for exact, compared to 81.2% for monolithic/all. - Even bounded(4/8) at half the shards is 1.6x faster than monolithic and has better quality (93.8% vs 81.2% hit@1). - All-shard fanout is worse than monolithic (fanout overhead dominates). - The win is not exact-or-nothing — bounded fanout preserves most of the benefit even with imperfect routing. ### Bounded fanout at AWS `10M x 64D` scale Same AWS ARM64 host (`4 vCPU`, `8 GiB RAM`, `4 GiB swap`), same workload as the `10M x 64D` comparison above (`2M` pairs x `5` hops, `hop_weight=0.05`, `8` shards, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`). Build once, query-only reuse for each route mode. `4` queries, `1` run. Build: `load_data=500.7s`, `build_indexes=764.2s`. 
Query commands: ``` # Build once with --keep-temp --stop-after build_indexes NUM_PAIRS=2000000 MAX_DEPTH=5 QUERY_COUNT=4 RUNS=1 DIM=64 \ ANN_K=256 TOP_K=32 EF_SEARCH=128 EF_CONSTRUCTION=64 M=16 \ SHARED_BUFFERS_MB=64 MAX_WAL_SIZE_GB=16 HOP_WEIGHT=0.05 \ BUILD_SQ8=on SHARDS=8 ROUTE=exact \ EXTRA_ARGS="--stream-copy --keep-temp --stop-after build_indexes" \ bash scripts/bench_graph_rag_multidepth_segmented_aws.sh \ 65494 # Query-only reuse for each route mode: ssh "cd && python3 scripts/bench_graph_rag_multidepth_segmented.py \ --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \ --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \ --shards 8 --route {exact,bounded,all} [--fanout {2,4}] \ --backend-mode fresh --reuse-temp " ``` Depth-5 comparison (monolithic baseline from prior section): | Route mode | Shards hit | p50 (d5) | hit@1 (d5) | hit@k (d5) | vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 2084 ms | 75% | 100% | 1x | | segmented all (8/8) | 8 | 2107 ms | 75% | 100% | ~1x | | segmented bounded (4/8) | 4 | 1042 ms | 75% | 100% | **2.0x** | | segmented bounded (2/8) | 2 | 520 ms | 100% | 100% | **4.0x** | | segmented exact (1/8) | 1 | 259 ms | 100% | 100% | **8.0x** | Depth-1 comparison: | Route mode | Shards hit | p50 (d1) | hit@1 (d1) | hit@k (d1) | vs monolith | |---|:---:|:---:|:---:|:---:|:---:| | monolithic | 1 table | 841 ms | 100% | 100% | 1x | | segmented all (8/8) | 8 | 920 ms | 100% | 100% | ~1x | | segmented bounded (4/8) | 4 | 473 ms | 100% | 100% | **1.8x** | | segmented bounded (2/8) | 2 | 233 ms | 100% | 100% | **3.6x** | | segmented exact (1/8) | 1 | 137 ms | 100% | 100% | **6.1x** | The bounded-fanout gradient from the local `1M` point transfers cleanly to `10M`: - Latency scales linearly with the number of shards hit at both scales. - bounded(2/8) at `10M` is 4.0x faster than monolithic at depth 5 — close to the local 3.4x ratio. - Even bounded(4/8) at half the shards gives a 2.0x win. - Quality: hit@k stays at 100% across all modes. hit@1 varies with fanout width but stays ≥75%. - All-shard fanout remains ~1x monolithic (fanout overhead cancels the per-shard size reduction). Conclusion: bounded fanout remains materially useful on this measured `10M` synthetic point. A router that can narrow to 2 out of 8 candidate shards captures ~50% of the exact-routing speedup. The performance gradient is smooth on this benchmark, not a cliff. ### Routing-miss tolerance on `1M x 64D` This measures what happens when the router sometimes picks the wrong shards — the correct shard is not always included in the candidate set. Uses `--route bounded_recall --fanout 2 --recall-pct N` where N% of queries get the correct shard and the rest get 2 adjacent wrong shards. Miss schedule is deterministic per-query via SHA-256 hash (seed=42). 
Local, `200K` pairs x `5` hops, `64D`, `8` shards, `32` queries, `3` runs, `ann_k=256`, `top_k=32`, `ef_search=128`, `m=16`, `ef_construction=64`, `build_sq8=on`, `hop_weight=0.05`: ``` python3 scripts/bench_graph_rag_multidepth_segmented.py \ --num-pairs 200000 --max-depth 5 --query-count 32 --runs 3 \ --dim 64 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on \ --shards 8 --route bounded_recall --fanout 2 \ --recall-pct {100,90,75,50,25,0} --hop-weight 0.05 ``` Depth-5 routing-miss tolerance (fanout=2, 8 shards): | Router recall | Queries hitting correct shard | p50 (d5) | hit@1 (d5) | hit@k (d5) | |:---:|:---:|:---:|:---:|:---:| | 100% | 32/32 | 54.5 ms | 96.9% | 100.0% | | 90% | 29/32 | 64.7 ms | 87.5% | 90.6% | | 75% | 23/32 | 57.0 ms | 68.8% | 71.9% | | 50% | 18/32 | 45.4 ms | 53.1% | 56.2% | | 25% | 10/32 | 79.1 ms | 31.2% | 31.2% | | 0% | 0/32 | 44.8 ms | 0.0% | 0.0% | | (ref: monolithic) | — | 120.7 ms | 81.2% | 100.0% | Observations: - Quality tracks router recall nearly linearly. There is no sharp cliff. - A router with 90% recall keeps 87.5% hit@1 — close to the monolithic baseline's 81.2%, while remaining 2-3x faster. - A router with 75% recall gives 68.8% hit@1 — below monolithic quality, so the latency win comes at a quality cost. - Latency stays roughly constant across recall levels because the fanout (2 shards) is fixed. The router-miss doesn't cost extra I/O — it just returns wrong answers. - This means: **routing quality determines answer quality, not latency**. The segmented latency win is structural (fewer rows per shard), but the quality win depends entirely on the router including the correct shard. ### Routing-miss tolerance at AWS `10M x 64D` Same AWS ARM64 host and workload as the bounded-fanout section above. Build once (`load=500.5s`, `build=803.5s`), query-only reuse. `4` queries, `1` run, `fanout=2`, `8` shards. ``` # Build: same as bounded-fanout section # Query-only reuse with bounded_recall: ssh "cd && python3 scripts/bench_graph_rag_multidepth_segmented.py \ --port 65494 --num-pairs 2000000 --max-depth 5 --query-count 4 --runs 1 \ --dim 64 --hop-weight 0.05 --ann-k 256 --top-k 32 --ef-search 128 \ --ef-construction 64 --m 16 --build-sq8 on --shared-buffers-mb 64 \ --shards 8 --route bounded_recall --fanout 2 --recall-pct {90,75} \ --backend-mode fresh --reuse-temp " ``` Depth-5 routing-miss tolerance at `10M`: | Route mode | Router recall | p50 (d5) | hit@1 (d5) | hit@k (d5) | |---|:---:|:---:|:---:|:---:| | bounded optimistic | 100% (4/4 hit) | 519 ms | 100% | 100% | | bounded_recall | 90% (3/4 hit) | 521 ms | 75% | 75% | | bounded_recall | 75% (3/4 hit) | 522 ms | 75% | 75% | | (ref: monolithic) | — | 2084 ms | 75% | 100% | | (ref: exact) | 100% | 259 ms | 100% | 100% | The 90% and 75% recall points produce the same result because with only `4` queries, SHA-256 hash gives `3/4` correct-shard inclusion at both settings. The `1` missed query returns wrong results, giving `75%` hit@1. 
Despite limited query-count resolution, the directional signal is clear:

- Bounded(2/8) latency stays at ~520 ms regardless of recall (~4x faster than monolithic)
- One missed query out of four drops hit@1 from 100% to 75% — matching the monolithic baseline's own 75% hit@1
- hit@k drops from 100% to 75% (monolithic gets 100% hit@k because its broader candidate set is more likely to include the target)

This confirms the local finding: **routing quality determines answer quality, not latency.** The latency win is structural and survives routing errors. On the measured 4-query point the evidence is narrow: at 90% router recall, bounded routing matches the monolithic hit@1 (75%) while staying 4x faster, and the true crossover is not shown to be above that point. A higher `query_count` run would give finer crossover resolution.

### Current AWS GraphRAG benchmark (`person -> parent -> city`, stable fact contract)

Repo-owned harness:

- `REMOTE_PYTHON=/path/to/python QUERY_COUNT=64 RUNS=3 NUM_PAIRS=5000 DIM=384 ANN_K=64 TOP_K=10 EF_SEARCH=128 EF_CONSTRUCTION=200 M=24 PGV_EF_SEARCH=64 ZVEC_EF=64 QDRANT_EF=64 SHARED_BUFFERS_MB=64 BACKEND_MODE=fresh ./scripts/bench_graph_rag_multihop_aws.sh /path/to/repo 65492`

AWS ARM64 host (`4 vCPU`, `8 GiB RAM`), deterministic fact graph, `5K` chains / `10K` rows total, `384D`, top-10, `64` queries, `3` runs. This is the current portable stable multihop GraphRAG point for the narrow fact-shaped contract.

| Method | p50 latency | hit@1 | hit@k | Notes |
|--------|:-----------:|:-----:|:-----:|-------|
| Heap two-hop SQL | 1.088 ms | 75.0% | 96.9% | exact rerank over expanded heap set |
| sorted_heap_expand_twohop_rerank() | 0.952 ms | 75.0% | 96.9% | older city-only helper |
| sorted_heap_graph_rag_twohop_scan() | 1.012 ms | 75.0% | 96.9% | older city-only wrapper |
| sorted_heap SQL `pathsum` baseline | 1.204 ms | 98.4% | 98.4% | same ANN seeds, `hop1_distance + hop2_distance` |
| **sorted_heap_expand_twohop_path_rerank()** | **0.955 ms** | **98.4%** | **98.4%** | fused path-aware helper |
| sorted_heap_graph_rag_twohop_path_scan() | 1.018 ms | 98.4% | 98.4% | fused path-aware wrapper |
| pgvector HNSW + heap expansion | 1.422 ms | 85.9% | 85.9% | path-aware rerank, `ef_search=64` |
| zvec HNSW + heap expansion | 1.720 ms | 100.0% | 100.0% | path-aware rerank, `ef=64` |
| Qdrant HNSW + heap expansion | 3.435 ms | 100.0% | 100.0% | path-aware rerank, `hnsw_ef=64` |

The new AWS result matches the local diagnostic cleanly: the dominant quality loss on this workload was the old hop-2-only rerank contract, not the seed ANN frontier. The path-aware helper preserves the quality gain on ARM64 with only trivial latency cost versus the older helper. On the apples-to-apples path-aware contract, the portable frontier is now:

- `sorted_heap` fastest
- `zvec` and Qdrant strongest on answer quality
- `pgvector` still behind on both latency and quality at this operating point

One intermediate AWS all-engines rerun temporarily dropped the `sorted_heap` path-aware rows to `96.9%` / `96.9%`. An immediate `sorted_heap`-only control and a second full all-engines rerun both returned the stable `98.4% / 98.4%` point above, so the published table uses the confirmed rerun rather than the single outlier. An AWS repeated-build protocol then tightened that confidence band on the same balanced `5K` point.
Using three independent fresh builds: - `sorted_heap_expand_twohop_path_rerank()` median `0.962 ms`, range `0.956-0.965 ms`, `hit@1/hit@k = 98.4/98.4` on all three builds - `sorted_heap_graph_rag_twohop_path_scan()` median `1.025 ms`, range `1.018-1.043 ms`, `hit@1/hit@k = 98.4/98.4` on all three builds - `pgvector` path-aware parity row median `1.434 ms`, `hit@1/hit@k` `84.4-89.1` - `zvec` path-aware parity row median `1.711 ms`, `100.0/100.0` - `Qdrant` path-aware parity row median `3.355 ms`, `100.0/100.0` So on the current portable `5K` point, the earlier AWS outlier now looks like an anomaly rather than a broad instability. The balanced `sorted_heap` path-aware rows stayed fixed across all three rebuilds. The larger `10K`-chain AWS rerun now tells a different story than the older city-only benchmark. At the same portable point: - heap two-hop SQL: `1.319 ms`, `hit@1 71.9%`, `hit@k 92.2%` - city-only `sorted_heap_graph_rag_twohop_scan()`: `1.197 ms`, `hit@1 73.4%`, `hit@k 93.8%` - SQL `pathsum` baseline: `1.436 ms`, `hit@1 96.9%`, `hit@k 98.4%` - `sorted_heap_expand_twohop_path_rerank()`: `1.185 ms`, `hit@1 96.9%`, `hit@k 98.4%` - `sorted_heap_graph_rag_twohop_path_scan()`: `1.212 ms`, `hit@1 96.9%`, `hit@k 98.4%` So the old larger-scale caveat now narrows materially: the main `10K` loss was also the city-only rerank contract, not a fundamental collapse of the seed frontier at that scale. An AWS repeated-build protocol then checked whether the remaining `10K` difference was really a build-variance problem. Using three independent fresh builds on the same `10K` path-aware point: - `sorted_heap_expand_twohop_path_rerank()` median `1.177 ms`, range `1.148-1.191 ms`, `hit@1/hit@k = 95.3/96.9` on all three builds - `sorted_heap_graph_rag_twohop_path_scan()` median `1.236 ms`, range `1.211-1.240 ms`, `hit@1/hit@k = 95.3/96.9` on all three builds - `pgvector` path-aware parity row median `1.667 ms`, `hit@1/hit@k` `76.6-82.8` - `zvec` path-aware parity row median `2.788 ms`, `98.4/100.0` - `Qdrant` path-aware parity row median `3.818 ms`, `98.4/100.0` So the larger `10K` AWS point is now repeated-build stable too. The remaining issue is scale frontier, not build instability: the `10K` quality band is lower than `5K`, but it stayed fixed across fresh builds. An exact-seed diagnostic on the local `5K` and `10K` points did not improve `hit@1` or `hit@k` versus the ANN-seeded `sorted_heap` helper. So on this benchmark shape, the remaining gap is not explained by ANN approximation alone. The stronger result is that seed coverage itself was identical for ANN and exact seeds: `98.4%` at `5K` and `96.9%` at `10K`. So the remaining loss is downstream of seeding. The new rerank-rank diagnostic narrows that further: at `5K`, the correct city is still within the top 6 for 95% of reachable queries, and at `10K` it is still within the top 3 for 95% of reachable queries. The quality drop is therefore driven by a few severe outliers (`max rank 17` at `5K`, `20` at `10K`), not by a broad collapse. A path-aware SQL rerank baseline then changed the picture materially. Keeping the same ANN seeds and the same two-hop expansion, but scoring candidates as `hop1_distance + hop2_distance`, moved the local balanced points to: - `5K`: `0.957 ms`, `hit@1 98.4%`, `hit@k 98.4%` - `10K`: `1.179 ms`, `hit@1 95.3%`, `hit@k 96.9%` That branch is now implemented in the extension and verified on both local and AWS ARM64 runs. 
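A minimal sketch of what that path-aware (pathsum) scoring contract looks like in plain SQL, assuming an illustrative `facts(subject, relation, object, embedding)` schema and pgvector-style distance syntax; the extension's fused helper implements this contract internally rather than via this literal query:

```sql
-- Illustrative schema and operators only: the facts table layout and the <=>
-- distance are stand-ins. The point is the scoring rule: hop1 + hop2, not hop2 alone.
WITH q AS (
  SELECT embedding AS v FROM facts LIMIT 1            -- stand-in query embedding
), seeds AS (
  SELECT f.subject, f.object,
         f.embedding <=> (SELECT v FROM q) AS hop1_distance
  FROM facts f
  ORDER BY hop1_distance
  LIMIT 64                                            -- ann_k seed budget
)
SELECT h2.object AS answer,
       s.hop1_distance + (h2.embedding <=> (SELECT v FROM q)) AS path_score
FROM seeds s
JOIN facts h2 ON h2.subject = s.object                -- two-hop expansion
                                                      -- (a real path also constrains h2.relation)
ORDER BY path_score                                   -- rank by summed per-hop distance
LIMIT 10;                                             -- top_k
```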
The fused path-aware helper measured: - local `5K`: `0.726 ms`, `hit@1 98.4%`, `hit@k 98.4%` - local `10K`: `0.823 ms`, `hit@1 95.3%`, `hit@k 96.9%` - AWS `5K`: `0.955 ms`, `hit@1 98.4%`, `hit@k 98.4%` - AWS `10K`: `1.185 ms`, `hit@1 96.9%`, `hit@k 98.4%` So the current strongest portable GraphRAG result is no longer the SQL baseline or the old city-only helper. It is the fused path-aware helper. ### Current real code-corpus GraphRAG reference benchmark (`cogniformerus` CrossFile) Repo-owned harnesses: - `python3 scripts/bench_graph_rag_code_corpus.py --runs 3 --backend-mode fresh --ann-k 16 --top-k 4` - `python3 scripts/repeat_graph_rag_code_corpus_builds.py --repeats 3 --runs 3 --backend-mode fresh` - `REMOTE_PYTHON=/path/to/python REPEATS=3 RUNS=3 BACKEND_MODE=fresh bash scripts/repeat_graph_rag_code_corpus_builds_aws.sh /path/to/repo 65320` This benchmark uses the actual `cogniformerus` source tree (`40` files, `840` rows after chunk + summary expansion) and the real CrossFile prompts from `butler_code_test.cr`. The current real-corpus conclusion is **not** a single universal winner. The frontier splits by embedding mode: | Mode | Best case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes | |------|-----------|:------------------------:|:----------------------:|:----------------:|:---------:|-------| | generic | `prompt_summary_snippet_py` | `0.613 ms` | `0.955 ms` | `100.0%` | `100.0%` | symbol-aware variant is strictly slower with no quality gain | | code-aware | `prompt_symbol_summary_snippet_py` | `0.963 ms` | `1.541 ms` | `100.0%` | `100.0%` | exact prompt-symbol rescue is required in summary seeding | The most important diagnostic result was the old code-aware miss: - `prompt_summary_snippet_py` on `facts_sh` - local repeated-build: `97.6%` keyword coverage, `83.3%` full hits - AWS repeated-build: `97.6%`, `83.3%` - `prompt_symbol_summary_snippet_py` on `facts_sh` - local repeated-build: `100.0%`, `100.0%` - AWS repeated-build: `100.0%`, `100.0%` So the code-aware quality win is now both repeated-build stable and cross-environment stable. The change in winner is not a local-only artifact. ### Larger in-repo `cogniformerus` transfer gate The smaller `40`-file code-corpus slice above is a useful stable benchmark, but it is not the only in-repo transfer check anymore. The same repeated-build protocol was rerun on the full `cogniformerus` repository (`183` Crystal files), still using the real CrossFile prompts from `butler_code_test.cr`. 
Control point at the old tiny-budget contract (`top_k=4`, `1` fresh build): | Mode | Best case | Local p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:---------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_summary_snippet_py` | `0.770 ms` | `87.1%` | `66.7%` | `3.67` | larger corpus exposes a result-budget cliff | | code-aware | `prompt_symbol_summary_snippet_py` | `1.824 ms` | `87.6%` | `66.7%` | `4.00` | same cliff under code-aware embeddings | Bounded recovery point (`top_k=8`, `3` fresh builds): | Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_summary_snippet_py` | `0.819 ms` | `100.0%` | `100.0%` | `6.33` | larger in-repo Crystal transfer now verified | | code-aware | `prompt_symbol_summary_snippet_py` | `1.814 ms` | `100.0%` | `100.0%` | `7.50` | same winner, but needs the larger final budget | Interpretation: - the current real code-corpus contracts do transfer beyond the tiny `40`-file slice - the dominant larger-corpus issue on the in-repo Crystal side is result budget, not a new retrieval failure - this larger-corpus Crystal gate is now covered and serves as a transfer check for the benchmark-side code-retrieval logic, not as the stable release contract for GraphRAG ### Mixed-language external code-corpus GraphRAG reference benchmark (`pycdc`) The code-corpus harness now also supports: - JSON question fixtures - configurable source extensions - quoted local `#include "..."` dependency edges for C/C++ corpora The first mixed-language adversary corpus was `pycdc`, using a repo-owned fixture in `scripts/fixtures/graph_rag_pycdc_questions.json`. This run used the real `pycdc` source tree (`138` files, `1281` rows after chunk + summary expansion, `72` local dependency edges) and `3` fresh builds at `top_k=8`. | Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes | |------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------| | generic | `prompt_symbol_summary_snippet_py` | `0.850 ms` | `90.0%` | `60.0%` | `6.40` | fastest mixed-language point, but it does not close the corpus | | code-aware | `prompt_compactseed_require_summary_snippet_fn` | `8.006 ms` | `100.0%` | `100.0%` | `5.80` | helper-backed compact lexical seed + one-hop include rescue closes the corpus | Interpretation: - the mixed-language gate is now covered for a real `~/Projects/C` corpus - the result split is sharper than on the Crystal corpora: - the fast generic path plateaus below perfect quality - the code-aware include-rescue path closes the corpus, but at a much higher latency - so `~/Projects/C` is now covered as part of the broader code-corpus reference matrix, even though the fastest generic point remains partial ### Archive-side code-corpus GraphRAG reference benchmark (`ninja/src`) The same widened harness was then pointed at an archive-side corpus under `~/SrcArchives`: `apple/ninja/src`. This run used a second repo-owned fixture in `scripts/fixtures/graph_rag_ninja_questions.json` and the local include graph inside `ninja/src` (`103` files, `1757` rows after chunk + summary expansion, `282` dependency edges). 
Initial smoke at `top_k=8`:

| Mode | Best fast case | Local p50 | Keyword coverage | Full hits | Notes |
|------|----------------|:---------:|:----------------:|:---------:|-------|
| generic | `prompt_summary_snippet_py` | `0.898 ms` | `95.0%` | `80.0%` | already close without rescue |
| code-aware | `prompt_summary_snippet_py` | `0.928 ms` | `85.0%` | `80.0%` | code-aware mode is weaker on this corpus |

Bounded budget probe (`top_k=12`, `3` fresh builds):

| Mode | Best case | Local repeated-build p50 | Keyword coverage | Full hits | Avg returned rows | Notes |
|------|-----------|:------------------------:|:----------------:|:---------:|:-----------------:|-------|
| generic | `prompt_summary_snippet_py` | `0.914 ms` | `100.0%` | `100.0%` | `7.80` | archive-side gate closes with a small result-budget bump |
| code-aware | `prompt_summary_snippet_py` | `0.871 ms` | `85.0%` | `80.0%` | `7.60` | still not the winner on this corpus |

Interpretation:

- the `~/SrcArchives` side is now covered by a real repeated-build gate
- unlike `pycdc`, this archive corpus does **not** need a dependency-rescue contract to close
- the winner is the simple generic summary-snippet path at a slightly larger final budget (`top_k=12`)
- the larger real-corpus verification matrix for the narrow `0.13` fact-graph release now spans:
  - `~/Projects/Crystal`
  - `~/Projects/C`
  - `~/SrcArchives`

### External folding stress corpus for GraphRAG reference logic (`folding/src`)

The same harness was then pointed at a second real code corpus outside this repository: `folding/src`, with prompts from `butler_folding_test.cr`. This is not the primary publishable frontier for the repository, but it is a strong adversary corpus because it falsifies overfit retrieval contracts quickly.

Current repeated-build results:

| Mode | Case | Local repeated-build p50 | AWS repeated-build p50 | Keyword coverage | Full hits | Notes |
|------|------|:------------------------:|:----------------------:|:----------------:|:---------:|-------|
| generic | `prompt_summary_snippet_py` | `1.048 ms` | `1.540 ms` | `90.5%` | `83.3%` | fast baseline drifts below perfect quality on this corpus |
| generic | `prompt_compactseed_require_summary_snippet_fn` | `5.940 ms` | `8.839 ms` | `100.0%` | `100.0%` | compact lexical seed table + helper-backed one-hop `REQUIRES_FILE` rescue |
| generic | `prompt_lexseed_require_summary_snippet_fn` | `28.266 ms` | `41.960 ms` | `100.0%` | `100.0%` | historical full-summary lexical rescue, now dominated |
| code-aware | `prompt_summary_snippet_py` | `1.080 ms` | `1.775 ms` | `79.8%` | `66.7%` | worse baseline than the primary `cogniformerus` corpus |
| code-aware | `prompt_compactseed_require_summary_snippet_fn` | `5.804 ms` | `8.392 ms` | `100.0%` | `100.0%` | compact lexical seed table + helper-backed one-hop `REQUIRES_FILE` rescue |
| code-aware | `prompt_lexseed_require_summary_snippet_fn` | `36.676 ms` | `60.457 ms` | `100.0%` | `100.0%` | historical full-summary lexical rescue, now dominated |

Interpretation:

- the external folding miss was a real seed-selection problem, not a snippet extraction bug
- the rescue is now verified on both local Apple Silicon and AWS ARM64
- the current documented external rescue is no longer the old full-summary lexical path; it is the compact lexical-seed table variant
- compact lexical seeding keeps `100.0% / 100.0%` while cutting the old rescue cost by about `4.8x` locally and `4.7-7.2x` on AWS, depending on mode
- an isolated local timing split shows the helper-backed rescue cost is still dominated by lexical-seed + `REQUIRES_FILE` fetch work (`~10.7-11.0 ms/query`), with snippet postprocessing as a secondary cold-start cost (`~7.7-8.0 ms/query`)
- the old full-summary lexical rescue remains useful as a diagnostic, but it is no longer the external default frontier
- even the compact rescue is still slower than the primary in-repo winners, so it does not replace them as the default GraphRAG contract

### Legacy/manual IVF-PQ benchmark

The sections below are still useful for the explicit IVF-PQ API (`svec_ann_scan`), but they are no longer the default ANN baseline for the repository. These measurements target the legacy/manual vector path, not the planner-integrated `sorted_hnsw` Index AM.

All IVF-PQ benchmarks below use `svec_ann_scan` (C-level) with residual PQ. 1 Gi k8s pod, PostgreSQL 18.

### 103K vectors, 2880-dim (Gutenberg corpus)

Residual PQ (M=720, dsub=4), 256 IVF partitions. 100 cross-queries (self-match excluded):

| Config | R@1 | Recall@10 | Avg latency |
|---|---|---|---|
| nprobe=1, PQ-only | 54% | 48% | 5.5 ms |
| nprobe=3, PQ-only | 79% | 71% | 8 ms |
| nprobe=3, rerank=96 | 82% | 74% | 10 ms |
| nprobe=5, rerank=96 | 89% | 86% | 12 ms |
| nprobe=10, rerank=200 | 97% | 94% | 22 ms |

Self-query (vector in dataset): R@1 = 100% at nprobe=3 / 8 ms.

### 10K vectors, 2880-dim (float32 precision test)

Same corpus, pure svec (float32), nlist=64, M=720 residual PQ. 100 cross-queries:

| Config | R@1 | Recall@10 |
|---|---|---|
| nprobe=1, PQ-only | 56% | 56% |
| nprobe=3, PQ-only | 72% | 82% |
| nprobe=5, rerank=96 | 93% | 93% |
| nprobe=10, rerank=200 | 99% | 99% |

### float32 vs float16 precision impact

Tested the same 10K Gutenberg vectors in two configurations:

- **float32 (svec):** native 32-bit storage, independently trained codebooks
- **float16-degraded (hsvec):** svec → hsvec → svec roundtrip, independently trained

Result: **no measurable recall difference**. Float16 precision loss (~1e-7) is 1000× smaller than typical distance gaps between consecutive neighbors (~1e-4). The recall bottleneck is PQ quantization and IVF routing, not input precision. This confirms hsvec is a safe storage choice for ANN workloads.

### CRUD performance (500K rows, svec(128), prepared mode)

Throughput is reported relative to a plain heap table (100% = heap parity).

| Operation | eager (% of heap) | lazy (% of heap) | Notes |
|-----------|:-----------------:|:----------------:|-------|
| SELECT PK | 85% | 85% | Index Scan via btree |
| SELECT range 1K | 97% | -- | Custom Scan pruning (eager only) |
| Bulk INSERT | 100% | 100% | Always eager |
| DELETE + INSERT | 63% | 63% | INSERT always eager |
| UPDATE non-vec | 46% | **100%** | Lazy skips zone map flush |
| UPDATE vec col | 102% | **100%** | Parity both modes |
| Mixed OLTP | 83% | **97%** | Near-parity with lazy |

Eager mode (default) maintains zone maps on every UPDATE for scan pruning. Lazy mode (`sorted_heap.lazy_update = on`) trades scan pruning for UPDATE parity with heap. Compact/merge restores pruning.

### Self-query vs cross-query

**Self-query**: the query vector is in the dataset (the typical RAG case -- you embedded documents, now you search them). The vector is always found as its own closest neighbor, so R@1 = 100%.

**Cross-query**: the query vector is NOT in the dataset (e.g., a user question embedded at search time). R@1 depends on nprobe and PQ fidelity.

When comparing benchmarks, verify whether self-match is included or excluded. The tables above use cross-query (self-match excluded) for honest comparison.
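A minimal numpy sketch of the distinction under brute-force cosine ranking; the array sizes and seed are hypothetical, not the benchmark corpus. When the query row is drawn from the dataset, dropping the self-match from the ranking (the `lim := 11`, positions 2–11 convention described in the methodology notes below) is what turns a trivially perfect R@1 into a meaningful cross-query number.

```python
# Sketch of self-query vs cross-query ranking with brute-force cosine.
# `vectors` is a hypothetical (n, d) array of unit-normalised embeddings;
# queries are drawn from the dataset itself, as in the tables above.
import numpy as np

def top_k(vectors: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest rows by cosine similarity (rows pre-normalised)."""
    sims = vectors @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 128)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

qi = 42                                        # query drawn from the dataset itself
ranked = top_k(vectors, vectors[qi], k=11)     # ask for k+1, analogous to lim := 11

self_query_top1 = ranked[0]                    # == qi: self-query R@1 is trivially 100%
cross_query_top10 = ranked[ranked != qi][:10]  # drop the self-match -> positions 2-11
print(self_query_top1 == qi, cross_query_top10)
```

The same ranking produces both numbers; the only difference is whether the query's own row is allowed to count as a hit.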
---

## Methodology notes

- **EXPLAIN ANALYZE:** warm cache (pg_prewarm), average of 5 runs, actual execution time + buffer reads reported
- **pgbench:** 10 s runtime, 1 client, includes pgbench overhead (connection management, query dispatch); useful for relative throughput comparison
- **INSERT:** COPY path via `INSERT ... SELECT generate_series()`
- **Compact time:** wall-clock time for `sorted_heap_compact()` on warm data
- **Vector search:** 100 random queries from the dataset, self-match excluded by requesting `lim := 11` and taking positions 2–11. Ground truth via exact brute-force cosine (`<=>` operator). Latency measured via `clock_timestamp()` per query in a PL/pgSQL loop (20 queries, warm cache)
- **Current local `sorted_hnsw` comparison:** deterministic synthetic 10K x 384D corpus via `scripts/bench_sorted_hnsw_vs_pgvector.sh`, 3 fresh builds for PostgreSQL methods, median p50 / median recall reported (aggregation sketched below); Qdrant via `scripts/bench_qdrant_synthetic.py`, 3 warm measurement passes on one local Docker collection; zvec via `scripts/bench_zvec_synthetic.py`, 3 warm measurement passes on one local in-process collection
- **Current local real-dataset sample:** `scripts/bench_ann_real_dataset.py` on ANN-Benchmarks `nytimes-256-angular`, sampled to 10K base vectors and 20 queries. Ground truth is exact PostgreSQL `svec` heap search on the sampled corpus. Numbers above are medians across 3 full harness runs.
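As a small illustration of the median-across-builds aggregation mentioned in the last two bullets (hypothetical data and helper names; the real logic lives in the `scripts/` harnesses): per-build p50 latency and mean recall are computed first, then the reported figure is the median across the fresh builds.

```python
# Illustrative aggregation behind "median p50 / median recall across 3 fresh builds".
# `builds` is hypothetical data: per-build per-query latencies (ms) and recall@k values.
from statistics import median

def report(builds: list[dict]) -> tuple[float, float]:
    """Return (median of per-build p50 latency, median of per-build mean recall)."""
    build_p50s = [median(b["latencies_ms"]) for b in builds]
    build_recalls = [sum(b["recalls"]) / len(b["recalls"]) for b in builds]
    return median(build_p50s), median(build_recalls)

builds = [
    {"latencies_ms": [1.2, 1.3, 1.4], "recalls": [1.0, 0.9, 1.0]},
    {"latencies_ms": [1.1, 1.5, 1.3], "recalls": [1.0, 1.0, 1.0]},
    {"latencies_ms": [1.3, 1.2, 1.6], "recalls": [0.9, 1.0, 1.0]},
]
print(report(builds))  # -> (1.3, 0.9666...)
```

Taking the median across builds rather than the best build keeps a single lucky graph construction from flattering either method.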