# zvec empty-id retrieval bug This note is an upstream-ready issue draft for a `zvec` retrieval defect that showed up during GraphRAG parity work in `pg_sorted_heap`. The short version: - ANN scores still come back - returned `doc.id` values become empty strings - the failure reproduces on both: - a real-text Gutenberg GraphRAG corpus - a plain synthetic FP32 corpus So this does **not** look like a PostgreSQL expansion/rerank bug. Environment used for the verified local repro: - `zvec 0.2.0` - Python package at: - `/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/zvec` - native extension: - `/opt/homebrew/Caskroom/miniconda/base/lib/python3.12/site-packages/_zvec.cpython-312-darwin.so` ## Minimal synthetic reproducer Repo-owned script: - [`scripts/repro_zvec_synthetic_threshold.py`](../scripts/repro_zvec_synthetic_threshold.py) Command: ```bash python3 scripts/repro_zvec_synthetic_threshold.py --rows 4900,4950,5000 --query-count 5 ``` Current verified output shape: ```text SYNTH_THRESH|rows=4900|status=ok|first_bad_query=None|sample=None SYNTH_THRESH|rows=4950|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', ''] SYNTH_THRESH|rows=5000|status=bad|first_bad_query=1|sample=['', '', '', '', '', '', ''] ``` stderr also reports: ```text Failed to find target chunk for index 4945 ``` Current minimal signature: - `dim=32` - `ef_search=64` - `topk=7` - `rows=4950` Neighbor controls: - `rows=4900`, same params: `ok` - `rows=4950`, `topk<=6`: `ok` - `rows=4950`, `topk>=7`: `bad` The compact repro also survives simple runtime knob changes: - `memory_limit_mb=8192`: `bad` - `memory_limit_mb=1024`: `bad` - `memory_limit_mb=256`: `bad` - `query_threads=1`, `optimize_threads=1`: `bad` - `query_threads=2`, `optimize_threads=2`: `bad` - `query_threads=4`, `optimize_threads=4`: `bad` So the current minimal case does not look like a trivial thread-count or memory-budget artifact. It also survives broad HNSW parameter changes on the same compact case: - `ef_search=16`, `ef_construction=16`, `m=8`: `bad` - `ef_search=16`, `ef_construction=64`, `m=16`: `bad` - `ef_search=32`, `ef_construction=64`, `m=16`: `bad` - `ef_search=64`, `ef_construction=64`, `m=16`: `bad` - `ef_search=128`, `ef_construction=64`, `m=16`: `bad` - `ef_search=64`, `ef_construction=128`, `m=16`: `bad` - `ef_search=64`, `ef_construction=64`, `m=8`: `bad` - `ef_search=64`, `ef_construction=64`, `m=32`: `bad` So the compact failing case does not look like a fragile HNSW tuning artifact either. ## Stronger diagnostics On the compact synthetic case: - `rows=4950`, `topk=6` - valid ids come back - `rows=4950`, `topk=7` - scores still come back - every `doc.id` becomes `''` That means the failure is not "query returns nothing". Ranking still appears to produce plausible scores, but document metadata resolution fails. Representative observation: ```text CASE 4950 6 [('1600', 0.12408530712127686), ..., ('2946', 0.14136314392089844)] CASE 4950 7 [('', 0.12408530712127686), ..., ('', 0.14136314392089844)] ``` The bug is also non-monotonic by row count. Verified examples: - bad: `4950`, `5000`, `7500`, `7900`, `16000`, `28000`, `30000`, `45000`, `60000` - ok: `4900`, `7000`, `7800`, `24000`, `75000` So this is not a simple "after N rows everything breaks" threshold. ## Real-text corroboration Repo-owned script: - [`scripts/repro_zvec_gutenberg_threshold.py`](../scripts/repro_zvec_gutenberg_threshold.py) Current verified Gutenberg signature: - `dim=32` - `topk=16` - `ef_search=64` - `64x256`, `80x256`, `96x256`, `112x256` slices: `ok` - `128x256` (`58,954` rows): `bad` Observed failure: ```text Failed to find target chunk for index 58379 ``` Returned ids are empty strings / unmapped ids for the first bad probe. ## Additional context One larger synthetic case gives another useful hint: - `rows=16000` - exact cosine inspection shows the best-score bucket spans `1000, 2000, ..., 16000` - `zvec` already returns empty ids at `topk=5` This does not prove the internal root cause, but it suggests the failure may depend on candidate-materialization / metadata-fetch paths rather than on the ANN score computation itself. The shipped native binary also contains the exact failing message plus nearby storage component paths: ```text /Users/cuiys/workspace/zvec/src/db/index/storage/mmap_forward_store.cc /Users/cuiys/workspace/zvec/src/db/index/storage/bufferpool_forward_store.cc Failed to find target chunk for index %d Encountered empty chunk at index %d ``` That does not prove which codepath is at fault, but it makes the working component hypothesis narrower: the failure is plausibly in forward-store chunk lookup or chunk materialization rather than in HNSW distance ranking itself. Debug file logging on the compact synthetic case adds one more strong clue. With `log_level=DEBUG`, `query_threads=1`, and the same `4950 / topk=7` repro, `zvec` reports: ```text Opened IPC with 4950 rows, 2 cols, 2 chunks, is_fixed_batch_size[1] fixed_batch_size[2432] ... Failed to find target chunk for index 4945 ... Record batch: _zvec_uid_: [ null, null, null, null, null, null, null ] ``` So the strongest current reading is: - the forward store opens successfully - the query reaches the recall/output stage - metadata resolution for `_zvec_uid_` fails and the returned batch carries null ids even though scores are still present ## Why this matters For the `pg_sorted_heap` GraphRAG benchmark harness, this bug currently blocks a clean large-slice `zvec` parity row. PostgreSQL-side expansion+rereank remains stable; the unstable stage is the `zvec` ANN seed retrieval itself. ## Status Verified locally with repo-owned reproducers. No claim yet about the exact internal root cause inside `zvec`.