---
layout: default
title: Vector Search
nav_order: 3
---

# Vector Search

pg_sorted_heap includes two built-in vector types, a planner-integrated `sorted_hnsw` Index AM, and legacy/manual ANN paths (`svec_ann_scan`, `svec_hnsw_scan`). The default vector story is now `CREATE INDEX ... USING sorted_hnsw`; the older IVF-PQ and sidecar HNSW APIs remain available when you want explicit control over storage or rerank behavior.

Release guidance:

- **Stable:** `sorted_hnsw` on `svec` and `hsvec`
- **Legacy/manual:** `svec_ann_scan`, `svec_ann_search`, `svec_hnsw_scan`

---

## Vector types

| Type | Precision | Bytes/dim | Max dimensions | Use case |
|------|-----------|-----------|----------------|----------|
| `svec` | float32 | 4 | 16,000 | Full precision, training codebooks |
| `hsvec` | float16 | 2 | 32,000 | Storage-optimized, large embeddings |

Both types support the `<=>` cosine distance operator (returns 1 − cosine similarity, range [0, 2]). Distance is accumulated in `float8` (64-bit) for precision.

`hsvec` casts to `svec` implicitly — all PQ/IVF functions accept both types without code changes. `svec` casts to `hsvec` via explicit or assignment cast (lossy, float32 → float16).

**Dimension envelope:** pgvector's dense `vector` type is limited to 2,000 dimensions, while `halfvec` extends that to 4,000. `svec` supports up to 16K dims and `hsvec` up to 32K dims, so pg_sorted_heap still has a larger native ANN/storage envelope for very high-dimensional embeddings.
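The `<=>` semantics above can be sketched in plain Python — an illustration of the formula, not the extension's C implementation:

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity, range [0, 2], matching the documented
    `<=>` contract. Sums run in Python floats (64-bit), mirroring the
    float8 accumulation described above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Orthogonal vectors sit at distance 1.0; opposite vectors at 2.0.
```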
```sql
-- svec: float32, up to 16,000 dimensions
CREATE TABLE items (
    id text PRIMARY KEY,
    embedding pg_sorted_heap.svec(768)
);

-- hsvec: float16, up to 32,000 dimensions
CREATE TABLE items_compact (
    id text PRIMARY KEY,
    embedding pg_sorted_heap.hsvec(768)
);

-- Insert with bracket notation (same for both types)
INSERT INTO items VALUES ('doc1', '[0.1, 0.2, 0.3, ...]');

-- Cosine distance operator works on both types
SELECT a.id, a.embedding <=> b.embedding AS distance
FROM items a, items b
WHERE a.id = 'doc1' AND b.id = 'doc2';

-- hsvec casts to svec implicitly for PQ/IVF functions
SELECT svec_cosine_distance('[1,0,0]'::hsvec, '[0,1,0]'::hsvec);
```

---

## Current default: `sorted_hnsw`

For new deployments, prefer the Index AM. It supports both `svec` and `hsvec` source columns; use `hsvec` when you want the heap/TOAST footprint to stay close to pgvector `halfvec`, with float32 used only as internal scratch during build/search/rerank.

```sql
CREATE TABLE items (
    id bigserial PRIMARY KEY,
    embedding pg_sorted_heap.svec(384),
    body text
);

CREATE INDEX items_embedding_idx ON items
    USING sorted_hnsw (embedding)
    WITH (m = 16, ef_construction = 200);

SET sorted_hnsw.shared_cache = on;
SET sorted_hnsw.ef_search = 96;

SELECT id, body
FROM items
ORDER BY embedding <=> '[0.1,0.2,0.3,...]'::pg_sorted_heap.svec
LIMIT 10;
```

On constrained builders, the current low-memory build knob is:

```sql
SET sorted_hnsw.build_sq8 = on;
```

That makes `CREATE INDEX ... USING sorted_hnsw` build the graph from SQ8-compressed build vectors instead of a full float32 build slab. The tradeoff is one extra heap scan during build and a possible graph-quality loss on some corpora, but on the current local `1M x 64D` multidepth GraphRAG point it preserved quality and slightly improved build time.
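Conceptually, SQ8 scalar quantization stores one uint8 per dimension plus a per-dimension offset and scale. A minimal numpy sketch of the general idea (illustrative only — not the extension's internal layout):

```python
import numpy as np

def sq8_quantize(vecs):
    """Map each dimension of a float32 matrix to uint8 with a
    per-dimension (lo, scale) pair: 4x smaller than float32."""
    lo = vecs.min(axis=0)
    hi = vecs.max(axis=0)
    scale = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    codes = np.round((vecs - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def sq8_dequantize(codes, lo, scale):
    """Approximate reconstruction; error per dimension is at most
    half of that dimension's quantization step."""
    return codes.astype(np.float32) * scale + lo
```

The per-dimension step bounds the reconstruction error, which is why graph navigation over SQ8 codes usually stays close to float32 quality while a final exact rerank (or the index AM's internal rerank) absorbs the residual error.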
Compact-storage variant:

```sql
CREATE TABLE items_compact (
    id bigserial PRIMARY KEY,
    embedding pg_sorted_heap.hsvec(384),
    body text
);

CREATE INDEX items_compact_embedding_idx ON items_compact
    USING sorted_hnsw (embedding hsvec_cosine_ops)
    WITH (m = 16, ef_construction = 200);
```

This path is planner-integrated and exact-reranks internally. There is no sidecar prefix argument and no manual `rerank_topk` in the index-scan path.

Current ordered-scan contract:

- Automatic `sorted_hnsw` planning applies to base-relation `ORDER BY embedding <=> query LIMIT k` queries.
- The planner does not use the current Phase 1 path when there is no `LIMIT`, when `LIMIT > sorted_hnsw.ef_search`, or when extra base-table quals would make the index under-return candidates.
- `sorted_hnsw.shared_cache` is most useful when `shared_preload_libraries = 'pg_sorted_heap'`; otherwise scans fall back to backend-local cache builds.

For filtered retrieval or expansion workflows, materialize/filter first or use the GraphRAG helper API instead of expecting the ordered index scan to serve as a general filtered ANN primitive. The remaining filtered-ANN contracts are tracked in [Filtered ANN Contracts](spec-filtered-ann).

For declarative partitioned tables, `sorted_hnsw_partition_search(...)` provides an explicit route-first helper. Selected leaves run local `sorted_hnsw` scans, their candidate pools are unioned, and the final result is globally reranked by exact distance. Use this when tenant/time/segment routing maps to whole partitions instead of executor filters inside one leaf.

Benchmark the route-first contract with:

```bash
make bench-partitioned-sorted-hnsw
```

On the local PostgreSQL 18 default run (`8 x 50K` rows, self-query top-10), selected-leaf routing measured `5.359 ms` average at `100.0%` recall@10 versus `8.849 ms` for the parent filtered exact query. The same run showed all-leaf fanout around `23-25 ms`.
The script now also reports `direct_leaf_index`: `2.942 ms` on the same run, which quantifies the current PL/pgSQL wrapper overhead and is the main signal for a future C fanout helper. The promotion criteria for that work are tracked in [Partitioned HNSW C Helper Gate](spec-partitioned-hnsw-c-helper).

---

## Legacy/manual IVF-PQ quick start

### 1. Create table

The table uses `partition_id` (the IVF cluster assignment) as the leading PK column. This makes sorted_heap physically cluster rows by IVF partition — the zone map then skips irrelevant partitions at the I/O level.

```sql
CREATE TABLE vectors (
    id text,
    partition_id int2 GENERATED ALWAYS AS (
        pg_sorted_heap.svec_ivf_assign(embedding, 1)) STORED,
    embedding pg_sorted_heap.svec(768),
    pq_code bytea GENERATED ALWAYS AS (
        pg_sorted_heap.svec_pq_encode_residual(
            embedding,
            pg_sorted_heap.svec_ivf_assign(embedding, 1),
            2, 1)) STORED,
    PRIMARY KEY (partition_id, id)
) USING sorted_heap;
```

The `pq_code` column stores M-byte Product Quantization codes. Both generated columns are computed automatically — you only INSERT `id` and `embedding`.

### 2. Train codebooks

> **Permissions:** Training creates internal metadata tables in the extension
> schema. The calling role needs `CREATE` privilege on that schema (or must be
> the extension owner / a superuser).
> For non-superuser roles:
> `GRANT CREATE ON SCHEMA <schema> TO <role>;`

```sql
-- Train IVF centroids (nlist partitions) + PQ codebook (M subvectors)
SELECT * FROM pg_sorted_heap.svec_ann_train(
    'SELECT embedding FROM vectors',
    nlist := 64,   -- number of IVF partitions
    m := 192       -- PQ subvectors (768/192 = 4-dim each)
);
-- Returns: ivf_cb_id=1, pq_cb_id=1

-- For higher recall, train residual PQ (trains on vec - centroid residuals)
SELECT pg_sorted_heap.svec_pq_train_residual(
    'SELECT embedding FROM vectors',
    m := 192,
    ivf_cb_id := 1);
-- Returns: pq_cb_id=2
```

After training, compact the table so rows re-cluster by their new `partition_id`:

```sql
SELECT pg_sorted_heap.sorted_heap_compact('vectors');
```

### 3. Search

```sql
-- PQ-only (fastest): ~8 ms, R@1 79% cross-query / 100% self-query
SELECT * FROM pg_sorted_heap.svec_ann_scan(
    'vectors', query_vec,
    nprobe := 3, lim := 10,
    cb_id := 2, ivf_cb_id := 1);

-- With reranking (higher recall): ~22 ms, R@1 97%
SELECT * FROM pg_sorted_heap.svec_ann_scan(
    'vectors', query_vec,
    nprobe := 10, lim := 10, rerank_topk := 200,
    cb_id := 2, ivf_cb_id := 1);
```

---

## How IVF-PQ works

```
query vector
  │
  ├─ IVF probe: find nearest nprobe centroids
  │    → partition_id IN (3, 17, 42, ...)
  │
  ├─ PQ ADC: for each candidate row, sum M precomputed distances
  │    → O(M) per row using M-byte PQ code (no TOAST decompression)
  │
  ├─ Top-K: max-heap selects best candidates
  │
  └─ Optional rerank: exact cosine on top candidates → return top-K
```

Physical clustering by `(partition_id, id)` means the IVF probe translates directly to a small set of physical block ranges — sorted_heap's zone map skips all other partitions at the I/O level.

### Residual PQ

Standard PQ encodes vectors directly. **Residual PQ** encodes the residual `(vector − IVF centroid)` instead. This removes inter-centroid variance so PQ focuses on fine intra-cluster distinctions, improving recall at no storage cost.
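The encode side of residual PQ can be sketched in a few lines of numpy — illustrative only; `centroids` and `codebooks` stand in for the trained IVF centroids and per-subvector PQ codebooks, and the array shapes are an assumption, not the extension's storage format:

```python
import numpy as np

def encode_residual(vec, centroids, codebooks):
    """Residual PQ encode: assign to the nearest IVF centroid, then
    PQ-encode (vec - centroid) subvector by subvector.

    centroids: (nlist, dim) IVF centroids
    codebooks: (M, k, dsub)  one codebook per subvector
    Returns (centroid_id, uint8 code of length M)."""
    cid = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
    residual = vec - centroids[cid]
    m = len(codebooks)
    dsub = vec.shape[0] // m
    code = np.empty(m, dtype=np.uint8)
    for i in range(m):
        sub = residual[i * dsub:(i + 1) * dsub]
        code[i] = np.argmin(np.linalg.norm(codebooks[i] - sub, axis=1))
    return cid, code
```

Because each subvector is matched against a codebook of the residual (not the raw vector), the codebooks only need to cover intra-cluster variation — the source of the recall gain described above.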
The trade-off: residual PQ requires computing a separate distance table per probed centroid (vs one global table for standard PQ), roughly doubling the PQ-only latency. With reranking, the difference is negligible.

Use residual PQ by passing `ivf_cb_id` to `svec_ann_scan`:

```sql
-- cb_id=2 is the residual PQ codebook, ivf_cb_id=1 is the IVF codebook
SELECT * FROM pg_sorted_heap.svec_ann_scan(
    'vectors', query_vec,
    nprobe := 3, lim := 10,
    cb_id := 2, ivf_cb_id := 1);
```

---

## Tuning guide

### nprobe and rerank_topk

| Use case | nprobe | rerank | Latency | R@1 | Recall@10 |
|---|---|---|---|---|---|
| Lowest latency | 1 | 0 | 5.5 ms | 54% | 48% |
| Self-query RAG | 3 | 0 | 8 ms | 100%* | 71% |
| Balanced | 5 | 96 | 12 ms | 89–93% | 86–93% |
| High quality | 10 | 200 | 22 ms | 97–99% | 94–99% |

\* Self-query R@1 = 100% because the query vector is in the dataset. Cross-query R@1 at nprobe=3 is 79%. The recall ranges reflect different dataset sizes (10K–103K vectors).

**Guidelines:**

- Start with **nprobe=3, no rerank** for RAG workloads (searching your own corpus).
- Add **rerank=96** if you need cross-query accuracy (query not in corpus).
- Increase **nprobe** for higher recall at the cost of latency.
- **nprobe × (rows/nlist)** ≈ number of PQ codes scanned per query.

### nlist and M

| Parameter | Effect |
|---|---|
| nlist (IVF partitions) | More partitions = more precise routing but smaller clusters. 64–256 typical. |
| M (PQ subvectors) | More subvectors = higher PQ fidelity. dim/M = subvector dimension (4–16 typical). |

Rule of thumb: `nlist ≈ sqrt(N)` where N is dataset size. `M = dim/4` gives 4-dimensional subvectors — a good balance of fidelity and code size.

---

## Benchmarks

The tables below mix the current stable Index AM path with the older legacy/manual ANN paths. Use the `sorted_hnsw` rows as the release default; use the IVF-PQ and sidecar rows only when you explicitly want those manual trade-offs.
For future very-large tables, IVF-PQ / SQ-family revival is tracked as a benchmark-gated scale feature, not as a default replacement. See [Large-Vector Sublinear Search](spec-large-vector-sublinear). SIMD ADC optimization and the pgvectorscale StreamingDiskANN comparison are tracked separately in [SIMD ADC And DiskANN Comparison](spec-simd-adc-diskann).

### 103K vectors, 2880-dim (Gutenberg corpus)

Residual PQ (M=720, dsub=4), 256 IVF partitions. 1 Gi k8s pod, PostgreSQL 18. 100 cross-queries (self-match excluded):

| Config | R@1 | Recall@10 | Avg latency |
|---|---|---|---|
| nprobe=1, PQ-only | 54% | 48% | 5.5 ms |
| nprobe=3, PQ-only | 79% | 71% | 8 ms |
| nprobe=3, rerank=96 | 82% | 74% | 10 ms |
| nprobe=5, rerank=96 | 89% | 86% | 12 ms |
| nprobe=10, rerank=200 | 97% | 94% | 22 ms |

### 10K vectors, 2880-dim (float32 precision test)

Same corpus, pure svec (float32), nlist=64, M=720 residual PQ:

| Config | R@1 | Recall@10 |
|---|---|---|
| nprobe=1, PQ-only | 56% | 56% |
| nprobe=3, PQ-only | 72% | 82% |
| nprobe=5, rerank=96 | 93% | 93% |
| nprobe=10, rerank=200 | **99%** | **99%** |

### float32 vs float16 precision

Tested the same 10K Gutenberg vectors stored as float32 (svec) vs float16-degraded (svec → hsvec → svec roundtrip). Both trained independently with identical parameters. **No measurable recall difference** — the float16 precision loss (~1e-7) is 1000× smaller than typical distance gaps between neighbors (~1e-4). Precision is not the recall bottleneck; PQ quantization and IVF routing are. This confirms hsvec is a safe storage choice for ANN.
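You can eyeball the same effect locally with numpy (assumed available): roundtrip two unit vectors through float16 and compare cosine distances. Exact magnitudes depend on the vectors, but the perturbation lands orders of magnitude below the ~1e-4 neighbor gaps cited above:

```python
import numpy as np

rng = np.random.default_rng(42)

def unit(x):
    return x / np.linalg.norm(x)

def cos_dist(a, b):
    # float64 accumulation, mirroring the float8 accumulator used by <=>
    return 1.0 - float(np.dot(a.astype(np.float64), b.astype(np.float64)))

v = unit(rng.standard_normal(2880).astype(np.float32))
w = unit(rng.standard_normal(2880).astype(np.float32))

exact = cos_dist(v, w)
roundtrip = cos_dist(v.astype(np.float16).astype(np.float32),
                     w.astype(np.float16).astype(np.float32))
err = abs(exact - roundtrip)  # tiny relative to inter-neighbor distance gaps
```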
### Comparison with other vector search engines

Current repo-owned harnesses:

- `python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473`
- `REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh /path/to/repo /path/to/dump 65485`
- `scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96`
- `make bench-partitioned-sorted-hnsw`
- `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64`
- `python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`
- `python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`

AWS restored Gutenberg dump (`~104K x 2880D`, top-10, exact heap GT on the restored `svec` table). Host: AWS ARM64, 4 vCPU, 8 GiB RAM. In the current rerun the stored `bench_hnsw_gt` table matched the recomputed exact GT on 100% of the 50 benchmark queries after restore, so the fresh exact heap GT and the historical GT table agree. This rerun uses `sorted_hnsw` `ef_construction=200` and `ef_search=32`, and the benchmark harness reconnects after build before timing ordered scans.
| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 458.762 ms | 100.0% | brute-force GT on restored corpus |
| **`sorted_hnsw` (`svec`)** | **1.287 ms** | **100.0%** | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1902 MB |
| `sorted_hnsw` (`hsvec`) | 1.404 ms | 100.0% | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1032 MB |
| pgvector HNSW (`halfvec`) | 2.031 ms | 99.8% | `ef_search=64`, index 804 MB, total 1615 MB |
| zvec HNSW | 50.499 ms | 100.0% | in-process collection, `ef=64`, ~1.12 GiB on disk |
| Qdrant HNSW | 6.028 ms | 99.2% | local Docker on same AWS host, `hnsw_ef=64`, 103,260 points |

The precision-matched PostgreSQL comparison on Gutenberg is `sorted_hnsw (hsvec)` vs pgvector `halfvec`: `1.404 ms @ 100.0%` versus `2.031 ms @ 99.8%`, with total footprint `1032 MB` versus `1615 MB`. The raw fastest PostgreSQL row on this corpus is still `sorted_hnsw (svec)` at `1.287 ms`, but that uses float32 source storage. The `sorted_hnsw` index stays 404 MB in both cases because it stores SQ8 graph state; the size win from `hsvec` appears in the source table and TOAST footprint (`1902 MB -> 1032 MB`), not in the index.

Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median `p50` / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.
| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 2.03 ms | 100% | Brute-force ground truth |
| **sorted_hnsw** | **0.158 ms** | **100%** | `shared_cache=on`, `ef_search=96`, index ~5.4 MB |
| pgvector HNSW (`vector`) | 0.446 ms | 90% median (90–95 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~2.0 MB |
| zvec HNSW | 0.611 ms | 100% | local in-process collection, `ef=64` |
| Qdrant HNSW | 1.94 ms | 100% | local Docker, `hnsw_ef=64` |

Real-dataset sample (`nytimes-256-angular`, sampled 10K x 256D, top-10). The table below reports medians across 3 full harness runs. Ground truth is exact heap search inside PostgreSQL on the sampled corpus.

| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 1.557 ms | 100% | ground truth |
| **sorted_hnsw** | **0.327 ms** | **85.0% median** (83.5–85.5 range) | `shared_cache=on`, `ef_search=96`, index ~4.1 MB |
| pgvector HNSW (`vector`) | 0.751 ms | 79.0% median (78.5–79.0 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~13 MB |
| zvec HNSW | 0.403 ms | 99.5% | local in-process collection, `ef=64`, ~14.1 MB on disk |
| Qdrant HNSW | 1.704 ms | 99.5% | local Docker, `hnsw_ef=64` |

This dataset is far harsher than the deterministic synthetic corpus. Use the synthetic table for controlled regression tracking and the `nytimes-256` sample when you want a better read on fixed-parameter recall.

See [Sidecar HNSW search](#sidecar-hnsw-search-legacy-svec_hnsw_scan) below for the legacy/manual `svec_hnsw_scan` path and its cache-mode tradeoffs.

### Self-query vs cross-query

**Self-query**: the query vector exists in the dataset. This is the common RAG case — you embedded documents, now you search them. R@1 is 100% because the query is trivially its own closest neighbor.

**Cross-query**: the query vector is NOT in the dataset (e.g., a user's question embedded at search time).
R@1 depends on nprobe and PQ fidelity.

### Reproducible benchmark

```sql
-- Pick 100 random queries, compute ground truth and ANN recall
WITH queries AS (
    SELECT id AS qid, embedding AS qvec
    FROM your_table ORDER BY random() LIMIT 100
),
ground_truth AS (
    SELECT q.qid,
           array_agg(t.id ORDER BY t.embedding <=> q.qvec) AS gt
    FROM queries q
    CROSS JOIN LATERAL (
        SELECT id, embedding FROM your_table
        WHERE id != q.qid
        ORDER BY embedding <=> q.qvec LIMIT 10
    ) t
    GROUP BY q.qid
),
ann_results AS (
    SELECT q.qid,
           (array_agg(a.id ORDER BY a.distance))[2:11] AS ann
    FROM queries q
    CROSS JOIN LATERAL pg_sorted_heap.svec_ann_scan(
        'your_table', q.qvec,
        nprobe := 3, lim := 11,
        cb_id := 2, ivf_cb_id := 1) a
    GROUP BY q.qid
)
SELECT
    round((avg(CASE WHEN gt.gt[1] = ar.ann[1] THEN 1.0 ELSE 0.0 END) * 100)::numeric, 1) AS "R@1",
    round((avg((SELECT count(*)::numeric FROM unnest(ar.ann) x
                WHERE x = ANY(gt.gt)) / 10.0) * 100)::numeric, 1) AS "Recall@10"
FROM ground_truth gt
JOIN ann_results ar ON gt.qid = ar.qid;
```

Note: `lim := 11` and `[2:11]` skip the self-match (position 1) for cross-query evaluation. For self-query benchmarks, use `lim := 10` without slicing.

For the local synthetic `bench_nomic` setup used during graph/IVF tuning, use `scripts/bench_nomic_local_ann.py` to reproduce exact ground truth, `svec_graph_scan`, and `svec_ann_scan` latency/recall curves from one command. The reproducible Make targets are:

```bash
make build-graph-bench-nomic \
    VECTOR_BENCH_DSN='host=/tmp port=65432 dbname=bench_nomic'
make bench-nomic-ann \
    VECTOR_BENCH_DSN='host=/tmp port=65432 dbname=bench_nomic'
```

Local graph/ANN tooling expects the Python packages listed in `scripts/requirements-vector-tools.txt`. CI installs that file directly; for local runs, `scripts/find_vector_python.sh` resolves a Python that can import the same dependency set.

To rebuild the graph sidecar used by `svec_graph_scan`, use `scripts/build_graph.py`.
The committed workflow for the local `bench_nomic` setup is:

```bash
"$(./scripts/find_vector_python.sh)" scripts/build_graph.py \
    --dsn 'host=/tmp port=65432 dbname=bench_nomic' \
    --table bench_nomic_8k \
    --graph-table bench_nomic_graph \
    --entry-table bench_nomic_graph_entries \
    --bootstrap \
    --sketch-dim 384 \
    --M 32 \
    --M-max 64 \
    --n-adjacent 4 \
    --no-prune \
    --seed 42
```

Then benchmark the rebuilt graph against exact and IVF baselines:

```bash
"$(./scripts/find_vector_python.sh)" scripts/bench_nomic_local_ann.py \
    --dsn 'host=/tmp port=65432 dbname=bench_nomic' \
    --graph-table bench_nomic_graph \
    --entry-table bench_nomic_graph_entries \
    --query-limit 20 \
    --graph-efs 128,256,512,1024 \
    --ivf-nprobes 40 \
    --warmup 1
```

Builder notes:

- `--bootstrap` reads directly from the main table and derives `src_tid` from `ctid`.
- Rebuild mode rejoins on `src_tid`/`ctid`, not `id`, so it stays correct for `(partition_id, id)` primary keys where `id` is not globally unique.
- `M`, `M-max`, and `n-adjacent` change graph topology; re-run the harness after each build rather than carrying over numbers from an older graph.

**pg_dump / pg_restore limitation:** HNSW sidecar tables store `src_tid` (a physical heap tuple pointer), which changes after `pg_restore` because COPY rewrites all heap pages with new TIDs. After restore, the sidecar's `src_tid` values point to wrong or nonexistent heap tuples, silently degrading recall (observed: recall fell to 88% after restore and returned to 99.8% after rebuilding on the same data). **Always rebuild the HNSW sidecar after `pg_dump`/`pg_restore`.**

This limitation does **not** affect `sorted_hnsw`, which uses PostgreSQL's normal index infrastructure instead of sidecar `src_tid` joins.

---

## Sidecar HNSW search (legacy: `svec_hnsw_scan`)

`svec_hnsw_scan` performs hierarchical HNSW search using compact sidecar tables. The latency/recall tables in this section describe that legacy/manual path, not the current `sorted_hnsw` Index AM baseline.
The L0 column type controls the recall/memory tradeoff:

| L0 column | Cache mode | Cache size (103K) | Recall@10 (ef=96) | p50 |
|-----------|------------|:------------------:|:-----------------:|:---:|
| `hsvec(384)` | float16 sketch | ~75 MB | 97% | 0.7ms |
| `svec(D)` | SQ8 quantized (default) | ~D/4 × N | 98.4% | 1.3ms |
| `svec(D)` + `sq8=off` | float32 full | ~D×4 × N | 99.6% | 1.5ms |

The cache auto-detects the L0 column type (svec vs hsvec) at build time. For svec columns, SQ8 scalar quantization (uint8 per dimension) is applied automatically for 4x memory savings, controlled by the `sorted_heap.hnsw_cache_sq8` GUC (default on).

### Table requirements

```
{prefix}_meta    — entry_nid int4, max_level int2
{prefix}_l0      — nid int4 PK, sketch hsvec(N)|svec(D), neighbors int4[], src_id text, src_tid tid
{prefix}_l1..lN  — nid int4 PK, sketch hsvec(N), neighbors int4[]
```

Upper levels always use hsvec sketches. Only L0 supports svec for hybrid mode.

### Building the graph

```bash
# Sketch-only L0 (fastest, smallest cache, ~97% recall)
python scripts/build_hnsw_graph.py \
    --dsn 'host=... dbname=...' \
    --source-table my_graph_table \
    --prefix my_hnsw \
    --M 16 --ef-construction 200

# Hybrid L0: full vectors in L0, sketches in upper levels (~99%+ recall)
python scripts/build_hnsw_graph.py \
    --dsn 'host=... dbname=...' \
    --source-table my_graph_table \
    --prefix my_hnsw \
    --M 16 --ef-construction 200 \
    --full-vectors --main-table my_sorted_heap_table

# Truncated L0: first 768 dims only (for MRL/Matryoshka embeddings)
python scripts/build_hnsw_graph.py \
    --dsn 'host=... dbname=...' \
    --source-table my_graph_table \
    --prefix my_hnsw \
    --M 16 --ef-construction 200 \
    --full-vectors --main-table my_sorted_heap_table --l0-dim 768
```

### Calling the function

```sql
SELECT * FROM svec_hnsw_scan(
    tbl := 'my_table'::regclass,
    query := '[0.1, 0.2, ...]'::svec,
    prefix := 'my_table_hnsw',
    ef_search := 96,    -- beam width for L0 traversal
    lim := 10,          -- results to return
    rerank_topk := 48,  -- candidates to exact-rerank (see below)
    rerank1_topk := 0   -- dense r1 pre-filter (0 = disabled)
);
```

Enable the session-local cache for best latency (built once per session):

```sql
SET sorted_heap.hnsw_cache_l0 = on;

-- SQ8 quantization (default on, 4x memory saving for svec L0):
SET sorted_heap.hnsw_cache_sq8 = on;  -- default

-- To disable SQ8 for maximum recall without rerank:
SET sorted_heap.hnsw_cache_sq8 = off;
```

### `rerank_topk` semantics

`rerank_topk` controls how many L0 candidates are passed to exact svec cosine rerank. Exact rerank always runs when the L0 table has a `src_tid` column (which `build_hnsw_graph.py` always adds).

| `rerank_topk` value | Candidates reranked | Effect |
|---|---|---|
| `0` (default) | all `ef_search` | No truncation. Highest recall, `ef_search` TOAST reads. |
| `0 < rk < ef_search` | `rk` | Truncates before rerank. Fewer TOAST reads, lower recall. |
| `rk >= ef_search` | all `ef_search` | No effect (same as 0). |

**`rerank_topk=0` does NOT skip exact rerank.** It means "rerank all candidates". To return results by sketch distance only (skipping TOAST reads entirely), the L0 table must omit the `src_tid` column — this is not the default build.
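The truncation rule in the table above boils down to a single clamp; a sketch of the contract (illustrative, not the C code):

```python
def candidates_to_rerank(ef_search: int, rerank_topk: int) -> int:
    """How many L0 candidates reach exact svec cosine rerank,
    per the rerank_topk semantics documented above."""
    if rerank_topk == 0 or rerank_topk >= ef_search:
        return ef_search   # 0 means "rerank all", not "skip rerank"
    return rerank_topk     # truncate before the exact-rerank stage
```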
### Recommended operating points (103K × 2880-dim, k8s 2 Gi pod)

**hsvec(384) sketch L0:**

| Goal | ef_search | rerank_topk | p50 latency | Recall@10 |
|---|---|---|---|---|
| Balanced | 96 | 48 | 1.02ms | 96.8% |

**svec(D) hybrid L0 (SQ8 cache, default):**

| Goal | ef_search | lim | rerank_topk | p50 latency | Recall@10 |
|---|---|---|---|---|---|
| Fastest top-1 | 32 | 1 | 1 | 0.51ms | — |
| Fast top-5 | 64 | 5 | 5 | 0.87ms | 98.8% |
| Fast top-10 | 96 | 10 | 10 | 1.25ms | 98.6% |
| Balanced top-10 | 96 | 10 | 20 | 1.35ms | 99.8% |
| Safe top-10 | 96 | 10 | 48 | 1.64ms | 99.8% |
| Rerank-all | 96 | 10 | 0 | 6.94ms | 99.8% |

**Tuning `rerank_topk` for lowest latency:** set `rerank_topk = max(lim, 20)` for 99.8% recall with minimal TOAST reads. Each TOAST read fetches one full svec(D) row (~11.5 KB for 2880-dim), so fewer reads = lower latency. The SQ8 cache navigates accurately enough that reranking just 20 candidates already achieves 99.8% recall — no need for 48 or more.

SQ8 quantizes float32 → uint8 per dimension in the session-local cache (4x memory savings). The streaming two-pass build avoids allocating a float32 intermediate buffer, so peak memory is just the SQ8 cache itself (283 MB for 103K × 2880-dim). This runs comfortably on 2 Gi pods. Set `sorted_heap.hnsw_cache_sq8 = off` only when memory is abundant and you need zero-rerank operation.

Measured with `shared_buffers=512MB` (2 Gi pod), warm cache, 50 queries. Requires `sorted_heap.hnsw_cache_l0 = on`. Cold first-call latency is 2–3x higher due to TOAST page faults and cache build.

### Dense r1 pre-filter (`rerank1_topk`)

An optional intermediate stage using a `{prefix}_r1 (nid int4 PK, rerank_vec hsvec(768))` sidecar. Set `rerank1_topk > 0` to enable. The r1 stage scores all `ef_search` candidates via hsvec(768) cosine, keeps the closest `max(rerank1_topk, lim)`, then passes those to exact svec rerank.
**On a warm TOAST pool, r1 provides marginal benefit.** At ef=64, r1=24 saves ~0.3 ms but costs ~0.12 recall (9.74 → 9.62). At ef ≥ 96 the r1 btree overhead exceeds the TOAST savings. r1 is most useful in cold-TOAST scenarios (the first query of a session, or very large datasets where TOAST pages don't fit in shared_buffers). If `{prefix}_r1` does not exist, the stage is silently skipped.

---

## API reference

### Training

| Function | Description |
|---|---|
| `svec_ann_train(query, nlist, m)` | Train IVF + PQ codebooks in one call |
| `svec_ivf_train(query, nlist)` | Train IVF centroids only |
| `svec_pq_train(query, m)` | Train raw PQ codebook |
| `svec_pq_train_residual(query, m, ivf_cb_id)` | Train residual PQ codebook |

### Encoding

| Function | Description |
|---|---|
| `svec_ivf_assign(vec, cb_id)` | Assign vector to nearest IVF centroid → int2 |
| `svec_pq_encode(vec, cb_id)` | Encode vector as PQ code → bytea |
| `svec_pq_encode_residual(vec, centroid_id, pq_cb_id, ivf_cb_id)` | Encode residual as PQ code → bytea |

### Search

| Function | Description |
|---|---|
| `svec_hnsw_scan(tbl, query, prefix, ef_search, lim, rerank_topk, rerank1_topk)` | Hierarchical HNSW via sidecar tables (sub-ms with cache) |
| `svec_graph_scan(tbl, query, graph_tbl, entries_tbl, ef_search, lim, rerank_topk)` | Flat NSW graph search |
| `svec_ann_scan(tbl, query, nprobe, lim, rerank_topk, cb_id, ivf_cb_id, pq_column)` | C-level IVF-PQ scan |
| `svec_ann_search(tbl, query, nprobe, lim, rerank_topk, cb_id)` | SQL-level IVF-PQ search |
| `svec_ivf_probe(vec, nprobe, cb_id)` | Return nearest nprobe centroid IDs |

### Low-level distance

| Function | Description |
|---|---|
| `svec_pq_distance_table(vec, cb_id)` | Precompute M×256 distance table → bytea |
| `svec_pq_distance_table_residual(vec, centroid_id, pq_cb_id, ivf_cb_id)` | Distance table for residual PQ |
| `svec_pq_adc_lookup(dist_table, pq_code)` | ADC distance from precomputed table |
| `svec_pq_adc(vec, pq_code, cb_id)` | ADC distance (builds table internally) |
| `svec_cosine_distance(a, b)` | Exact cosine distance (also available as `<=>`) |