# pg_sorted_heap

A PostgreSQL extension that physically sorts data by primary key, prunes irrelevant blocks via per-page zone maps, and includes built-in vector types plus planner-integrated HNSW search.

## Release surface

### Stable

- `sorted_heap` table AM: physically sorted storage, zone-map pruning, offline/online compact and merge, scan stats, rebuild helpers, eager/lazy update modes.
- `sorted_hnsw` index AM: planner-integrated KNN for `svec` and `hsvec`, shared decoded cache, exact rerank inside the index scan.
- Fact-shaped GraphRAG API:
  - `sorted_heap_graph_rag(...)`
  - `sorted_heap_graph_register(...)`
  - `sorted_heap_graph_config(...)`
  - `sorted_heap_graph_unregister(...)`
  - `sorted_heap_graph_rag_stats()`
  - `sorted_heap_graph_rag_reset_stats()`
- This stable GraphRAG surface is intentionally narrow:
  - fact rows clustered by `(entity_id, relation_id, target_id)` or an equivalent registered alias mapping
  - one-hop through multi-hop retrieval via `relation_path`
  - `relation_path` is a per-hop relation sequence such as `ARRAY[1]`, `ARRAY[1, 2]`, or `ARRAY[1, 2, 3, 4, 5]`
  - `score_mode := 'endpoint' | 'path'`

### Stable (routed GraphRAG)

- `sorted_heap_graph_route(...)` -- unified routed query dispatcher
- `sorted_heap_graph_route_plan(...)` -- routing introspection/explain
- Canonical-flow setup helpers:
  - `sorted_heap_graph_exact_register(...)` / `_unregister(...)`
  - `sorted_heap_graph_segment_register(...)` / `_unregister(...)`
  - `sorted_heap_graph_route_profile_register(...)` / `_unregister(...)`
  - `sorted_heap_graph_route_default_register(...)` / `_unregister(...)`
  - `sorted_heap_graph_route_policy_register(...)` / `_unregister(...)`
  - `sorted_heap_graph_segment_meta_register(...)` / `_unregister(...)`
- Edge-case contract:
  - zero-shard resolution returns empty results
  - ambiguous/conflicting routed inputs are rejected

### Beta

- Lower-level routed building blocks (`_routed`, `_routed_exact`, `_routed_policy`, `_routed_profile`, `_routed_default` and variants, `_segmented`, config/resolve helpers, policy group helpers, and catalog helpers)
- Lower-level GraphRAG expand/rerank helpers and scan wrappers
- Code-corpus and snippet-oriented GraphRAG contracts that currently live in benchmark/reference logic
- FlashHadamard experimental retrieval path:
  - SQL surface in `sql/flashhadamard_experimental.sql`
  - current canonical point: exhaustive parallel engine scan on the `103K x 2880D` Gutenberg workload
  - `FH_INT16=1` is an Apple/NEON-only experimental optimization, not the default execution path
  - `pthread` inside the backend remains experimental architecture debt; this path is documented, benchmarked, and tested, but not promoted to the stable `0.13` release surface

### Legacy/manual

- IVF-PQ via `svec_ann_scan(...)` / `svec_ann_search(...)`
- sidecar HNSW via `svec_hnsw_scan(...)`

## When to use pg_sorted_heap

### Time-series and event logs

Data arrives ordered by time. sorted_heap keeps it physically sorted and skips irrelevant time ranges at the I/O level. At 100M rows a point query reads **1 buffer** (vs 8 for btree, 520K for seq scan).

```sql
CREATE TABLE events (
    ts timestamptz,
    src text,
    data jsonb,
    PRIMARY KEY (ts, src)
) USING sorted_heap;

-- Bulk load (COPY path sorts by PK automatically)
COPY events FROM '/path/to/events.csv';
SELECT sorted_heap_compact('events');

-- Range query: reads only blocks that contain the time range
SELECT * FROM events
WHERE ts BETWEEN '2026-01-01' AND '2026-01-02' AND src = 'sensor-42';
-- Custom Scan (SortedHeapScan): 3 of 12,500 blocks (pruned 12,497)
```

### IoT and sensor data

Composite PK `(device_id, ts)` clusters readings by device. The zone map tracks both columns -- queries on either or both benefit from pruning.
```sql
CREATE TABLE readings (
    device_id int,
    ts timestamptz,
    value float8,
    PRIMARY KEY (device_id, ts)
) USING sorted_heap;

-- Insert millions of readings via COPY, then compact
SELECT sorted_heap_compact('readings');

-- Point lookup by device + time: 1-2 blocks
SELECT * FROM readings
WHERE device_id = 1042 AND ts = '2026-03-01 12:00:00';

-- Range by device: zone map prunes all other devices
SELECT avg(value) FROM readings
WHERE device_id = 1042 AND ts BETWEEN '2026-03-01' AND '2026-03-07';
```

### Planner-integrated vector search (`sorted_hnsw`)

Built-in `svec` (float32, up to 16K dims) and `hsvec` (float16, up to 32K dims) can now use a real PostgreSQL index AM. The default ANN path is:

```sql
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    embedding svec(384),
    content text
);

CREATE INDEX documents_embedding_idx ON documents
USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 200);

SET sorted_hnsw.shared_cache = on;  -- requires shared_preload_libraries='pg_sorted_heap'
SET sorted_hnsw.ef_search = 96;

SELECT id, content FROM documents
ORDER BY embedding <=> '[0.1,0.2,0.3,...]'::svec
LIMIT 10;
```

On constrained builders, the current low-memory build knob is:

```sql
SET sorted_hnsw.build_sq8 = on;
```

That builds `sorted_hnsw` from SQ8-compressed build vectors instead of a full float32 build slab. It costs one extra heap scan during `CREATE INDEX`, but on the current local `1M x 64D` GraphRAG point it preserved quality and slightly improved build time.

This is planner-integrated KNN search: no sidecar prefix argument, no manual rerank knob, and exact rerank happens inside the index scan.
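Putting the low-memory knob in context, a build on a constrained host might look like the sketch below. The index name `documents_embedding_sq8_idx` is hypothetical, and the sketch assumes the GUC only matters while `CREATE INDEX` runs:

```sql
SET sorted_hnsw.build_sq8 = on;  -- build from SQ8-compressed vectors

CREATE INDEX documents_embedding_sq8_idx ON documents
USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 200);

RESET sorted_hnsw.build_sq8;     -- assumption: searches are unaffected by this knob
```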
If you want the base table footprint closer to pgvector `halfvec`, use native `hsvec` instead of upcasting to `svec`:

```sql
CREATE TABLE documents_compact (
    id bigserial PRIMARY KEY,
    embedding hsvec(384),
    content text
);

CREATE INDEX documents_compact_embedding_idx ON documents_compact
USING sorted_hnsw (embedding hsvec_cosine_ops) WITH (m = 16, ef_construction = 200);
```

### GraphRAG and fact graphs (stable fact-shaped API)

`pg_sorted_heap` now has a stable, narrow GraphRAG surface for fact-shaped multi-hop queries. The intended workload is:

- ANN seed retrieval on `entity_id`
- 1-hop or 2-hop expansion over facts clustered by `(entity_id, relation_id, target_id)`
- exact rerank of the expanded candidates

The stable entry point is the unified fact-graph wrapper. Minimal stable shape:

```sql
CREATE TABLE facts (
    entity_id int4,
    relation_id int2,
    target_id int4,
    embedding svec(384),
    payload text,
    PRIMARY KEY (entity_id, relation_id, target_id)
) USING sorted_heap;

CREATE INDEX facts_embedding_idx ON facts
USING sorted_hnsw (embedding) WITH (m = 24, ef_construction = 200);

SET sorted_hnsw.ef_search = 128;

SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass,
    '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    ann_k := 64,
    top_k := 10,
    score_mode := 'path'
);
```

If your fact table uses different column names, register the mapping once:

```sql
SELECT sorted_heap_graph_register(
    'facts_alias'::regclass,
    entity_column := 'src_id',
    relation_column := 'edge_type',
    target_column := 'dst_id',
    embedding_column := 'vec',
    payload_column := 'body'
);
```

`relation_path := ARRAY[1]` gives one-hop expansion, `relation_path := ARRAY[1,2]` gives two-hop expansion, and `relation_path := ARRAY[1,2,3,4,5]` gives a five-hop path with explicit per-hop relation filters. `score_mode := 'endpoint'` ranks only the final-hop facts; `score_mode := 'path'` accumulates evidence across the whole path. For one-hop calls, `score_mode := 'path'` is intentionally the same as `'endpoint'`.
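The two scoring contracts can be compared side by side on the same two-hop query. A sketch reusing the `facts` table above:

```sql
-- Rank by the final-hop facts only
SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass, '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    ann_k := 64, top_k := 10,
    score_mode := 'endpoint'
);

-- Accumulate evidence across both hops; differs from 'endpoint'
-- whenever intermediate-hop facts carry ranking signal
SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass, '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    ann_k := 64, top_k := 10,
    score_mode := 'path'
);
```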
`limit_rows := 0` means unlimited work; positive values cap expansion/rerank work inside the current GraphRAG helper stages rather than changing the final `top_k` contract.

For multi-shard workloads (tenant isolation, knowledge-base routing), use the unified routed dispatcher:

```sql
-- Setup: register shards and map tenant keys
SELECT sorted_heap_graph_register('shard_a'::regclass, ...);
SELECT sorted_heap_graph_exact_register('tenants', 'acme', 'shard_a'::regclass, 100);

-- Query: one entry point for all routing modes
SELECT * FROM sorted_heap_graph_route(
    'tenants',
    '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    route_key := 'acme',
    ann_k := 64,
    top_k := 10,
    score_mode := 'path'
);

-- Inspect: see which shards would be selected
SELECT * FROM sorted_heap_graph_route_plan('tenants', route_key := 'acme');
```

`sorted_heap_graph_route(...)` dispatches to the appropriate routing path (exact-key, range, profile, policy, or default) based on the provided parameters. See `docs/api.md` for the full resolution order and operator recipe.

The lower-level routing building blocks (`_routed`, `_routed_exact`, `_routed_policy`, `_routed_profile`, `_routed_default`, `_segmented`, expand/rerank helpers, scan wrappers, catalog/config/resolve functions) remain available as beta APIs for advanced use cases. See the Beta section above for the full list.

For tuning and debugging, GraphRAG now also exposes backend-local last-call stats:

```sql
SELECT sorted_heap_graph_rag_reset_stats();

SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass,
    '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    ann_k := 64,
    top_k := 10,
    score_mode := 'path'
);

SELECT * FROM sorted_heap_graph_rag_stats();
```

The stats record includes seed count, expanded rows, reranked rows, returned rows, and per-stage timing for ANN, expansion, rerank, and total time.
The `api` field reports the concrete GraphRAG execution path, so the unified wrapper may report `sorted_heap_graph_rag_twohop_path_scan` or another underlying helper/wrapper path.

For `0.13` hardening, the fact-graph path now also has dedicated lifecycle coverage for:

- upgrade + dump/restore
- persistence of routed/segmented GraphRAG registries across dump/restore, including shared shard metadata, shared `segment_labels`, and effective default `segment_labels`
- crash recovery
- concurrent DML during `sorted_heap_compact_online(...)`
- concurrent DML during `sorted_heap_merge_online(...)`

On the current AWS ARM64 rerun (`4 vCPU`, `8 GiB RAM`), `5K` chains / `10K` rows / `384D`, the current portable point is:

- `m=24`
- `ef_construction=200`
- `ann_k=64`
- `sorted_hnsw.ef_search=128`

which gives, on the current path-aware two-hop contract:

- `sorted_heap_expand_twohop_path_rerank()` -> `0.955 ms`, `hit@1 98.4%`, `hit@k 98.4%`
- `sorted_heap_graph_rag_twohop_path_scan()` -> `1.018 ms`, `hit@1 98.4%`, `hit@k 98.4%`
- `pgvector` + heap expansion -> `1.422 ms`, `hit@1 85.9%`, `hit@k 85.9%`
- `zvec` + heap expansion -> `1.720 ms`, `hit@1 100.0%`, `hit@k 100.0%`
- `Qdrant` + heap expansion -> `3.435 ms`, `hit@1 100.0%`, `hit@k 100.0%`

At the same knobs, the older city-only contract stays around `0.95-1.01 ms` but drops to `hit@1 75.0%` / `hit@k 96.9%`, so the main quality gain now comes from the scorer contract rather than from a slower seed frontier. On the apples-to-apples path-aware contract, `sorted_heap` is the latency leader, while `zvec` and Qdrant still hold the strongest observed answer quality.

The local M-series tuning run found a slightly different frontier, but the new path-aware helper transferred cleanly to AWS at both `5K` and `10K` chains. The full tuning history, reasoning, external-engine caveats, and larger-scale results live in [docs/design-graphrag.md](docs/design-graphrag.md).
The current `0.13` release target and hardening gates live in [docs/graphrag-0.13-plan.md](docs/graphrag-0.13-plan.md).

There is now also a **real code-corpus** GraphRAG benchmark using the actual `cogniformerus` source tree plus the real CrossFile prompts from `butler_code_test.cr`. On that corpus, the current result is not one universal winner but a split frontier:

- generic embedding mode:
  - `prompt_summary_snippet_py`
  - local repeated-build median `0.613 ms`, AWS repeated-build median `0.955 ms`
  - stable `100.0%` keyword coverage / `100.0%` full hits
- code-aware embedding mode:
  - `prompt_symbol_summary_snippet_py`
  - local repeated-build median `0.963 ms`, AWS repeated-build median `1.541 ms`
  - stable `100.0%` keyword coverage / `100.0%` full hits

The key correction is that the code-aware path needed exact prompt-symbol rescue (`HierarchicalMemory`, `TwoStageAnswerer`, `DialogueNLU`, etc.) in the summary seed stage; the generic path did not. The repeated-build and AWS results are documented in [docs/design-graphrag.md](docs/design-graphrag.md) and summarized in [docs/benchmarks.md](docs/benchmarks.md).

A larger in-repo transfer gate now also exists on the full `cogniformerus` repository (`183` Crystal files instead of the smaller `40`-file synced slice). At the old `top_k=4` budget, both winner contracts drift to about `87%` keyword coverage and `66.7%` full hits.
Raising only the final result budget to `top_k=8` restores repeated-build stable `100.0% / 100.0%` on that larger corpus:

- generic `prompt_summary_snippet_py`
  - local repeated-build median `0.819 ms`
  - stable `100.0%` keyword coverage / `100.0%` full hits
- code-aware `prompt_symbol_summary_snippet_py`
  - local repeated-build median `1.814 ms`
  - stable `100.0%` keyword coverage / `100.0%` full hits

So the larger in-repo Crystal-side transfer gate is now verified, but it is slightly more result-budget-sensitive than the smaller synced slice.

### Legacy/manual ANN paths

The older explicit ANN paths are still available when you want manual control over recall/latency tradeoffs or compressed storage:

- `svec_ann_scan(...)` / `svec_ann_search(...)` for IVF-PQ
- `svec_hnsw_scan(...)` for sidecar-table HNSW experiments and manual rerank

Those paths are still documented, but they are no longer the default vector story for this repository.

### Large-table analytics with range predicates

Any workload where queries filter on the primary key benefits from zone map pruning. sorted_heap eliminates the need for a btree index on the PK -- the zone map replaces it with ~30% less storage.

```sql
CREATE TABLE orders (
    order_id bigint PRIMARY KEY,
    customer int,
    total numeric,
    created_at timestamptz
) USING sorted_heap;

-- After compact: point query reads 1 block, range query reads exact range
-- No btree index needed -- zone map handles PK predicates directly
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE order_id BETWEEN 50000 AND 50100;
-- Custom Scan (SortedHeapScan): 1 of 8,000 blocks (pruned 7,999)
```

## UPDATE modes: eager vs lazy

sorted_heap has two zone map maintenance modes for UPDATEs, controlled by the `sorted_heap.lazy_update` GUC. The mode is **never activated automatically** -- you choose it explicitly based on your workload.
### Eager mode (default)

```sql
SET sorted_heap.lazy_update = off;  -- default, no action needed
```

Every UPDATE that changes a zone map entry flushes the meta page to disk (WAL write). This keeps the zone map accurate for scan pruning at all times, but adds per-UPDATE overhead: small-column UPDATE throughput drops to ~46% of plain heap, while large-vector updates stay near heap parity (~100%) because the TOAST write dominates.

**Use when:** read-heavy workloads that depend on zone map scan pruning (range queries, analytics). Occasional UPDATEs are fine -- the overhead matters only at high UPDATE throughput.

### Lazy mode

```sql
SET sorted_heap.lazy_update = on;  -- per session
-- or globally:
ALTER SYSTEM SET sorted_heap.lazy_update = on;
SELECT pg_reload_conf();
```

The first UPDATE on a covered page invalidates the zone map on disk (one WAL write). All subsequent UPDATEs skip zone map maintenance entirely. The planner falls back to Index Scan (btree) instead of Custom Scan (zone map pruning). INSERT always uses eager maintenance regardless of this setting.

**Use when:** write-heavy workloads (high UPDATE throughput), or when point lookups use btree Index Scan anyway (no range queries).
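The planner fallback can be observed directly. A sketch against the `readings` table from the IoT example; the resulting plan text is an assumption and depends on your schema and indexes:

```sql
SET sorted_heap.lazy_update = on;

-- First UPDATE on a covered page invalidates the on-disk zone map
UPDATE readings SET value = value + 1 WHERE device_id = 1042;

-- A PK range query now falls back to the btree path instead of the
-- pruned Custom Scan (SortedHeapScan) until compact/merge runs
EXPLAIN SELECT * FROM readings
WHERE device_id = 1042 AND ts >= '2026-03-01';
```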
Compact or merge restores zone map pruning when needed:

```sql
-- Restore zone map pruning after a batch of updates
SELECT sorted_heap_compact('t');
-- or incrementally:
SELECT sorted_heap_merge('t');
```

### Decision guide

| Workload | Mode | Why |
|----------|------|-----|
| Append-only (INSERT + range SELECT) | Eager | Zone map pruning is the main benefit |
| Mixed OLTP (UPDATE + point SELECT) | Lazy | Point lookups use btree; UPDATE needs speed |
| Bulk ETL (INSERT, then query) | Eager | Compact after load, queries use zone map |
| Write-heavy + periodic analytics | Lazy | Lazy during writes, compact before analytics |

### Per-transaction control

```sql
-- Temporarily switch to lazy for a batch update
BEGIN;
SET LOCAL sorted_heap.lazy_update = on;
UPDATE large_table SET status = 'done' WHERE batch_id = 42;
COMMIT;
-- Outer session remains in eager mode
```

## How it works

`sorted_heap` keeps data physically ordered by primary key:

1. **Sorted bulk insert** -- `multi_insert` (COPY path) sorts each batch by PK before writing to heap. Produces physically sorted runs.
2. **Zone maps** -- Block 0 stores per-page `(col1_min, col1_max, col2_min, col2_max)` for the first two PK columns. Unlimited capacity via overflow page chain. Supported types: int2/4/8, timestamp, timestamptz, date, uuid, text/varchar (`COLLATE "C"`).
3. **Compaction** -- `sorted_heap_compact(regclass)` does a full CLUSTER rewrite; `sorted_heap_merge(regclass)` does an incremental two-way merge of sorted prefix + unsorted tail. Both have online (non-blocking) variants.
4. **Scan pruning** -- A planner hook injects a `SortedHeapScan` custom path when the WHERE clause has PK predicates. The executor uses `heap_setscanlimits()` to physically skip pruned blocks. Supports constants, prepared statements, `IN`/`ANY(array)`, and `LATERAL`/NestLoop runtime parameters.
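Pruning with non-constant bounds can be exercised as below. A sketch against the `events` table from the time-series example; whether pruning fires in each case is the extension's documented behavior, not something these queries print by themselves:

```sql
-- Prepared statement: bounds arrive as $1/$2 at execute time,
-- so pruning must happen with runtime parameters
PREPARE recent(timestamptz, timestamptz) AS
SELECT count(*) FROM events WHERE ts BETWEEN $1 AND $2;
EXECUTE recent('2026-01-01', '2026-01-02');

-- LATERAL / NestLoop runtime parameter: the inner scan can be pruned
-- once per outer row using the runtime value of w.day
SELECT w.day, hits.n
FROM (VALUES ('2026-01-01'::timestamptz)) AS w(day)
CROSS JOIN LATERAL (
    SELECT count(*) AS n FROM events e
    WHERE e.ts BETWEEN w.day AND w.day + interval '1 day'
) AS hits;
```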
```
COPY -> sort by PK -> heap insert -> update zone map
                         |
compact/merge -> rewrite -> rebuild zone map -> set valid flag
                         |
SELECT WHERE pk op const -> planner hook -> extract bounds
  -> zone map lookup -> block range -> heap_setscanlimits -> skip I/O
```

## Performance

Use [docs/benchmarks.md](docs/benchmarks.md) for the full benchmark matrix and methodology. The summary below keeps only the current representative rows and avoids older narrow-range comparisons that no longer reflect the release story cleanly.

### Query latency (representative)

PostgreSQL 18, Apple M-series, `shared_buffers=4GB`, warm cache. `100M` rows (`7.8 GB` sorted_heap, `7.8 GB` heap+btree):

| Query | sorted_heap | heap+btree | heap seqscan |
|-------|------------|-----------|-------------|
| Point (1 row) | 0.045ms / **1 buf** | 0.506ms / 8 bufs | 1,190ms / 519K bufs |
| Medium (5K) | 0.479ms / 38 bufs | 0.812ms / 58 bufs | 1,326ms / 519K bufs |
| Wide (100K) | 7.9ms / 737 bufs | 10.1ms / 1,017 bufs | 1,405ms / 518K bufs |

For very narrow hot-range queries, btree can still tie or slightly edge out `sorted_heap`; the win here is lower I/O and stronger scaling once the working set grows. The full benchmark doc keeps those narrower rows for context.

### Throughput (pgbench, 10s, 1 client, prepared mode)

| Query | 1M (sh/btree) | 10M (sh/btree) | 100M (sh/btree) |
|-------|:---:|:---:|:---:|
| Point | 46.9K/59.4K | 46.5K/58.0K | 32.6K/43.6K |
| Wide (100K) | 295/289 | 293/286 | 168/157 |

At 100M rows with simple mode, sorted_heap wins all query types: point 4x (18.7K vs 4.6K TPS).
### CRUD performance (500K rows, svec(128), prepared mode)

| Operation | eager / heap | lazy / heap | Notes |
|-----------|:-----------:|:-----------:|-------|
| SELECT PK | 85% | 85% | Index Scan via btree |
| SELECT range 1K | 97% | -- | Custom Scan pruning |
| Bulk INSERT | 100% | 100% | Always eager |
| DELETE + INSERT | 63% | 63% | INSERT always eager |
| UPDATE non-vec | 46% | **100%** | Lazy skips zone map flush |
| UPDATE vec col | 102% | **100%** | Parity both modes |
| Mixed OLTP | 83% | **97%** | Near-parity with lazy |

**Eager mode** (default): maintains zone maps on every UPDATE for scan pruning. **Lazy mode** (`SET sorted_heap.lazy_update = on`): trades scan pruning for UPDATE parity with heap; compact/merge restores pruning. Recommended for write-heavy workloads.

### Vector search comparison

Current repo-owned harnesses:

- `python3 scripts/bench_gutenberg_local_dump.py --dump /tmp/cogniformerus_backup/cogniformerus_backup.dump --port 65473`
- `REMOTE_PYTHON=/path/to/python SH_EF=32 EXTRA_ARGS='--sh-ef-construction 200' ./scripts/bench_gutenberg_aws.sh /path/to/repo /path/to/dump 65485`
- `scripts/bench_sorted_hnsw_vs_pgvector.sh /tmp 65485 10000 20 384 10 vector 64 96`
- `python3 scripts/bench_ann_real_dataset.py --dataset nytimes-256 --sample-size 10000 --queries 20 --k 10 --pgv-ef 64 --sh-ef 96 --zvec-ef 64 --qdrant-ef 64`
- `python3 scripts/bench_qdrant_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`
- `python3 scripts/bench_zvec_synthetic.py --rows 10000 --queries 20 --dim 384 --k 10 --ef 64`

Restored Gutenberg dump on an AWS ARM64 host (4 vCPU, 8 GiB RAM), top-10. Ground truth is recomputed by exact PostgreSQL heap search on the restored `svec` table; in the current rerun the stored `bench_hnsw_gt` table matched that exact GT on 100% of the 50 benchmark queries, so the fresh heap GT and the historical GT table agree.
This rerun uses `sorted_hnsw` with `ef_construction=200` and `ef_search=32`, and the measurement phase reconnects after build before timing ordered scans.

| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 458.762ms | 100.0% | brute-force GT on restored corpus |
| **`sorted_hnsw` (`svec`)** | **1.287ms** | **100.0%** | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1902 MB |
| `sorted_hnsw` (`hsvec`) | 1.404ms | 100.0% | `ef_construction=200`, `ef_search=32`, index 404 MB, total 1032 MB |
| pgvector HNSW (`halfvec`) | 2.031ms | 99.8% | `ef_search=64`, index 804 MB, total 1615 MB |
| zvec HNSW | 50.499ms | 100.0% | in-process collection, `ef=64`, ~1.12 GiB on disk |
| Qdrant HNSW | 6.028ms | 99.2% | local Docker on same AWS host, `hnsw_ef=64`, 103,260 points |

The precision-matched AWS row is `sorted_hnsw (hsvec)` vs pgvector `halfvec`: `1.404ms @ 100.0%` versus `2.031ms @ 99.8%`, with total footprint `1032 MB` versus `1615 MB`. The raw fastest PostgreSQL row on this corpus is still `sorted_hnsw (svec)` at `1.287ms`, but that compares float32 source storage against float16 storage.

The `sorted_hnsw` index stays 404 MB for both `svec` and `hsvec` because the index stores SQ8 graph state; the space win from `hsvec` appears in the base table and TOAST footprint (`1902 MB -> 1032 MB`), not in the index itself.

Synthetic 10K x 384D cosine corpus, top-10, warm query loop. PostgreSQL methods were rerun across 3 fresh builds and the table below reports median `p50` / median recall. Qdrant uses 3 warm measurement passes on one local Docker collection.
| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 2.03ms | 100% | Brute-force ground truth |
| **sorted_hnsw** | **0.158ms** | **100%** | `shared_cache=on`, `ef_search=96`, index ~5.4 MB |
| pgvector HNSW (`vector`) | 0.446ms | 90% median (90-95 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~2.0 MB |
| zvec HNSW | 0.611ms | 100% | local in-process collection, `ef=64` |
| Qdrant HNSW | 1.94ms | 100% | local Docker, `hnsw_ef=64` |

Real-dataset sample (`nytimes-256-angular`, sampled 10K x 256D, top-10). The table below reports medians across 3 full harness runs. Ground truth is exact heap search inside PostgreSQL on the sampled corpus, not numpy-side cosine.

| Method | p50 latency | Recall@10 | Notes |
|--------|:-----------:|:---------:|-------|
| Exact heap (`svec`) | 1.557ms | 100% | ground truth |
| **sorted_hnsw** | **0.327ms** | **85.0% median** (83.5-85.5 range) | `shared_cache=on`, `ef_search=96`, index ~4.1 MB |
| pgvector HNSW (`vector`) | 0.751ms | 79.0% median (78.5-79.0 range) | `ef_search=64`, same `M=16`, `ef_construction=64`, index ~13 MB |
| zvec HNSW | 0.403ms | 99.5% | local in-process collection, `ef=64`, ~14.1 MB on disk |
| Qdrant HNSW | 1.704ms | 99.5% | local Docker, `hnsw_ef=64` |

The synthetic corpus is an easy case; the `nytimes-256` sample is a much harsher recall test at the same fixed HNSW settings. Treat the synthetic table as an upper-bound speed/quality shape, not the only baseline.
## Quick start

### Requirements

- PostgreSQL 17 or 18
- Standard PGXS build toolchain (`pg_config` in PATH)

### Build and install

```bash
make && make install
```

### Create a table

```sql
CREATE EXTENSION pg_sorted_heap;

CREATE TABLE events (
    id int PRIMARY KEY,
    ts timestamptz,
    payload text
) USING sorted_heap;

-- Bulk load (COPY path sorts by PK automatically)
INSERT INTO events
SELECT i, now() - (i || ' seconds')::interval, repeat('x', 80)
FROM generate_series(1, 100000) i;

-- Compact to globally sort and build zone map
SELECT sorted_heap_compact('events');

-- Zone map pruning kicks in automatically
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events WHERE id BETWEEN 500 AND 600;
-- Custom Scan (SortedHeapScan)
--   Zone Map: 2 of 1946 blocks (pruned 1944)
```

### Run tests

```bash
make installcheck               # regression tests
make test-crash-recovery        # crash recovery (4 scenarios)
make test-graphrag-crash        # GraphRAG crash recovery
make test-graphrag-concurrent   # GraphRAG concurrent DML + online ops
make test-graphrag-release      # GraphRAG release-candidate bundle
make test-release               # full extension release-candidate bundle
make test-concurrent            # concurrent DML + online ops
make test-toast                 # TOAST integrity
make test-alter-table           # ALTER TABLE DDL (36 checks)
make test-dump-restore          # pg_dump/restore lifecycle
make test-graphrag-lifecycle    # GraphRAG upgrade + dump/restore lifecycle
make test-graph-builder         # graph sidecar bootstrap + rebuild smoke
make test-pg-upgrade            # pg_upgrade 17->18
make policy-safety-selftest     # policy + doc contract checks
make pg-core-regression-smoke   # PG core regression smoke test
make selftest-lightweight       # lightweight selftest suite
```

Command selection quick map: see [OPERATIONS.md](OPERATIONS.md) for the full list of available make targets and their descriptions.
## SQL API

### Compaction

```sql
-- Offline compact: full CLUSTER rewrite (AccessExclusiveLock)
SELECT sorted_heap_compact('t');

-- Online compact: non-blocking (ShareUpdateExclusiveLock,
-- brief AccessExclusiveLock for final swap)
CALL sorted_heap_compact_online('t');

-- Offline merge: sorted prefix + unsorted tail (faster than full compact)
SELECT sorted_heap_merge('t');

-- Online merge
CALL sorted_heap_merge_online('t');
```

### Zone map inspection

```sql
SELECT sorted_heap_zonemap_stats('t');    -- flags, entry count, ranges
SELECT sorted_heap_rebuild_zonemap('t');  -- manual rebuild
```

### Scan statistics

```sql
SELECT * FROM sorted_heap_scan_stats();   -- total_scans, blocks_scanned, blocks_pruned
SELECT sorted_heap_reset_stats();         -- reset counters
```

### Vector types

```sql
-- svec: float32, up to 16,000 dimensions
CREATE TABLE items (id text PRIMARY KEY, embedding svec(768)) USING sorted_heap;

-- hsvec: float16, up to 32,000 dimensions (half storage)
CREATE TABLE items_h (id text PRIMARY KEY, embedding hsvec(768)) USING sorted_heap;

-- Cosine distance: <=> operator
SELECT a.embedding <=> b.embedding FROM items a, items b WHERE a.id='a' AND b.id='b';
```

### IVF-PQ search (legacy/manual path)

```sql
-- PQ-only (fastest, ~8ms)
SELECT * FROM svec_ann_scan('tbl', query, nprobe:=3, lim:=10, cb_id:=1, ivf_cb_id:=1);

-- With exact reranking (higher recall, ~22ms)
SELECT * FROM svec_ann_scan('tbl', query, nprobe:=10, lim:=10, rerank_topk:=200, cb_id:=1, ivf_cb_id:=1);
```

### HNSW index (`sorted_hnsw`)

```sql
CREATE INDEX items_embedding_idx ON items
USING sorted_hnsw (embedding) WITH (m = 16, ef_construction = 200);

SET sorted_hnsw.shared_cache = on;
SET sorted_hnsw.ef_search = 96;

SELECT id FROM items ORDER BY embedding <=> query LIMIT 10;
```

Current ordered-scan contract:

- The automatic `sorted_hnsw` path is for base-relation `ORDER BY embedding <=> query LIMIT k` queries.
- The planner does not use it for the current Phase 1 contract when there is no `LIMIT`, when `LIMIT > sorted_hnsw.ef_search`, or when extra base-table quals would make the index under-return candidates.
- For filtered expansion workflows, materialize/filter first or use the GraphRAG helper/wrapper API below.

### HNSW sidecar search (`svec_hnsw_scan`, legacy/manual path)

```sql
SET sorted_heap.hnsw_cache_l0 = on;  -- session-local sidecar cache

-- Fastest top-1: 0.51ms (SQ8 cache, 1 TOAST read)
SELECT * FROM svec_hnsw_scan('tbl', query, 'tbl_hnsw', ef_search:=32, lim:=1, rerank_topk:=1);

-- Fast top-10: 1.19ms, 98.6% recall (rerank = lim)
SELECT * FROM svec_hnsw_scan('tbl', query, 'tbl_hnsw', ef_search:=96, lim:=10, rerank_topk:=10);

-- Balanced top-10: 1.52ms, 99.8% recall (rerank = 2x lim)
SELECT * FROM svec_hnsw_scan('tbl', query, 'tbl_hnsw', ef_search:=96, lim:=10, rerank_topk:=20);
```

### GraphRAG (stable fact-shaped API)

```sql
CREATE TABLE facts (
    entity_id int4,
    relation_id int2,
    target_id int4,
    embedding svec(384),
    payload text,
    PRIMARY KEY (entity_id, relation_id, target_id)
) USING sorted_heap;

CREATE INDEX facts_embedding_idx ON facts
USING sorted_hnsw (embedding) WITH (m = 24, ef_construction = 200);

SET sorted_hnsw.ef_search = 128;

-- One-hop fact retrieval
SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass,
    '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1],
    ann_k := 64,
    top_k := 10,
    score_mode := 'endpoint'
);

-- Two-hop path-aware retrieval
SELECT * FROM sorted_heap_graph_rag(
    'facts'::regclass,
    '[0.1,0.2,0.3,...]'::svec,
    relation_path := ARRAY[1, 2],
    ann_k := 64,
    top_k := 10,
    score_mode := 'path'
);
```

Recommended release positioning for `0.13`:

- **stable**: the unified fact-shaped API above
- **beta**: lower-level wrappers/helper composition and the FlashHadamard experimental retrieval lane
- **reference logic**: code-corpus snippet/symbol/lexical contracts used by the benchmark harnesses

### Configuration

```sql
SET sorted_heap.enable_scan_pruning = on;     -- zone map pruning (default: on)
SET sorted_heap.vacuum_rebuild_zonemap = on;  -- VACUUM rebuilds zone map (default: on)
SET sorted_heap.lazy_update = on;             -- skip per-UPDATE zone map flush
SET sorted_hnsw.shared_cache = on;            -- shared decoded cache for sorted_hnsw
SET sorted_hnsw.ef_search = 96;               -- sorted_hnsw beam width
SET sorted_heap.hnsw_cache_l0 = on;           -- session-local HNSW L0+upper cache
SET sorted_heap.hnsw_cache_sq8 = on;          -- SQ8 quantize svec in cache (4x smaller)
SET sorted_heap.hnsw_ef_patience = 0;         -- adaptive ef early termination (0=off)
SET sorted_heap.ann_timing = on;              -- timing breakdown in DEBUG1
```

## Architecture

| File | Purpose |
|------|---------|
| `sorted_heap.h` | Meta page layout, zone map structs (v7), SortedHeapRelInfo |
| `sorted_heap.c` | Table AM: sorted multi_insert, zone map persistence, compact, merge, vacuum |
| `sorted_heap_scan.c` | Custom scan: planner hook, parallel scan, multi-col pruning, IN/ANY, runtime params |
| `sorted_heap_online.c` | Online compact + merge: trigger-based copy, replay, swap |
| `pg_sorted_heap.c` | Extension entry, legacy clustered index AM, GUC registration |
| `svec.h` / `svec.c` | svec type (float32): I/O, typmod, NEON cosine distance `<=>` |
| `hsvec.h` / `hsvec.c` | hsvec type (float16): I/O, cosine distance, NEON SIMD, casts |
| `pq.h` / `pq.c` | PQ, IVF, ANN scan, sidecar HNSW search, graph scan, beam search |

### Zone map details

- **v7 format**: 32-byte entries with col1 + col2 min/max per page, persisted sorted prefix count
- Meta page (block 0): 250 entries in special space
- Overflow pages: 254 entries/page, linked list (no capacity limit)
- Updated atomically via GenericXLog during multi_insert
- Autovacuum rebuilds zone map when validity flag is not set

### Custom scan provider

- Hooks into `set_rel_pathlist_hook`
- Extracts PK bounds from WHERE clauses (constants, params, IN/ANY)
- Binary search on monotonically sorted zone map, linear scan otherwise
- `heap_setscanlimits(start, nblocks)` for physical block skip
- Parallel-aware: `add_partial_path` + Gather for multi-worker scans
- EXPLAIN: `Zone Map: N of M blocks (pruned P)`

## Limitations

- Zone map tracks the first two PK columns. Supported types: int2/4/8, timestamp(tz), date, uuid, text/varchar (`COLLATE "C"`).
- Online compact/merge not supported for UUID/text/varchar PKs.
- UPDATE does not re-sort; use compact/merge periodically.
- Automatic `sorted_hnsw` ordered scans currently target base-relation `ORDER BY embedding <=> query LIMIT k` queries. Filtered retrieval flows should use explicit materialization or the GraphRAG helper/wrapper surface.
- `heap_setscanlimits()` only supports contiguous block ranges.
- pg_dump/restore: compact needed after restore.
- pg_upgrade 17 to 18: tested and verified.

## Documentation

- [Quick Start](docs/quickstart.md)
- [Vector Search](docs/vector-search.md) -- `sorted_hnsw`, IVF-PQ, sidecar HNSW
- [Architecture](docs/architecture.md) -- zone maps, custom scan, compaction
- [SQL API](docs/api.md) -- full function reference
- [Benchmarks](docs/benchmarks.md) -- latency, throughput, vector search
- [GraphRAG 0.13 Plan](docs/graphrag-0.13-plan.md) -- stable target and hardening gates
- [Release 0.13.0 Draft](docs/release-0.13.0.md) -- repo-owned GitHub release note draft
- [FlashHadamard Note](docs/flashhadamard-note.md) -- experimental retrieval lane and canonical operating points
- [FlashHadamard Retrospective](docs/flashhadamard-retrospective.md) -- what was proved, refuted, and still unproven
- [Limitations](docs/limitations.md)
- [Changelog](CHANGELOG.md)
- [Operations](OPERATIONS.md) -- make targets and diagnostics

## License

Released under the [PostgreSQL License](LICENSE).