# Design: `sorted_hnsw` Shared Scan Cache

## Status

Implemented in:

- `f33d620` `perf: share decoded sorted_hnsw scan cache across backends`
- `27e77a3` `perf: cost sorted_hnsw shared-cache startup separately`

The original upper-level-only payload was a useful falsifier, but it only cut fresh 50k/384D scans by about 10% and missed the bar. The implemented version therefore promotes the shared slot to a full decoded immutable scan snapshot.

## Problem

`sorted_hnsw` already keeps a decoded scan cache per backend in [`src/sorted_hnsw.c`](../src/sorted_hnsw.c), keyed by `{relid, relfilenode, cache_gen}`. That solves repeated scans inside one backend, but every new backend still pays to rebuild the same immutable scan-side state:

- SQ8 min/scale arrays
- upper-level adjacency
- upper-level `nid -> index` lookup tables
- scan-cache control structs

The new fixed-graph harness can now measure the gap directly:

- `fresh` mode: one ordered scan per new backend
- `reuse` mode: one warmup query, then repeated scans in a single backend

Representative pre-shared-cache 384D result:

- `fresh avg_ms = 0.5803`
- `reuse avg_ms = 0.3007`

So roughly half of the observed fresh-backend latency is still startup work, not the search itself.

## Goal

Reduce cross-backend cold-start cost for ordered `sorted_hnsw` scans without changing search semantics or weakening crash safety.

## Non-goals

- Do not change on-disk page layout.
- Do not make writable-path behavior depend on shared-memory mutation.
- Do not share exact rerank heap/TOAST state.
- Do not start with a general-purpose multi-index cache manager if a smaller falsifier can prove or kill the idea first.

## Current architecture

Today each backend does:

1. Read metapage.
2. Read SQ8 auxiliary pages.
3. Decode upper levels into backend-local `palloc()` allocations.
4. Build upper-level lookup tables.
5. Keep L0 decoded lazily per backend.

That means `fresh` mode repeatedly duplicates the same immutable decode work.

## Implemented Phase A: single-slot shared immutable cache

Start with one experimental shared cache slot, effective only when the extension is loaded via `shared_preload_libraries`.

Why a single slot first:

- it is enough to validate the measured bottleneck on the current benchmark
- it keeps invalidation and memory budgeting simple
- it gives a hard falsifier before building a more complex multi-slot cache

### What to share

Share immutable decoded scan-side data:

- SQ8 mins
- SQ8 scales
- decoded L0 node headers
- decoded L0 neighbor slab
- decoded L0 SQ8 vectors
- upper-level adjacency
- upper-level `nid -> index` arrays
- immutable header fields: `M`, `dim`, `n_nodes`, `entry_nid`, `max_level`, `cache_gen`, `relid`, `locator`

Keep these backend-local:

- L0 decoded pages and neighbor arrays
- visited bitsets
- candidate/result scratch
- writable-path insert helpers
- rerank heap/slot state

### Why this split

The upper-level-only split was safe but insufficient. The dominant remaining fresh-backend tax was the decoded L0 snapshot itself, so the implemented slot stores the full immutable decoded graph and relies on `cache_gen` invalidation to keep writable-path updates from reusing stale state.

## Shared memory layout

Use the existing shared-memory hook pattern already present in [`src/sorted_heap_scan.c`](../src/sorted_heap_scan.c); a minimal sketch of that pattern follows.
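For orientation, here is a minimal sketch of that reservation/attach pattern, assuming the PostgreSQL 15+ `shmem_request_hook` API. The `shnsw_*` names, the `_PG_init` wiring, and the `SHNSW_SHARED_PAYLOAD_BYTES` constant are illustrative stand-ins, not the identifiers in `src/sorted_hnsw.c`; `ShnswSharedScanCacheCtl` is the struct sketched in the next subsection.

```c
/*
 * Illustrative sketch only: reserve the control struct plus the fixed
 * payload arena at postmaster startup, then attach in each process.
 * All shnsw_* names here are hypothetical.
 */
#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;

#define SHNSW_SHARED_PAYLOAD_BYTES ((Size) 64 * 1024 * 1024)

static shmem_request_hook_type prev_shmem_request_hook = NULL;
static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

static ShnswSharedScanCacheCtl *shnsw_shared_ctl = NULL;
static char *shnsw_shared_payload = NULL;
static LWLock *shnsw_shared_lock = NULL;

static void
shnsw_shmem_request(void)
{
	if (prev_shmem_request_hook)
		prev_shmem_request_hook();

	/* Control struct plus the fixed 64 MB payload arena. */
	RequestAddinShmemSpace(MAXALIGN(sizeof(ShnswSharedScanCacheCtl)) +
						   SHNSW_SHARED_PAYLOAD_BYTES);
	RequestNamedLWLockTranche("shnsw_shared_scan_cache", 1);
}

static void
shnsw_shmem_startup(void)
{
	bool		found;

	if (prev_shmem_startup_hook)
		prev_shmem_startup_hook();

	LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);

	shnsw_shared_ctl = ShmemInitStruct("shnsw shared scan cache ctl",
									   sizeof(ShnswSharedScanCacheCtl),
									   &found);
	if (!found)
		memset(shnsw_shared_ctl, 0, sizeof(*shnsw_shared_ctl));

	shnsw_shared_payload = ShmemInitStruct("shnsw shared scan cache arena",
										   SHNSW_SHARED_PAYLOAD_BYTES,
										   &found);

	shnsw_shared_lock =
		&GetNamedLWLockTranche("shnsw_shared_scan_cache")->lock;

	LWLockRelease(AddinShmemInitLock);
}

void
_PG_init(void)
{
	/* Effective only under shared_preload_libraries, per the design. */
	if (!process_shared_preload_libraries_in_progress)
		return;

	prev_shmem_request_hook = shmem_request_hook;
	shmem_request_hook = shnsw_shmem_request;
	prev_shmem_startup_hook = shmem_startup_hook;
	shmem_startup_hook = shnsw_shmem_startup;
}
```

Because the structs are named, every backend attaches to the same slot; it survives individual backend exits but not a postmaster restart, which matches the cold-start cost this design targets.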
Add:

- one control struct in main shared memory
- one fixed-size payload arena reserved at startup
- one LWLock protecting publish/replace of the slot

Use offsets inside the payload arena rather than raw `palloc()`-style pointer ownership.

### Control struct sketch

```c
typedef struct ShnswSharedScanCacheCtl
{
	slock_t		mutex_or_lwlock;
	bool		valid;
	Oid			relid;
	RelFileLocator locator;
	uint64		cache_gen;
	int32		dim;
	int16		M;
	int16		max_level;
	int32		n_nodes;
	int32		entry_nid;
	Size		payload_bytes;
	Size		sq8_mins_off;
	Size		sq8_scales_off;
	Size		upper_entries_off[SHNSW_MAX_LEVELS];
	Size		upper_nbr_idx_off[SHNSW_MAX_LEVELS];
	int32		upper_count[SHNSW_MAX_LEVELS];
} ShnswSharedScanCacheCtl;
```

### Initial payload budget

Start with a fixed budget just for the experimental slot. Current payload budget: `64 MB`. If the decoded snapshot does not fit, publication bails out and the backend falls back to the normal private decode path.

## Backend behavior

On scan-cache lookup:

1. Read `cache_gen` from the metapage.
2. Check the shared slot under lock.
3. If `{relid, locator, cache_gen}` match and payload is present:
   - attach backend-local lightweight wrappers to shared immutable arrays
   - skip SQ8/upper decode
4. Otherwise:
   - build the normal backend-local cache
   - materialize any deferred L0 pages
   - if the shared slot is empty or stale and the payload fits, publish the decoded immutable snapshot into shared memory

A sketch of this lookup flow appears in the appendix at the end of this document.

On insert, vacuum tombstone, rebuild, relfilenode swap, or reindex:

- metapage already bumps `cache_gen`
- next backend lookup sees mismatch and treats the shared slot as stale
- stale slots are replaced lazily on the next eligible scan

## Correctness constraints

- shared cache must never survive a `cache_gen` mismatch
- any `cache_gen` mismatch forces a miss
- payload publication must be all-or-nothing
- fallback to backend-local decode must remain correct when shmem is disabled, full, or stale

## Falsifier

The branch cleared the original gate:

- `fresh` 50k/384D fixed-graph average:
  - `shared_cache=off -> 1.9213 ms`
  - `shared_cache=on -> 0.6337 ms`
- warm same-backend 50k/384D:
  - `shared_cache=off -> 0.2180 ms`
  - `shared_cache=on -> 0.0917 ms`
- `make installcheck` green
- `scripts/test_sorted_hnsw_crash_recovery.sh` green
- fresh-backend insert invalidation check:
  - nearest before insert: `833:0.000258`
  - nearest after insert: `2001:0.000000`

That is enough to keep the branch and stop treating it as speculative.

## Validation plan

Use the new benchmark split from [`scripts/bench_sorted_hnsw_fixed_graph.sh`](../scripts/bench_sorted_hnsw_fixed_graph.sh):

- `fresh`: cross-backend cold-start cost
- `reuse`: steady-state search cost

Required checks:

- `make installcheck REGRESS='pg_sorted_heap sorted_hnsw'`
- `./scripts/test_sorted_hnsw_crash_recovery.sh /tmp`
- `./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp 2000 8 384 fresh on off`
- `./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp 2000 8 384 fresh on on`
- `./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp 2000 8 384 reuse on off`
- `./scripts/bench_sorted_hnsw_fixed_graph.sh /tmp 2000 8 384 reuse on on`

## If Phase A works

Only then consider:

- more than one shared slot
- LRU/size-based replacement
- optional shared L0 decode for read-mostly indexes
- explicit observability counters for shared hits/misses/publishes

## If Phase A fails

Stop local cache-manager work and accept that the next performance branch is either:

- a larger shared-cache redesign with DSA/DSM, or
- planner/policy work that avoids paying fresh-backend cost in short-lived sessions
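## Appendix: backend lookup sketch

To make the "Backend behavior" steps concrete, here is a minimal sketch of the lookup flow, assuming PG 16 `RelFileLocator` naming and the globals from the shmem sketch above. `ShnswScanCache`, `shnsw_read_metapage_gen`, `shnsw_attach_shared_payload`, `shnsw_build_local_cache`, and `shnsw_publish_payload` are hypothetical stand-ins, not the helpers in `src/sorted_hnsw.c`.

```c
/*
 * Illustrative only. Relies on the includes and shnsw_* globals from the
 * shmem sketch above, plus utils/rel.h for Relation accessors. The real
 * code must also keep attached readers safe across slot replacement,
 * which this sketch elides.
 */
static ShnswScanCache *
shnsw_lookup_scan_cache(Relation index)
{
	uint64		gen = shnsw_read_metapage_gen(index);	/* step 1 */
	ShnswScanCache *cache = NULL;

	if (shnsw_shared_ctl != NULL)
	{
		/* step 2: check the shared slot under the publish lock */
		LWLockAcquire(shnsw_shared_lock, LW_SHARED);
		if (shnsw_shared_ctl->valid &&
			shnsw_shared_ctl->relid == RelationGetRelid(index) &&
			RelFileLocatorEquals(shnsw_shared_ctl->locator,
								 index->rd_locator) &&
			shnsw_shared_ctl->cache_gen == gen)
		{
			/* step 3: attach wrappers, skip SQ8/upper decode */
			cache = shnsw_attach_shared_payload(index,
												shnsw_shared_ctl,
												shnsw_shared_payload);
		}
		LWLockRelease(shnsw_shared_lock);
	}

	if (cache == NULL)
	{
		/* step 4: private decode, including any deferred L0 pages */
		cache = shnsw_build_local_cache(index, gen);

		if (shnsw_shared_ctl != NULL &&
			cache->payload_bytes <= SHNSW_SHARED_PAYLOAD_BYTES)
		{
			LWLockAcquire(shnsw_shared_lock, LW_EXCLUSIVE);
			/* publish only over an empty or stale slot, all-or-nothing */
			if (!shnsw_shared_ctl->valid ||
				shnsw_shared_ctl->cache_gen != gen)
				shnsw_publish_payload(shnsw_shared_ctl,
									  shnsw_shared_payload, cache);
			LWLockRelease(shnsw_shared_lock);
		}
	}

	return cache;
}
```

The hit path takes the lock only in shared mode, so concurrent readers do not serialize; exclusive acquisition is needed only to publish over an empty or stale slot, and a `cache_gen` mismatch always falls through to the private decode path.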