> **Plain-language companion:** [v0.40.0.md](v0.40.0.md) ## v0.40.0 — Embedding API & Advanced RAG Patterns **Status: Planned.** Derived from [plans/ecosystem/PLAN_PGVECTOR.md](plans/ecosystem/PLAN_PGVECTOR.md) §6.4 (VA-1 through VA-5). > **Release Theme** > v0.40.0 consolidates the pgvector integration programme (v0.37–v0.39) into > a production-ready, ergonomic surface layer. The `embedding_stream_table()` > API enables one-function RAG corpus setup; materialised k-NN graph research > explores fixed-pivot retrieval patterns; per-tenant ANN indexing is > documented as a production pattern; and co-marketing with pgvector/pgai > positions pg_trickle as the default incremental-view solution for AI on > PostgreSQL. --- ### Features | ID | Title | Effort | Priority | |----|-------|--------|----------| | VA-1 | `embedding_stream_table()` ergonomic API: one-call RAG corpus setup | L | P1 | | VA-2 | Materialised k-NN graph research: explore fixed-pivot retrieval *(parallel research spike — not a release gate)* | L | P3 | | VA-3 | Per-tenant ANN indexing patterns: RLS-scoped embedding corpora | M | P2 | | VA-4 | Outbox-emitted embedding events for downstream consumption | M | P2 | | VA-5 | Joint case study / co-marketing with pgvector and pgai teams *(best-effort, not a release gate)* | — | P1 | **VA-1 — `embedding_stream_table()` ergonomic API.** Add a high-level function that auto-generates the full denormalization query, creates the stream table, indexes, and monitoring: ```sql SELECT pgtrickle.embedding_stream_table( name => 'product_search_corpus', source_table => 'products', embedding_column => 'embedding', -- single column name or ARRAY[...] joins => ARRAY[ ('product_tags', 'product_id', 'tag', 'name'), -- auto-JOIN syntax ('product_metadata', 'product_id', 'metadata') ], aggregates => ARRAY['category', 'price_bin'], -- GROUP BY cols filters => 'active = true AND seller_rating > 4', schedule => '10 seconds', refresh_mode => 'DIFFERENTIAL', index_type => 'hnsw', -- or 'ivfflat' post_refresh_action => 'reindex_if_drift', reindex_drift_threshold => 0.15, vector_agg => 'vector_avg', -- if aggregating over embeddings retention => '90 days' -- optional PITR window ); ``` Under the hood, this constructs: ```sql SELECT p.id, p.name, p.description, p.embedding, array_agg(pt.tag) AS tags, pm.metadata, p.category, p.price_bin FROM products p LEFT JOIN product_tags pt ON pt.product_id = p.id LEFT JOIN product_metadata pm ON pm.product_id = p.id WHERE p.active AND p.seller_rating > 4 GROUP BY p.id, p.category, p.price_bin, pm.metadata ``` And then: 1. Creates the stream table with the inferred schema. 2. Creates an HNSW index on the embedding column. 3. Creates B-tree or GIN indexes on join-result columns as needed. 4. Configures post-refresh actions, monitoring, and retention. 5. Returns a summary JSON: `{created_at, st_name, embedding_dim, approx_rows, index_size_mb}`. Benefits: - 80% of RAG users don't need to write SQL — this covers their use case. - Maintains SQL-first flexibility for power users (they use `create_stream_table` directly). - Lowers the barrier to entry for AI/ML engineers unfamiliar with PostgreSQL query authoring. Required design constraints: - **Show generated SQL (P1):** The function must accept a `dry_run => true` parameter that returns the generated `CREATE STREAM TABLE` + `CREATE INDEX` SQL without executing it. This is essential for expert users who want to audit or customise the output. - **Index-inference heuristics (P1):** When `index_type` is omitted, document the inference rule explicitly: HNSW is chosen for `vector` and `halfvec` columns; IVFFlat is chosen only when explicitly requested. The heuristic and its rationale must be documented in the function's SQL comment and in `docs/SQL_REFERENCE.md`. **VA-2 — Materialised k-NN graph (research spike).** > **Scope change (assessment 2026-04-27):** VA-2 has been reclassified from a > v0.40.0 release item to a **parallel research spike**. The roadmap itself > acknowledged it is "likely not production-ready in v0.40.0." Keeping it as > a release gate on a Large release with uncertain scope creates risk. Open a > dedicated research tracking issue; VA-2 findings may graduate to v0.41.0 or > later. T-VA2 is optional/informational in v0.40.0. Explore whether pre-computing a k-NN graph can accelerate approximate retrieval for fixed query pivots: - Given a set of N embedding vectors and a set of P pivot vectors, maintain a stream table of `(pivot_id, embedding_id, distance)` tuples representing the k-nearest embeddings to each pivot. - Application queries: `SELECT * FROM knn_graph WHERE pivot_id = 42 ORDER BY distance LIMIT 10`. Research questions: - Storage/compute trade-off: is maintaining k·N rows cheaper than an HNSW index scan? - Clustering lifecycle: when pivots change, how do we efficiently rebuild the graph? - Heterogeneous k: can we keep the k-NN graph up-to-date when k varies per pivot? Outcome: a research document (`docs/research/MATERIALISED_KNN.md`) and optionally a proof-of-concept implementation if the trade-off is favorable. Likely not production-ready in v0.40.0; may be promoted to v0.41.0 or later. **VA-3 — Per-tenant ANN indexing patterns.** Document and exemplify the pattern for multi-tenant RAG: ```sql -- Create a tenant-scoped embedding stream table using RLS (v0.5) SELECT pgtrickle.create_stream_table( 'tenant_embeddings_scoped', $$ SELECT e.id, e.embedding, e.tenant_id, e.chunk_text FROM embeddings e WHERE e.tenant_id = current_setting('app.current_tenant_id')::int $$, rls_context => 'app.current_tenant_id', schedule => '5 seconds' ); CREATE INDEX ON tenant_embeddings_scoped USING hnsw (embedding vector_cosine_ops); -- Application queries automatically filtered by RLS: SELECT id FROM tenant_embeddings_scoped WHERE embedding <=> $1 < 0.1 LIMIT 20; ``` Pattern documentation: `docs/tutorials/MULTI_TENANT_RAG.md`. Security audit checklist: verify that RLS policies are correctly applied to stream tables, that cross-tenant data never leaks, and that the ST inherits permissions from the source table. **VA-4 — Outbox-emitted embedding events.** Extend the outbox (v0.28) to emit embedding-change events: ```sql SELECT pgtrickle.create_stream_table( 'embedding_changes_outbox', $$ SELECT doc_id, old_embedding, new_embedding, 'EMBEDDING_UPDATED' AS event_type FROM docs_embedded_audit $$, outbox_name => 'embedding_outbox' ); ``` Downstream systems (AI agents, re-rankers, external embedding services) can subscribe to the outbox and react to embedding changes. Enables event-driven architecture for RAG applications. **VA-5 — Joint case studies and co-marketing.** > **Best-effort only (assessment 2026-04-27):** VA-5 depends on third-party > coordination (pgvector and pgai team willingness, timing, publishing > schedules). It must not be a blocking exit criterion for the v0.40.0 release. > If outreach has not resulted in a commitment by feature-freeze, ship with > an internal cookbook and a public solo blog post instead. Partner with pgvector and pgai teams: - **Blog post:** "PostgreSQL as a Vector Database: pg_trickle + pgvector + pgai" (cross-published on blogs). - **Case study:** Real production system (anonymized as needed) maintaining a document-embedding corpus with pg_trickle + pgvector + pgai, serving semantic search + LLM context retrieval. - **Benchmark:** TPC-H with hybrid-search queries (vector similarity + SQL filters + BM25 if using pg_search) on pg_trickle stream tables. Measure latency, throughput, and resource consumption. - **Repository:** Create a public example repo (`pg-trickle-rag-starter`) with Docker Compose, sample data, and working RAG application code (Python + LangChain / LlamaIndex). Strategic positioning: pg_trickle is the incremental-maintenance layer for AI/RAG on PostgreSQL. Combined with pgvector (indexing) and pgai (embedding generation), users get a complete, open-source, self-hosted RAG stack with no external vector database. ### Test Coverage | ID | Title | Effort | Priority | |----|-------|--------|----------| | T-VA1 | Integration test: `embedding_stream_table()` auto-generates correct ST | M | P1 | | T-VA2 | k-NN graph proof of concept: verify throughput trade-off | L | P2 | | T-VA3 | Multi-tenant security test: RLS policies applied to embedding STs | M | P1 | | T-VA4 | Outbox event emission for embedding changes | M | P2 | **T-VA1.** `tests/e2e_embedding_api_tests.rs`: 1. Call `embedding_stream_table()` with a realistic joins + aggregates spec. 2. Verify the generated stream table exists with the expected schema. 3. Verify indexes are created correctly. 4. Insert new source rows; verify stream table refreshes correctly. 5. Query the stream table and assert results match the manually-written equivalent `create_stream_table()` with identical query. **T-VA2.** `tests/e2e_knn_graph_research_tests.rs` (may be optional / research-only): 1. Create embedding stream table with 100k rows. 2. Define a set of 10 pivot vectors (e.g., cluster centroids). 3. Maintain a k-NN graph ST listing the 50 nearest embeddings to each pivot. 4. Measure: row count, index size, refresh latency vs. direct HNSW index. 5. Benchmark: single-pivot query via k-NN graph vs. direct `LIMIT 50` on HNSW index. **T-VA3.** `tests/e2e_multi_tenant_embedding_tests.rs`: 1. Create two tenant contexts (tenant_id 1 and 2) with separate embedding data. 2. Create an embedding stream table with RLS policy: `WHERE tenant_id = current_setting('app.current_tenant_id')::int`. 3. Connect as tenant 1; query stream table; verify only tenant 1's embeddings returned. 4. Connect as tenant 2; query stream table; verify only tenant 2's embeddings returned. 5. Verify that an admin can see the full table (or is restricted appropriately per policy). **T-VA4.** `tests/e2e_embedding_outbox_tests.rs`: 1. Create an embedding stream table with `outbox_name` configured. 2. Insert source rows; verify outbox contains the corresponding embedding records. 3. Verify downstream subscriber receives the outbox events. ### Conflicts & Risks - **VA-1** (embedding_stream_table API) is syntactic sugar; it hides complexity and may lead to suboptimal queries if users don't understand the generated SQL. Mitigation: `dry_run => true` shows the generated SQL (P1 requirement); index-inference heuristics are documented explicitly. - **VA-2** (k-NN graph) is a parallel research spike; it does not block the release. See the scope-change note above. - **VA-3** (multi-tenant RLS) must be audited carefully. RLS can be bypassed by a careless admin. Include security best practices in the documentation. - **VA-5** (co-marketing) depends on willingness from pgvector/pgai teams. Outreach should begin during v0.38 timeframe. See the best-effort note above. ### Exit Criteria - [ ] VA-1: `embedding_stream_table()` function implemented, generates correct SQL, creates indexes and monitoring - [ ] VA-1: `dry_run => true` parameter returns generated SQL without executing it - [ ] VA-1: Index-inference heuristic (HNSW default for vector/halfvec; IVFFlat only if explicit) documented in SQL comment and `docs/SQL_REFERENCE.md` - [ ] VA-1: Function auto-generates query for 80%+ of common RAG patterns (single source + 2–3 joins + GROUP BY) - [ ] VA-1: Calling `embedding_stream_table()` produces identical results to manually-written equivalent - [ ] VA-2: Research spike tracking issue opened; `docs/research/MATERIALISED_KNN.md` published with trade-off analysis *(optional: T-VA2 PoC if favorable)* - [ ] VA-3: `docs/tutorials/MULTI_TENANT_RAG.md` published with security checklist - [ ] VA-3: `tests/e2e_multi_tenant_embedding_tests.rs` passes (RLS policies enforced) - [ ] VA-4: Outbox events emitted correctly on embedding stream table inserts - [ ] VA-5: *(best-effort)* Blog post published (co-published or pg_trickle solo); `pg-trickle-rag-starter` repo available on GitHub - [ ] Extension upgrade path tested (`0.39.0 → 0.40.0`) - [ ] `just check-version-sync` passes ---