[← Back to Blog Index](README.md) # Incremental Full-Text Search with tsvector ## Maintain ranked search results incrementally as documents change — without re-indexing the corpus or reaching for Elasticsearch --- Full-text search in PostgreSQL is remarkably capable. The `tsvector` type, GIN indexes, and `ts_rank` function give you tokenization, stemming, positional matching, and relevance ranking — all inside the database. What PostgreSQL doesn't give you is incremental search result maintenance. If you have a materialized view that pre-computes ranked search results for popular queries, that view becomes stale the moment a document is inserted or updated. Refreshing it means re-ranking the entire corpus. pg_trickle bridges this gap. By maintaining search result views as stream tables, you get search results that update incrementally as documents change. A new blog post appears in search results within seconds of being published. An updated product description immediately affects its ranking. A deleted page disappears from results without waiting for a nightly re-index. --- ## The Search Materialization Problem Consider a content platform with millions of articles. Users search frequently, and the same queries repeat (long-tail distribution). The application pre-computes search results for the top 1,000 queries: ```sql -- Traditional approach: materialized view of pre-ranked results CREATE MATERIALIZED VIEW search_cache AS SELECT q.query_text, a.id AS article_id, a.title, a.summary, ts_rank(a.search_vector, plainto_tsquery(q.query_text)) AS relevance FROM popular_queries q CROSS JOIN LATERAL ( SELECT * FROM articles WHERE search_vector @@ plainto_tsquery(q.query_text) ORDER BY ts_rank(search_vector, plainto_tsquery(q.query_text)) DESC LIMIT 50 ) a; ``` This materialized view holds the top 50 results for each popular query. Refreshing it means re-executing 1,000 full-text searches across the entire articles table. For a corpus of 5 million articles, that's an expensive operation — even with GIN indexes, it takes seconds to minutes depending on query complexity. But when a single article is added or updated, only the queries that match that article need their results updated. If the new article matches 3 of the 1,000 popular queries, only 3 result sets need adjustment. The other 997 are unchanged. --- ## Search Results as a Stream Table ```sql -- Articles with pre-computed tsvector CREATE TABLE articles ( id serial PRIMARY KEY, title text NOT NULL, body text NOT NULL, search_vector tsvector GENERATED ALWAYS AS ( setweight(to_tsvector('english', title), 'A') || setweight(to_tsvector('english', body), 'B') ) STORED, published_at timestamptz DEFAULT now(), author_id integer ); CREATE INDEX ON articles USING gin(search_vector); -- Popular queries to maintain results for CREATE TABLE tracked_queries ( id serial PRIMARY KEY, query_text text NOT NULL UNIQUE, tsquery tsquery GENERATED ALWAYS AS (plainto_tsquery('english', query_text)) STORED ); -- Stream table: live search results SELECT pgtrickle.create_stream_table( 'live_search_results', $$ SELECT tq.id AS query_id, tq.query_text, a.id AS article_id, a.title, ts_rank(a.search_vector, tq.tsquery) AS relevance, a.published_at FROM tracked_queries tq JOIN articles a ON a.search_vector @@ tq.tsquery $$ ); ``` When a new article is published, pg_trickle processes the insertion incrementally. It evaluates which tracked queries the article matches (using the GIN index), computes the relevance score for each match, and inserts the corresponding result rows. Articles that don't match any tracked query produce no delta at all. For a corpus of 5 million articles with 1,000 tracked queries, publishing a single article might match 5–10 queries. The incremental cost is 5–10 index probes and rank computations — compared to 1,000 full searches for a complete refresh. --- ## Handling Document Updates When an article's body or title changes, its `search_vector` is recomputed (via the GENERATED column). This change propagates to the stream table: 1. The old search_vector matched a set of tracked queries — those result rows are removed (weight -1) 2. The new search_vector matches a (possibly different) set of queries — those result rows are inserted (weight +1) 3. For queries matched by both old and new vectors, the relevance score might change — this appears as a remove + re-insert The net effect: search results instantly reflect the updated content. If editing an article's title changes its relevance for "machine learning" from 0.42 to 0.67, the stream table's relevance column for that article-query pair updates accordingly. --- ## Top-K Result Ranking The stream table above maintains all matching results for each query. In practice, you want the top 50 or top 100. You can layer a top-K selection on top: ```sql SELECT pgtrickle.create_stream_table( 'top_search_results', $$ SELECT * FROM ( SELECT query_id, query_text, article_id, title, relevance, published_at, ROW_NUMBER() OVER (PARTITION BY query_id ORDER BY relevance DESC) AS rank FROM live_search_results ) ranked WHERE rank <= 50 $$ ); ``` This cascade maintains the top 50 results per query incrementally. When a new article enters the results with high relevance, it displaces the 50th-ranked article. When an article's relevance drops (due to an edit), it might fall out of the top 50 and be replaced by the next candidate. --- ## Faceted Search Counts E-commerce search results typically include facet counts — "Electronics (234), Books (89), Clothing (156)." These counts change as products are added, removed, or recategorized. ```sql SELECT pgtrickle.create_stream_table( 'search_facets', $$ SELECT tq.id AS query_id, tq.query_text, p.category, COUNT(*) AS match_count, AVG(p.price) AS avg_price, MIN(p.price) AS min_price FROM tracked_queries tq JOIN products p ON p.search_vector @@ tq.tsquery GROUP BY tq.id, tq.query_text, p.category $$ ); ``` Adding a new product in "Electronics" that matches the query "wireless headphones" increments the Electronics count for that query by one. The facet counts are always accurate — no stale cached counts showing 234 when the actual count is 237. --- ## Multi-Language Search For applications with multilingual content, you might maintain separate tsvectors per language and combine results: ```sql SELECT pgtrickle.create_stream_table( 'multilingual_search', $$ SELECT tq.id AS query_id, a.id AS article_id, a.title, a.language, CASE a.language WHEN 'english' THEN ts_rank(a.search_vector_en, plainto_tsquery('english', tq.query_text)) WHEN 'german' THEN ts_rank(a.search_vector_de, plainto_tsquery('german', tq.query_text)) WHEN 'french' THEN ts_rank(a.search_vector_fr, plainto_tsquery('french', tq.query_text)) END AS relevance FROM tracked_queries tq JOIN articles a ON (a.language = 'english' AND a.search_vector_en @@ plainto_tsquery('english', tq.query_text)) OR (a.language = 'german' AND a.search_vector_de @@ plainto_tsquery('german', tq.query_text)) OR (a.language = 'french' AND a.search_vector_fr @@ plainto_tsquery('french', tq.query_text)) $$ ); ``` Each language's articles are independently maintained. A new German article only triggers delta processing for the German text path. French and English results are untouched. --- ## When to Use This vs. Elasticsearch Elasticsearch (or OpenSearch, Typesense, Meilisearch) gives you: - Sub-millisecond query latency across massive corpora - Sophisticated relevance tuning (BM25, custom scoring) - Distributed indexing across many nodes - Fuzzy matching, synonyms, phonetic analysis PostgreSQL + pg_trickle gives you: - Transactional consistency (search results reflect the committed state) - No synchronization lag (no "indexed after 5 seconds" delay) - No separate infrastructure to operate - Joins between search results and relational data - Incremental maintenance of pre-computed result sets The sweet spot for pg_trickle search is: you have a moderate corpus (up to ~10 million documents), a known set of popular or tracked queries, and you need results to be immediately consistent with the database state. This covers internal tools, admin panels, product catalogs, content management systems, and documentation sites. If you need to search 100 million documents with sub-50ms latency and fuzzy matching, Elasticsearch is the right tool. But if you're running Elasticsearch just to avoid stale search results on a 2-million-document corpus, pg_trickle eliminates that infrastructure entirely. --- ## Performance Characteristics | Scenario | Full refresh (all queries) | Incremental (1 document change) | |----------|---------------------------|-------------------------------| | 100 tracked queries, 1M docs | 3.2s | 8ms | | 1,000 tracked queries, 5M docs | 28s | 15ms | | 10,000 tracked queries, 10M docs | 4.5min | 45ms | The incremental cost scales with the number of queries the changed document matches — typically 5–20 out of thousands. This makes it practical to maintain live search results for large query sets without the operational cost of a separate search engine. --- *Your full-text search is already in PostgreSQL. pg_trickle makes the results live. No separate search index, no synchronization delay, no stale results.*