---
title: Search Tokenizer
description: Use a different tokenizer at search time than at index time
canonical: https://docs.paradedb.com/documentation/tokenizers/search-tokenizer
---

By default, ParadeDB uses the same tokenizer at both index time and search time. This makes sense for most cases — you want queries tokenized the same way the data was indexed.

But sometimes you need different tokenizers. The classic example is **autocomplete**:

- **Index time** — edge ngram: `"shoes"` → `s`, `sh`, `sho`, `shoe`, `shoes`
- **Search time** — unicode: `"sho"` → `sho`

If you used edge ngram at search time too, typing `"sho"` would produce `s`, `sh`, `sho` — matching far too many documents.

## Usage

Set `search_tokenizer` as a `WITH` option on the index to define a default search-time tokenizer for all text and JSON fields:

```sql
CREATE INDEX search_idx ON products
USING bm25 (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
WITH (key_field = 'id', search_tokenizer = 'unicode_words');
```

With this configuration:

- **Index time**: `title` is tokenized with edge ngram to create prefix tokens
- **Search time**: queries against `title` automatically use the unicode tokenizer

The `search_tokenizer` value can include parameters, e.g. `search_tokenizer='simple(lowercase=false)'`.
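As a sketch of the parameterized form (the index name `case_sensitive_idx` is illustrative, reusing the `products` table from this page):

```sql
-- Illustrative index: queries against it are tokenized with
-- simple(lowercase=false), keeping query tokens case-sensitive.
CREATE INDEX case_sensitive_idx ON products
USING bm25 (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
WITH (key_field = 'id', search_tokenizer = 'simple(lowercase=false)');
```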
Because `search_tokenizer` only affects query-time behavior, you can change it without reindexing:

```sql
ALTER INDEX search_idx SET (search_tokenizer = 'simple(lowercase=false)');
```

## Example

```sql
CREATE TABLE products (
  id serial8 NOT NULL PRIMARY KEY,
  title text
);

INSERT INTO products (title)
VALUES ('shoes'), ('shirt'), ('shorts'), ('shoelaces'), ('socks');

CREATE INDEX idx_products ON products
USING bm25 (id, (title::pdb.ngram(1, 10, 'prefix_only=true')))
WITH (key_field = 'id', search_tokenizer = 'unicode_words');

-- "sho" stays as one token → matches shoes, shorts, shoelaces
SELECT id, title FROM products WHERE title ||| 'sho' ORDER BY id;

-- "s" stays as one token → matches all five titles
SELECT id, title FROM products WHERE title ||| 's' ORDER BY id;
```

Without `search_tokenizer`, the query `'sho'` would be edge-ngrammed into `s`, `sh`, `sho` and match every title starting with `s` — not just those starting with `sho`.

## Overriding at Query Time

You can still override the search tokenizer for a specific query by casting the query string:

```sql
-- Force edge ngram tokenization at query time
SELECT id, title FROM products
WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true')
ORDER BY id;
```

## Priority

When resolving which tokenizer to use at search time, ParadeDB checks in this order:

1. **Query-level cast** — e.g. `'sho'::pdb.ngram(...)` (highest priority)
2. **Index-level WITH option** — e.g. `WITH (search_tokenizer='unicode_words')`
3. **Index-time tokenizer** — the tokenizer used to build the index (fallback)

## Supported Tokenizers

Any [available tokenizer](/documentation/tokenizers/overview) can be used as a `search_tokenizer`: `unicode_words`, `simple`, `whitespace`, `ngram`, `literal`, `literal_normalized`, `chinese_compatible`, `lindera`, `icu`, `jieba`, `source_code`.
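To make the priority order concrete, here is a sketch reusing the `products` table and `idx_products` index from the example above:

```sql
-- 1. Query-level cast wins: this query is tokenized with edge ngram,
--    even though the index sets search_tokenizer = 'unicode_words'.
SELECT id, title FROM products
WHERE title ||| 'sho'::pdb.ngram(1, 10, 'prefix_only=true');

-- 2. Without a cast, the index-level WITH option applies,
--    so 'sho' stays a single unicode_words token.
SELECT id, title FROM products WHERE title ||| 'sho';

-- 3. If the index declared no search_tokenizer option, queries would
--    fall back to the index-time tokenizer (edge ngram here).
```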