---
title: Ngram
description: Splits text into small chunks called grams, useful for partial matching
canonical: https://docs.paradedb.com/documentation/tokenizers/available-tokenizers/ngrams
---

The ngram tokenizer splits text into "grams," where each "gram" is of a certain length. The tokenizer takes two arguments: the minimum character length of a "gram" and the maximum character length. Grams are generated for all sizes between the minimum and maximum gram size, inclusive. For example, `pdb.ngram(2,5)` generates tokens of size `2`, `3`, `4`, and `5`. To generate grams of a single fixed length, set the minimum and maximum gram size equal to each other.

```sql
CREATE INDEX search_idx ON mock_items
USING bm25 (id, (description::pdb.ngram(3,3)))
WITH (key_field='id');
```

To get a feel for this tokenizer, run the following command and replace the text with your own:

```sql
SELECT 'Tokenize me!'::pdb.ngram(3,3)::text[];
```

```ini
Expected Response
                      text
-------------------------------------------------
 {tok,oke,ken,eni,niz,ize,"ze ","e m"," me",me!}
(1 row)
```

## Ngram Prefix Only

To generate ngram tokens for only the first `n` characters in the text, set `prefix_only` to `true`.

```sql
SELECT 'Tokenize me!'::pdb.ngram(3,3,'prefix_only=true')::text[];
```

```ini
Expected Response
 text
-------
 {tok}
(1 row)
```

## Phrase and Proximity Queries with Ngram

Because multiple ngram tokens can overlap, the ngram tokenizer does not store token positions. As a result, queries that rely on token positions, such as [phrase](/documentation/full-text/phrase), [phrase prefix](/documentation/query-builder/phrase/phrase-prefix), [regex phrase](/documentation/query-builder/phrase/regex-phrase), and [proximity](/documentation/full-text/proximity), are not supported over ngram-tokenized fields. An exception is when the min gram size equals the max gram size, which guarantees unique token positions.
In this case, setting `positions=true` enables these queries.

```sql
SELECT 'Tokenize me!'::pdb.ngram(3,3,'positions=true')::text[];
```

### Exact Substring Matching with Phrase Queries

With `positions=true`, [phrase queries](/documentation/full-text/phrase) over ngram fields perform exact substring matching. This is faster than using [match conjunction](/documentation/full-text/match#match-conjunction) on an ngram field, which creates a `Must` clause for every ngram token and intersects them independently. A phrase query uses a single positional intersection instead. The tradeoff is that phrase queries are stricter: they require tokens at consecutive positions within a single field value, while match conjunction only requires all tokens to appear somewhere in the document.

```sql
CREATE TABLE books (id SERIAL PRIMARY KEY, titles TEXT[]);
INSERT INTO books (titles) VALUES
  (ARRAY['The Dragon Hatchling', 'Wings of Gold']),
  (ARRAY['Dragon Slayer', 'Hatchling Care']);

CREATE INDEX ON books
USING bm25 (id, (titles::pdb.ngram(4,4,'positions=true')))
WITH (key_field='id');

-- Phrase: matches exact substring "Dragon Hatchling" — only row 1
SELECT * FROM books WHERE titles ### 'Dragon Hatchling';

-- Match conjunction: matches all ngrams anywhere — also only row 1 here,
-- but on larger datasets could match rows where the ngrams are scattered
SELECT * FROM books WHERE titles ||| 'Dragon Hatchling';

DROP TABLE books;
```

When constructing queries as [JSON](/documentation/query-builder/json), use `tokenized_phrase` to achieve the same result as the `###` operator. It tokenizes the input string with the field's tokenizer and builds a phrase query from the resulting tokens:

```json
{
  "tokenized_phrase": {
    "field": "titles",
    "phrase": "Dragon Hatchling"
  }
}
```
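To show the JSON object in context, here is a sketch of running it against the `books` example above, assuming the `jsonb` query syntax described in the linked [JSON query builder](/documentation/query-builder/json) docs, where a JSON query object is passed to the `@@@` operator on the key field:

```sql
-- Sketch: tokenized_phrase as a jsonb query (assumes the books table
-- and its bm25 index from the example above already exist)
SELECT * FROM books
WHERE id @@@ '{
  "tokenized_phrase": {
    "field": "titles",
    "phrase": "Dragon Hatchling"
  }
}'::jsonb;
```

This should behave the same as `titles ### 'Dragon Hatchling'`: the input is tokenized with the field's ngram tokenizer and matched as a phrase.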