# pg_kazsearch

The first PostgreSQL full-text search extension for the Kazakh language.

Kazakh is heavily agglutinative: a single word like `мектептерімізде` carries plural, possessive, and locative suffixes that must all be stripped to reach the root `мектеп`. No existing PostgreSQL or Elasticsearch analyzer handles this. pg_kazsearch fills that gap with a C extension that plugs directly into PostgreSQL's text search pipeline.

---

## What it does

```sql
-- Install and configure
CREATE EXTENSION pg_kazsearch;

CREATE TEXT SEARCH CONFIGURATION kazakh_cfg (PARSER = pg_catalog.default);
ALTER TEXT SEARCH CONFIGURATION kazakh_cfg
    ALTER MAPPING FOR word, hword, hword_part
    WITH pg_kazsearch_stop, pg_kazsearch_dict, simple;

-- Index your table
ALTER TABLE articles ADD COLUMN fts_vector tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('kazakh_cfg', title), 'A') ||
        setweight(to_tsvector('kazakh_cfg', body), 'B')
    ) STORED;

CREATE INDEX idx_fts ON articles USING GIN (fts_vector);

-- Search in Kazakh
SELECT title
FROM articles
WHERE fts_vector @@ websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')
ORDER BY ts_rank_cd(fts_vector, websearch_to_tsquery('kazakh_cfg', 'президенттің жарлығы')) DESC
LIMIT 10;
```

The stemmer normalizes both query and document terms, so `президенттің` (president's) matches `президент`, `мектептерімізде` matches `мектеп`, and `өзгеруі` matches `өзгеру`.

---

## Stemmer quality

Tested on 2,999 Kazakh news articles (tengrinews.kz) with 9,048 evaluation queries:

| Metric | pg_kazsearch | pg_trgm (trigram) |
|--------|-------------|-------------------|
| Recall@10 | **0.784** | 0.635 |
| MRR@10 | **0.712** | 0.566 |
| nDCG@10 | **0.729** | 0.582 |
| Query latency | **0.5 ms** | 1.4 ms |

pg_kazsearch beats trigram by **+15 percentage points** on Recall@10 (0.784 vs. 0.635).

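
For readers unfamiliar with the ranking metrics in the table above, here is a minimal Python sketch of how Recall@10, MRR@10, and nDCG@10 are commonly computed for a single query under binary relevance. It is an illustration only, not the code in `eval/run_eval.py`: the function names are hypothetical, and it assumes each query comes with a set of relevant article IDs and a ranked list of unique result IDs.

```python
import math
from typing import Hashable, Sequence, Set


def recall_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def mrr_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Reciprocal rank of the first relevant result in the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int = 10) -> float:
    """Binary-gain DCG over the top k, normalized by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in relevant
    )
    # Ideal ranking: all relevant documents (up to k) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging each function over all evaluation queries gives corpus-level figures of the kind reported in the table. How the actual pipeline defines relevance per query may differ from this sketch.
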
### Stemmer examples

| Input | Output | Morphology stripped |
|-------|--------|-------------------|
| мектептерімізде | мектеп | plural + possessive + locative |
| президенттерінің | президент | plural + possessive + genitive |
| өзгеруі | өзгеру | verbal noun possessive |
| берді | бер | past tense |
| экономикалық | экономика | derivational adjective |
| алматыға | алматы | dative case (proper noun) |
| көмек | көмек | protected (lexicon-known root) |

---

## Architecture

The extension consists of:

- **BFS suffix stripper** (`kaz_explore.c`): breadth-first search over layered suffix rules (predicate, case, possessive, plural, derivational for nouns; person, tense, negation, voice for verbs), with vowel harmony validation and phonological guards
- **Penalty scoring** (`kaz_explore.c`): candidates scored by syllable count, suffix weakness, derivational depth, and lexicon hits to pick the best stem
- **Lexicon** (`kaz_stems.dict`): 21,863 POS-tagged stems extracted from Apertium-kaz's morphological transducer, filtered to root forms only (nouns, verbs, adjectives, place names)
- **Stopwords** (`kaz_stopwords.stop`): 53 Kazakh function words filtered before stemming
- **Vowel harmony** (`kaz_text.c`): back/front vowel classification with glide exclusion (у/и/ю treated as consonants for harmony) and tail-based fallback for loanwords
- **Stem repair** (`kaz_explore.c`, `pg_kazsearch.c`): consonant mutation reversal (б→п, г→к, ғ→қ), vowel elision restoration, and lexicon-based vowel append for proper nouns

---

## Quick start

```bash
# Prerequisites: Docker
git clone https://github.com/darkhanakh/pg-kazsearch.git
cd pg-kazsearch

make up          # start PostgreSQL with the extension
make reload      # build, install, and configure kazakh_cfg
make test-ext    # smoke test stemmer + tsvector

# Load the evaluation corpus (optional)
python3 eval/load_corpus.py --input data/corpus/articles.jsonl

# Run the evaluation
python3 eval/run_eval.py --trgm-sample 500
```

---

## Project structure

| Directory | Contents |
|-----------|----------|
| `src/pg_kazsearch/` | C extension: stemmer dictionary, suffix rules, BFS explorer, text utilities, lexicon loader |
| `data/tsearch_data/` | Stem dictionary (`kaz_stems.dict`) and stopword list (`kaz_stopwords.stop`) |
| `scripts/` | `build_lexicon.py`: extracts POS-tagged lemmas from Apertium-kaz |
| `eval/` | Evaluation pipeline: scraper, corpus loader, query generator, FTS vs trigram eval |
| `docker/` | Dockerfile and init SQL for local development |
| `prototype/` | Python stemmer prototypes (v1-v3) used during research phase |
| `benchmark/` | Performance and parity benchmarks |

---

## Lexicon

The stem dictionary is built from [Apertium-kaz](https://github.com/apertium/apertium-kaz), a linguistically vetted morphological transducer for Kazakh. Only entries with explicit POS continuation classes are included:

- **N1/N5/N6**: common nouns (13,900+)
- **V-TV/V-IV**: transitive/intransitive verbs (3,500+)
- **A1/A2**: base adjectives (3,200+)
- **NP-TOP/NP-ORG**: place names and organizations (1,800+)
- **ADV/NUM**: adverbs and numerals (900+)

Derived adjectives (A3/A4), personal names (NP-ANT/NP-COG), and inflected forms are excluded to keep the dictionary clean for stemmer disambiguation.

Rebuild with:

```bash
python3 scripts/build_lexicon.py
```

---

## References

- Krippes, K.A. (1993). *Kazakh (Qazaq-) Grammatical Sketch with Affix List*. ERIC.
- Washington, J., Salimzyanov, I., Tyers, F. (2014). *Finite-state morphological transducers for three Kypchak languages*. LREC.
- Makhambetov, O. et al. (2015). *Data-driven morphological analysis and disambiguation for Kazakh*. CICLing.
- Tolegen, G., Toleu, A., Mussabayev, R. (2022). *A Finite State Transducer Based Morphological Analyzer for Kazakh Language*. IEEE UBMK.

---

## License

- **Code:** MIT
- **Lexicon data** derived from [Apertium-kaz](https://github.com/apertium/apertium-kaz) (GPL-3.0) and KazNU morphology resources (CC BY-SA).