# kazsearch — Kazakh Stemmer for Elasticsearch Analysis plugin that adds a `kazsearch_stem` token filter for Kazakh full-text search. BFS suffix-stripping with vowel harmony enforcement — the first Kazakh stemmer for Elasticsearch. ## Compatibility | Plugin version | Elasticsearch | Java | Architecture | |---|---|---|---| | 0.1.0 | 8.17.x | 21+ | linux/aarch64, linux/x86_64, darwin/aarch64 | ## Quick Start The install script auto-downloads the latest release from GitHub: ```bash # Download latest release + install into local ES: ./elastic/install.sh # Or a specific version: ./elastic/install.sh --version 2.1.0 # Or Docker (no ES install needed): ./elastic/install.sh --docker ``` The release zip is self-contained — Java jar + native Rust library for both x86_64 and aarch64. ## Install Methods ### Method 1: Auto-download (recommended) ```bash # Downloads latest from GitHub Releases, installs into ES_HOME ./elastic/install.sh sudo systemctl restart elasticsearch ``` ### Method 2: Manual download Grab `analysis-kazsearch-.zip` from [GitHub Releases](https://github.com/darkhanakh/pg-kazsearch/releases), then: ```bash bin/elasticsearch-plugin install file:///path/to/analysis-kazsearch-2.1.0.zip sudo systemctl restart elasticsearch ``` ### Method 3: Docker ```bash # From release: ./elastic/install.sh --docker --version 2.1.0 # Or from source: ./elastic/install.sh --docker # Then run: docker run -d --name es-kazsearch \ -p 9200:9200 \ -e discovery.type=single-node \ -e xpack.security.enabled=false \ kazsearch-elastic:2.1.0 ``` ### Method 4: Build from source Requirements: Rust toolchain, Java 21, Gradle. ```bash ./elastic/install.sh --build sudo systemctl restart elasticsearch ``` ## Usage ### 1. Create an index with the Kazakh analyzer ```json PUT /articles { "settings": { "analysis": { "filter": { "kaz_stem": { "type": "kazsearch_stem" } }, "analyzer": { "kazakh": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase", "kaz_stem"] } } } }, "mappings": { "properties": { "title": { "type": "text", "analyzer": "kazakh" }, "body": { "type": "text", "analyzer": "kazakh" } } } } ``` ### 2. Test the analyzer ```bash curl -X POST "localhost:9200/articles/_analyze" \ -H 'Content-Type: application/json' \ -d '{"analyzer": "kazakh", "text": "мектептерде оқушылар"}' ``` Result: `["мектеп", "оқушы"]` — suffixes stripped, stems returned. ### 3. Index documents ```bash curl -X POST "localhost:9200/articles/_doc" \ -H 'Content-Type: application/json' \ -d '{"title": "Мектептерде жаңа оқу жылы", "body": "Оқушылар жаңа оқулықтарды алды."}' ``` ### 4. Search with any inflected form ```bash # Searching "мектептегі" (at school) finds docs containing # "мектептерде" (in schools), "мектепте" (at school), etc. curl -X POST "localhost:9200/articles/_search" \ -H 'Content-Type: application/json' \ -d '{"query": {"match": {"body": "мектептегі оқушыларды"}}}' ``` ## What It Does Kazakh is agglutinative — a single word can stack 5–6 suffixes: ``` мектептерінде = мектеп + тер (plural) + і (possessive) + нде (locative) "in their schools" ``` The plugin stems all inflected forms to a common root, so search works regardless of which grammatical form appears in the document or query: | Query form | Document form | Both stem to | |---|---|---| | мектептерде | мектептің | мектеп | | оқушыларды | оқушылар | оқушы | | дәрігерлерге | дәрігерлердің | дәрігер | | технологияларды | технологиясын | технология | ## Recommended Index Pattern For best results, index with both a stemmed and a standard (unstemmed) sub-field: ```json { "mappings": { "properties": { "title": { "type": "text", "analyzer": "kazakh", "fields": { "exact": { "type": "text", "analyzer": "standard" } } } } } } ``` This lets you boost exact matches while still getting stemmed recall: ```json { "query": { "multi_match": { "query": "Қазақстан экономикасы", "fields": ["title^3", "title.exact^5", "body"] } } } ``` ## Architecture ``` ┌─────────────────────────────────────────────┐ │ Elasticsearch 8.17 │ │ │ │ ┌──────────────────────────────────────┐ │ │ │ analysis-kazsearch plugin (Java) │ │ │ │ │ │ │ │ KazakhStemTokenFilter │ │ │ │ └─► KazakhStemmerNative (JNI) │ │ │ │ └─► libkazsearch_elastic.so │ │ │ │ └─► kazsearch-core │ │ │ │ (pure Rust) │ │ │ └──────────────────────────────────────┘ │ └─────────────────────────────────────────────┘ ``` The stemmer logic lives in `kazsearch-core` (pure Rust, no ES/PG dependencies). The same core powers both the PostgreSQL extension and this Elasticsearch plugin. ## Troubleshooting **Plugin fails to load with `UnsatisfiedLinkError`:** The native library for your platform is missing from the zip. Rebuild with `./scripts/build_elastic_native.sh` on a machine matching your ES server's OS/arch, then `gradle bundlePlugin`. **No stemming effect (tokens unchanged):** Make sure your index analyzer uses the `kazsearch_stem` filter. Test with `_analyze` API first. **ES won't start after install:** Check ES version matches (8.17.x). Check `elasticsearch.log` for the exact error. Run `bin/elasticsearch-plugin remove analysis-kazsearch` to uninstall.