# kham_pg PostgreSQL text-search extension for the Thai language. Provides a custom parser, phonetic dictionary, and ready-to-use FTS configuration so Thai documents can be indexed and queried with `tsvector` / `tsquery`. Thai has no spaces between words. Standard PostgreSQL parsers treat an entire Thai sentence as one token. kham_pg uses the kham newmm segmentation engine to split Thai text into correct word boundaries, then expands each token into up to three lexemes at the same `tsvector` position: 1. The normalised word itself 2. Its lk82 Thai Soundex code — enables phonetic-fuzzy search 3. Its RTGS romanization — enables Latin-script search Named entities (persons, places, organisations) are tagged automatically. ## Install **Try Option 1 first.** Pre-built binaries require no compiler or Rust toolchain. Fall back to Option 2 only if a pre-built binary is not available for your platform or PostgreSQL version. --- ### Option 1 — Pre-built binary (recommended, no Rust required) Pre-compiled `.so` files are available for **Linux x86_64** and **Linux aarch64** (AWS Graviton, Ampere) for PostgreSQL 14–18 on the [GitHub Releases page](https://github.com/preedep/kham/releases/latest). **Prerequisites** | Requirement | Notes | |-------------|-------| | PostgreSQL 14–18 | Server must be installed; `pg_config` must be in `PATH` | | Linux x86\_64 or aarch64 | Pre-built binaries are Linux-only | **Steps** ```bash # 1. Unzip the PGXN distribution (provides control + SQL files) unzip kham_pg-0.6.0.zip cd kham_pg-0.6.0 # 2. Download the pre-built .so for your PG version and architecture # Replace PG=17 and ARCH=x86_64 as needed (14–18, x86_64 or aarch64) PG=17 ARCH=x86_64 VERSION=0.6.0 curl -fsSL \ "https://github.com/preedep/kham/releases/download/v${VERSION}/kham-pg-v${VERSION}-pg${PG}-${ARCH}-unknown-linux-gnu.tar.gz" \ | tar xz # extracts libkham_pg.so # 3. Install the .so and extension files (sudo required for system PG) PG_CONFIG=/usr/lib/postgresql/${PG}/bin/pg_config sudo install -m 755 libkham_pg.so "$($PG_CONFIG --pkglibdir)/kham_pg.so" sudo install -m 644 kham_pg.control "$($PG_CONFIG --sharedir)/extension/" sudo install -m 644 sql/kham_pg--${VERSION}.sql "$($PG_CONFIG --sharedir)/extension/" # 4. Load the extension in psql psql -c "CREATE EXTENSION kham_pg;" ``` --- ### Option 2 — Build from source (fallback) Use this path if no pre-built binary is available for your platform or PostgreSQL version. Supports any platform where PostgreSQL and Rust are available (Linux, macOS). **Prerequisites** | Requirement | Notes | |-------------|-------| | PostgreSQL 14–18 | `pg_config` must be in `PATH` or set via `PG_CONFIG` env var | | Rust 1.85+ | Install via [rustup.rs](https://rustup.rs) | | C compiler | `clang` or `gcc` | | Linux system packages | `build-essential postgresql-server-dev-N` (replace N with your PG major version) | | `brew install gettext` | **macOS only** — PostgreSQL headers require `libintl.h` | **Steps** **Linux (Debian / Ubuntu)** ```bash # Replace 17 with your PostgreSQL major version (14–18) PG=17 sudo apt-get install -y build-essential postgresql-server-dev-${PG} unzip kham_pg-0.6.0.zip cd kham_pg-0.6.0 PG_CONFIG=/usr/lib/postgresql/${PG}/bin/pg_config make install psql -c "CREATE EXTENSION kham_pg;" ``` **macOS (Homebrew)** ```bash # Replace 17 with your PostgreSQL major version (14–18) PG=17 brew install postgresql@${PG} gettext unzip kham_pg-0.6.0.zip cd kham_pg-0.6.0 PG_CONFIG=$(brew --prefix postgresql@${PG})/bin/pg_config make install psql -c "CREATE EXTENSION kham_pg;" ``` --- ## Token types ```sql SELECT * FROM ts_token_type('kham'); -- 1 thai Thai word -- 2 latin Latin script token -- 3 number Numeric token -- 4 punct Punctuation -- 5 emoji Emoji token -- 6 unknown Unknown / OOV token -- 7 named Named entity token (person, place, organisation) ``` ## Basic usage ```sql -- Inspect how the parser splits Thai text SELECT * FROM ts_parse('kham', 'กินข้าวกับปลา'); -- 1 กินข้าว -- 1 กับ -- 1 ปลา -- Build a tsvector — Thai tokens expand to [word, soundex, rtgs] SELECT to_tsvector('kham', 'กินข้าวกับปลา'); -- '1400':2 '1619':1 '4800':3 'kap':2 'pla':3 'กับ':2 'กินข้าว':1 'ปลา':3 -- Full-text search SELECT title FROM articles WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา'); ``` ## Phonetic search Thai/Named tokens are automatically expanded with their lk82 Soundex code. Near-homophones share a code and match each other without any extra schema work. ```sql -- Match any word with the same lk82 code as ปลา (4800) SELECT title FROM articles WHERE to_tsvector('kham', body) @@ to_tsquery('kham', '4800'); ``` ## RTGS romanization search Thai/Named tokens are also expanded with their RTGS romanized form. Latin-script queries match Thai documents automatically. ```sql SELECT title FROM articles WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'pla'); -- matches documents containing ปลา ``` ## Named entity search ```sql SELECT * FROM ts_parse('kham', 'ทักษิณเดินทางไปกรุงเทพ'); -- 7 ทักษิณ ← Named: Person -- 1 เดิน -- 1 ทาง -- 1 ไป -- 7 กรุงเทพ ← Named: Place ``` ## ts_headline ```sql SELECT ts_headline('kham', body, plainto_tsquery('kham', 'ปลา')) FROM articles; -- …กินปลากับข้าว… -- Custom markers SELECT ts_headline( 'kham', body, plainto_tsquery('kham', 'ปลา'), 'StartSel=<<<, StopSel=>>>' ) FROM articles; ``` ## GIN index ```sql CREATE INDEX articles_fts_idx ON articles USING GIN (to_tsvector('kham', body)); -- Query uses the index automatically SELECT title FROM articles WHERE to_tsvector('kham', body) @@ plainto_tsquery('kham', 'ปลา') ORDER BY ts_rank(to_tsvector('kham', body), plainto_tsquery('kham', 'ปลา')) DESC; ``` ## Upgrade If you are upgrading from a previous version: ```sql ALTER EXTENSION kham_pg UPDATE; ``` ## License MIT OR Apache-2.0 ## Links - Source: - Releases: - Issues: - PGXN: