# BISCUIT: Bitmap Indexed Searching with Combinatorial Union and Intersection Techniques ## PostgreSQL Extension for High-Performance Wildcard Pattern Matching ### Version 1.0.0 ## Table of Contents - [Executive Summary](#executive-summary) - [Technical Overview](#technical-overview) - [Installation and Requirements](#installation-and-requirements) - [Configuration and Setup](#configuration-and-setup) - [Query Interface](#query-interface) - [Pattern Matching Specification](#pattern-matching-specification) - [CRUD Operations and Index Maintenance](#crud-operations-and-index-maintenance) - [Architecture and Implementation](#architecture-and-implementation) - [API Reference](#api-reference) - [Monitoring and Diagnostics](#monitoring-and-diagnostics) - [Optimization Guidelines](#optimization-guidelines) - [Limitations and Constraints](#limitations-and-constraints) - [Troubleshooting](#troubleshooting) - [Examples and Use Cases](#examples-and-use-cases) --- ## Executive Summary BISCUIT is a PostgreSQL C extension that provides specialized bitmap indexing for wildcard pattern matching operations. The extension addresses a fundamental limitation in traditional database indexes: efficient handling of leading wildcard queries (e.g., `LIKE '%pattern'`), which typically require full table scans. ### Key Capabilities - **High-Performance Wildcard Matching**: Achieves O(1) to O(log n) query performance regardless of wildcard placement - **Automatic Index Maintenance**: Transparent CRUD synchronization via PostgreSQL triggers - **Memory-Efficient Storage**: Compressed bitmap structures using Roaring Bitmap algorithm - **Lazy Deletion Strategy**: Deferred cleanup with intelligent batching for optimal throughput - **Incremental Update Optimization**: Minimized reindexing overhead for similar string modifications - **Universal Primary Key Support**: Compatible with any PostgreSQL data type - **Production-Ready Design**: Comprehensive error handling, statistics tracking, and diagnostic capabilities ### Performance Characteristics | Metric | Value | |--------|-------| | Query Latency (Simple Pattern) | O(1) to O(log n) | | Query Latency (Complex Pattern) | O(k × m), k=parts, m=window | | Insert Operation | O(L), L=string length | | Update Operation (Incremental) | O(d), d=differences | | Delete Operation | O(1) (lazy tombstone) | | Memory Overhead | 10-30% of indexed data size | | Supported String Length | 256 characters (indexed portion) | --- ## Technical Overview ### Problem Statement Traditional database indexes (B-tree, GiST, GIN) exhibit suboptimal performance for wildcard pattern matching queries, particularly when wildcards appear at the beginning of patterns. A query such as `SELECT * FROM table WHERE column LIKE '%pattern%'` typically degrades to a sequential scan with O(n) complexity, where n represents the total number of rows. ### Solution Architecture BISCUIT implements a multi-dimensional bitmap index structure that decomposes strings into positional character mappings. Each character at each position is tracked using compressed bitmaps (Roaring Bitmaps), enabling constant-time intersection operations for pattern matching. #### Core Index Structures 1. **Positional Character Index**: Maps each character to a sorted array of position-bitmap pairs - Forward positions: `{0, 1, 2, ..., n}` from string start - Reverse positions: `{-1, -2, -3, ..., -n}` from string end 2. **Character Existence Cache**: Union of all positional bitmaps per character for O(1) containment queries 3. **Length Index**: Bitmap array indexed by string length for length-constrained queries 4. **Primary Key Hash Table**: O(1) lookup structure mapping primary keys to internal indices 5. **Lazy Deletion Layer**: Tombstone bitmap tracking deleted records with deferred cleanup #### Algorithmic Approach **Pattern Parsing**: Input patterns are decomposed into literal segments separated by wildcards (`%`). Each segment's position constraints (prefix, suffix, or floating) are analyzed. **Query Optimization**: Pattern structure determines execution strategy: - Single-character queries: Direct bitmap lookup - Prefix patterns (`abc%`): Position-0 intersection with length filter - Suffix patterns (`%xyz`): Negative-position intersection with length filter - Containment patterns (`%abc%`): Character cache intersection - Multi-part patterns: Windowed position matching with recursive refinement **Bitmap Operations**: Compressed bitmaps enable efficient set operations: - Intersection: AND operation for constraint combination - Union: OR operation for alternative matches - Difference: ANDNOT operation for tombstone filtering ### Data Flow ``` Query Pattern → Parse → Optimize → Bitmap Ops → Filter Tombstones → Result Set ↓ Insert/Update/Delete → Hash Lookup → Bitmap Update → Tombstone Management ``` --- ## Installation and Requirements ### System Requirements - **PostgreSQL Version**: 11.0 or higher - **Operating System**: Linux, macOS, Windows (with MinGW) - **Architecture**: x86_64, ARM64 - **Memory**: Minimum 512MB available RAM per index - **Compiler**: GCC 4.8+, Clang 3.9+, or MSVC 2017+ ### Optional Dependencies - **CRoaring Library**: For optimized Roaring Bitmap operations (recommended) - If unavailable, extension falls back to built-in bitmap implementation - Installation: `apt-get install libroaring-dev` (Debian/Ubuntu) ### Installation Procedure #### Method 1: PostgreSQL Extension System ```sql CREATE EXTENSION biscuit; ``` This command must be executed by a superuser or a user with `CREATE EXTENSION` privileges. #### Method 2: Manual Compilation ```bash # Clone repository git clone https://github.com/crystallinecore/biscuit.git cd biscuit # Compile make # Install (requires superuser privileges) sudo make install # Load extension psql -d your_database -c "CREATE EXTENSION biscuit;" ``` #### Verification ```sql SELECT biscuit_version(); -- Expected output: 1.0.0-Biscuit ``` --- ## Configuration and Setup ### Initialization Options #### Option 1: Automated Setup (Recommended) The `biscuit_setup()` function provides a complete initialization workflow: ```sql SELECT biscuit_setup( p_table_name TEXT, p_column_name TEXT, p_pk_column_name TEXT DEFAULT 'id' ); ``` **Parameters:** - `p_table_name`: Target table name (required) - `p_column_name`: Column to index (required, must be text-compatible) - `p_pk_column_name`: Primary key column name (default: 'id') **Operations Performed:** 1. Validates table and column existence 2. Constructs bitmap index structure 3. Generates strongly-typed wrapper functions 4. Installs AFTER triggers for INSERT, UPDATE, DELETE operations 5. Returns detailed confirmation message **Example:** ```sql SELECT biscuit_setup('customer_records', 'email_address', 'customer_id'); ``` **Output:** ``` Biscuit index built successfully. Created biscuit_match() and biscuit_match_rows() functions for table: customer_records Columns: customer_id integer, email_address text Successfully created trigger on table: customer_records The index will now automatically update on INSERT, UPDATE, and DELETE operations. ``` #### Option 2: Manual Configuration (Advanced) For fine-grained control, configure components individually: **Step 1: Index Construction** ```sql SELECT biscuit_build_index( table_name TEXT, column_name TEXT, pk_column_name TEXT DEFAULT 'id' ); ``` **Step 2: Function Generation** ```sql SELECT biscuit_create_match_function(); ``` This creates two query functions: - `biscuit_match(pattern TEXT)`: Returns `SETOF table_type` - `biscuit_match_rows(pattern TEXT)`: Returns `TABLE(pk type, value TEXT)` **Step 3: Trigger Activation** ```sql SELECT biscuit_enable_triggers(); ``` Installs trigger named `biscuit_auto_update` on the indexed table. ### Primary Key Considerations BISCUIT supports heterogeneous primary key types through automatic type conversion: | PostgreSQL Type | Internal Representation | Notes | |----------------|------------------------|-------| | INT2, INT4, INT8 | String (optimized) | Direct conversion | | TEXT, VARCHAR, CHAR | String (native) | No conversion overhead | | UUID | Canonical string format | Standardized representation | | Other types | Type output function | Uses PostgreSQL type system | **Performance Note**: Integer primary keys receive optimized conversion paths. UUID and text types have minimal overhead. Custom types utilize PostgreSQL's type output functions. ### Index Scope and Limitations - **Single Column Per Index**: Each index covers one text column - **Multiple Indexes Supported**: Create separate indexes for different columns - **Table Scope**: One active BISCUIT index per table - **Character Limit**: Positions beyond 256 characters are not indexed (but stored) **Example: Multiple Column Indexing** ```sql -- Index email column SELECT biscuit_setup('users', 'email', 'id'); -- To index another column, use a different approach -- (current implementation: one index per table) ``` --- ## Query Interface ### Function Signatures and Return Types #### Primary Query Functions **1. Full Tuple Retrieval (Strongly-Typed)** ```sql biscuit_match(pattern TEXT) RETURNS SETOF table_type ``` Returns complete rows from the indexed table matching the specified pattern. Return type matches the exact schema of the indexed table. **Characteristics:** - Type-safe: Returns actual table type - No column specification required - Supports standard SQL operations (WHERE, ORDER BY, JOIN) - Optimal for production queries **Example:** ```sql SELECT * FROM biscuit_match('%@example.com%'); SELECT id, email FROM biscuit_match('user%') WHERE created_at > '2024-01-01'; SELECT COUNT(*) FROM biscuit_match('%gmail%'); ``` **2. Generic Tuple Retrieval** ```sql biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk primary_key_type, value TEXT) ``` Returns primary key and indexed column value pairs. **Use Cases:** - When only key-value pairs are needed - Intermediate processing before joining - Reduced data transfer overhead **Example:** ```sql SELECT pk, value FROM biscuit_match_rows('%pattern%'); -- Join back to original table if needed SELECT t.* FROM original_table t JOIN biscuit_match_rows('%pattern%') b ON t.id = b.pk; ``` **3. Count Aggregation** ```sql biscuit_match_count(pattern TEXT) RETURNS INTEGER ``` Returns the count of matching records without materializing result set. **Performance:** This is the fastest query method, performing only bitmap operations without tuple retrieval. **Example:** ```sql SELECT biscuit_match_count('%search_term%'); ``` **4. Key-Value Extraction** ```sql biscuit_match_keys(pattern TEXT) RETURNS TABLE(pk TEXT, value TEXT) ``` Low-level function returning primary keys and indexed values as text. Used internally by higher-level functions. **Note:** Primary keys are returned as text regardless of original type. ### Query Patterns and SQL Integration #### Standard SQL Operations All `biscuit_match()` results support complete SQL operations: ```sql -- Filtering SELECT * FROM biscuit_match('%pattern%') WHERE status = 'active' AND created_at > NOW() - INTERVAL '30 days'; -- Ordering SELECT * FROM biscuit_match('A%') ORDER BY name DESC, created_at ASC LIMIT 100; -- Aggregation SELECT category, COUNT(*), AVG(price) FROM biscuit_match('%widget%') GROUP BY category HAVING COUNT(*) > 10; -- Joins SELECT p.*, c.category_name FROM biscuit_match('%special%') p JOIN categories c ON p.category_id = c.id; -- Subqueries SELECT * FROM products WHERE id IN ( SELECT id FROM biscuit_match('%premium%') ); -- Common Table Expressions WITH matched_products AS ( SELECT * FROM biscuit_match('%electronics%') ) SELECT m.*, s.stock_level FROM matched_products m JOIN stock s ON m.id = s.product_id; ``` #### Performance Considerations **Optimal Patterns:** - Use bitmap index for pattern matching - Apply additional filters after pattern match - Leverage PostgreSQL's query planner for joins **Query Plan Example:** ```sql EXPLAIN ANALYZE SELECT * FROM biscuit_match('%search%') WHERE status = 'active'; -- Result: Bitmap operations complete in milliseconds -- Additional filter applied to result set ``` --- ## Pattern Matching Specification ### Wildcard Operators BISCUIT implements SQL LIKE-style pattern matching with two wildcard operators: | Operator | Semantics | Character Match | Examples | |----------|-----------|----------------|----------| | `%` | Zero or more characters | `[0, ∞)` | `'a%'` → "a", "ab", "abcdef" | | `_` | Exactly one character | `1` | `'a_c'` → "abc", "axc" | ### Pattern Categories and Optimization Paths #### 1. Exact Match Patterns **Pattern Structure:** No wildcards **Example:** `'exact_string'` **Optimization:** Length bitmap intersection with position-0 character match **Complexity:** O(1) ```sql SELECT biscuit_match_count('hello'); -- Matches only "hello" exactly ``` #### 2. Prefix Patterns **Pattern Structure:** Literal prefix followed by `%` **Example:** `'prefix%'` **Optimization:** Position-0 character sequence with length≥prefix_length filter **Complexity:** O(log n) ```sql SELECT * FROM biscuit_match('user_'); -- Matches: "user_admin", "user_john", etc. ``` #### 3. Suffix Patterns **Pattern Structure:** `%` followed by literal suffix **Example:** `'%suffix'` **Optimization:** Negative-position character sequence with length≥suffix_length filter **Complexity:** O(log n) ```sql SELECT * FROM biscuit_match('%@gmail.com'); -- Matches: "john@gmail.com", "admin@gmail.com" ``` #### 4. Containment Patterns **Pattern Structure:** `%literal%` **Example:** `'%search%'` **Optimization:** Character cache intersection **Complexity:** O(1) to O(m), m=pattern_length ```sql SELECT * FROM biscuit_match('%error%'); -- Matches any string containing "error" ``` #### 5. Complex Multi-Part Patterns **Pattern Structure:** Multiple literal segments with wildcards **Examples:** - `'start%middle%end'` - `'%part1%part2%'` - `'a%b%c%d'` **Optimization:** Windowed position matching with recursive refinement **Complexity:** O(k × m), k=parts, m=search_window ```sql SELECT * FROM biscuit_match('user_%@%.com'); -- Matches: "user_john@example.com", "user_admin@test.com" ``` #### 6. Single-Character Wildcard Patterns **Pattern Structure:** Underscores with optional literals **Examples:** - `'___'` (exactly 3 characters) - `'a_c'` (a, any char, c) - `'___%'` (at least 3 characters) **Optimization:** Length bitmap or position bitmap intersection **Complexity:** O(1) to O(log n) ```sql SELECT * FROM biscuit_match('___'); -- Matches: "abc", "xyz", "123" (any 3-character string) SELECT * FROM biscuit_match('test_'); -- Matches: "test1", "testA", "test_" (test + any char) ``` ### Pattern Syntax Reference | Pattern | Description | Matches | Does Not Match | |---------|-------------|---------|----------------| | `'abc'` | Exact string | "abc" | "abcd", "xabc", "ABC" | | `'a%'` | Starts with 'a' | "a", "abc", "a123" | "ba", "ABC" | | `'%z'` | Ends with 'z' | "z", "xyz", "123z" | "za", "Z" | | `'%abc%'` | Contains 'abc' | "abc", "xabcx", "123abc" | "ab", "ABC" | | `'a%z'` | Starts 'a', ends 'z' | "az", "abcz", "a123xyz" | "za", "abc" | | `'_'` | Single character | "a", "1", "_" | "", "ab" | | `'___'` | Exactly 3 chars | "abc", "123", "___" | "ab", "abcd" | | `'a_c'` | 'a' + any + 'c' | "abc", "axc", "a1c" | "ac", "abbc" | | `'%a%b%'` | 'a' before 'b' | "ab", "axb", "xaby" | "ba", "bxa" | | `'%'` | Any string | All non-NULL | NULL | | `''` | Empty string | "" | Any non-empty | ### Special Cases #### Empty Pattern ```sql SELECT biscuit_match_count(''); -- Returns count of empty strings in indexed column ``` #### Match All ```sql SELECT biscuit_match_count('%'); -- Returns count of all non-deleted records ``` #### Case Sensitivity BISCUIT patterns are **case-sensitive** by default. For case-insensitive matching: ```sql -- Option 1: Index lowercase values CREATE TABLE users_normalized ( id SERIAL PRIMARY KEY, email TEXT, email_lower TEXT GENERATED ALWAYS AS (LOWER(email)) STORED ); SELECT biscuit_setup('users_normalized', 'email_lower', 'id'); SELECT * FROM biscuit_match(LOWER('%SEARCH%')); -- Option 2: Use expression index (future enhancement) ``` #### NULL Handling - NULL values are never indexed - NULL column values are treated as empty strings during indexing - Pattern matching never returns NULL values ```sql -- NULL values are skipped INSERT INTO table (id, text_col) VALUES (1, NULL); SELECT biscuit_match_count('%'); -- Does not include row 1 ``` --- ## CRUD Operations and Index Maintenance ### Automatic Synchronization When triggers are enabled via `biscuit_setup()` or `biscuit_enable_triggers()`, all DML operations automatically maintain index consistency with zero application code changes. ### INSERT Operations **Behavior:** New records are immediately indexed and available for queries. **Algorithm:** 1. Extract primary key and indexed column value from new tuple 2. Convert primary key to canonical string representation 3. Allocate index slot (reuse free slot if available, otherwise expand) 4. Generate positional bitmaps for each character 5. Update length index and character cache 6. Register primary key in hash table **Complexity:** O(L), where L = string length (up to 256 characters) **Example:** ```sql INSERT INTO products (id, name, description) VALUES (10001, 'Premium Widget', 'High-quality widget'); -- Index automatically updated SELECT * FROM biscuit_match('%Widget%'); -- Returns the newly inserted row immediately ``` **Bulk Insert Performance:** ```sql -- For large bulk inserts, consider disabling triggers SELECT biscuit_disable_triggers(); INSERT INTO products (id, name) SELECT generate_series(1, 1000000), 'Product ' || generate_series(1, 1000000); -- Rebuild index after bulk operation SELECT biscuit_build_index('products', 'name', 'id'); SELECT biscuit_enable_triggers(); ``` ### UPDATE Operations **Behavior:** Updates are intelligently optimized based on string similarity. #### Incremental Update Path **Criteria for Incremental Update:** - Same string length (before and after) - Minimum length ≥ 3 characters - Character differences < 20% of string length - Maximum 3 changed characters **Algorithm:** 1. Compute character-level diff between old and new values 2. Remove old characters from affected position bitmaps 3. Add new characters to affected position bitmaps 4. Update character cache for changed characters only **Complexity:** O(d), where d = number of different characters **Example (Incremental):** ```sql -- Original: "john.doe@example.com" UPDATE users SET email = 'john.doe@example.org' WHERE id = 123; -- Only 3 characters changed (com→org), uses incremental update ``` #### Full Reindex Path **Criteria:** - Different string lengths - Changes exceed incremental update threshold - String < 3 characters **Algorithm:** 1. Remove all position bitmaps for old value 2. Remove from length index and character cache 3. Re-index completely with new value 4. Update length index and character cache **Complexity:** O(L), where L = new string length **Example (Full Reindex):** ```sql -- Original: "short" UPDATE users SET email = 'much.longer.email@example.com' WHERE id = 123; -- Different lengths, requires full reindex ``` **Statistics Tracking:** ```sql SELECT biscuit_index_status(); -- Shows ratio of incremental vs full updates -- Example: "Updates: 1000 (incr: 750, 75.0%)" ``` ### DELETE Operations **Behavior:** Implements lazy deletion with tombstone bitmaps for optimal performance. **Algorithm (O(1) Lazy Deletion):** 1. Hash table lookup to find internal index (O(1)) 2. Add index to tombstone bitmap (O(1)) 3. Store index in deletion queue 4. Remove primary key from hash table 5. Add slot to free list for future reuse 6. **Defer** removal from position/length bitmaps **Complexity:** O(1) - actual cleanup deferred **Example:** ```sql DELETE FROM products WHERE id = 123; -- Returns immediately, cleanup happens later ``` #### Tombstone Cleanup Process **Automatic Cleanup Triggers:** - Tombstone count ≥ threshold (default: 1000) - Manual invocation: `SELECT biscuit_cleanup();` **Cleanup Algorithm:** 1. Batch remove tombstoned indices from all bitmaps 2. Free associated string memory 3. Clear tombstone bitmap 4. Reset deletion queue **Cleanup Statistics:** ```sql SELECT biscuit_index_status(); -- Pending tombstones: 856 (85.6% of deletes) -- Total cleanups: 12 -- Items cleaned: 11442 ``` **Performance Characteristics:** | Scenario | Delete Latency | Query Overhead | Memory Overhead | |----------|---------------|----------------|-----------------| | No tombstones | O(1) | None | Minimal | | < 1000 tombstones | O(1) | O(t) filter | ~t × 32 bytes | | ≥ 1000 tombstones | O(1) + cleanup | Minimal | Minimal | **Note:** Query operations automatically filter tombstones, so deleted records never appear in results despite deferred cleanup. ### Upsert Operations (INSERT ... ON CONFLICT) **Behavior:** Automatically handled through INSERT and UPDATE trigger paths. **Example:** ```sql INSERT INTO products (id, name) VALUES (123, 'New Product') ON CONFLICT (id) DO UPDATE SET name = EXCLUDED.name; -- If row 123 exists: triggers UPDATE path -- If row 123 doesn't exist: triggers INSERT path ``` ### Transaction Semantics **ACID Compliance:** - **Atomicity**: Index updates commit/rollback with transaction - **Consistency**: Index always reflects committed table state - **Isolation**: Follows PostgreSQL's MVCC isolation levels - **Durability**: Index persists across server restarts (rebuild required) **Example:** ```sql BEGIN; INSERT INTO products (id, name) VALUES (1, 'Product A'); SELECT * FROM biscuit_match('%Product A%'); -- Visible in transaction ROLLBACK; SELECT * FROM biscuit_match('%Product A%'); -- Not visible, index reverted ``` **Current Limitation:** Index resides in shared memory and must be rebuilt after server restart. ### Trigger Management #### Enable Triggers ```sql SELECT biscuit_enable_triggers(); ``` Creates trigger: `biscuit_auto_update` as `AFTER INSERT OR UPDATE OR DELETE FOR EACH ROW`. #### Disable Triggers ```sql SELECT biscuit_disable_triggers(); ``` Removes triggers but preserves index structure. Use for: - Bulk data operations - Table truncation - Data migration - Testing scenarios #### Check Trigger Status ```sql SELECT biscuit_index_status(); -- Shows whether triggers are active ``` --- ## Architecture and Implementation ### System Components ``` ┌─────────────────────────────────────────────────────────────┐ │ PostgreSQL Backend │ ├─────────────────────────────────────────────────────────────┤ │ Query Interface │ Trigger System │ Memory Management │ └─────────────────────────────────────────────────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ │ BISCUIT Extension Layer │ ├──────────────────────┬──────────────────┬───────────────────┤ │ Pattern Parser │ Query Executor │ CRUD Handler │ │ - Wildcard analysis │ - Bitmap ops │ - Insert/Update │ │ - Optimization │ - Result filter │ - Delete/Cleanup │ └──────────────────────┴──────────────────┴───────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ │ Index Data Structures │ ├──────────────┬──────────────┬──────────────┬────────────────┤ │ Positional │ Character │ Length │ PK Hash │ │ Bitmaps │ Cache │ Index │ Table │ │ (256 pos) │ (256 chars) │ (L bitmaps) │ (O(1)) │ └──────────────┴──────────────┴──────────────┴────────────────┘ ↓ ┌─────────────────────────────────────────────────────────────┐ │ Roaring Bitmap Implementation │ ├──────────────────────────────────────────────────────────────┤ │ - Compressed bitmap storage │ │ - Fast set operations (AND, OR, ANDNOT) │ │ - Memory-efficient sparse representation │ └──────────────────────────────────────────────────────────────┘ ``` ### Data Structure Details #### 1. Positional Character Index **Purpose:** Track character occurrences at specific positions **Structure:** ```c typedef struct { int pos; // Position in string RoaringBitmap *bitmap; // Record IDs containing char at pos } PosEntry; typedef struct { PosEntry *entries; // Sorted array of position entries int count; // Number of positions tracked int capacity; // Allocated capacity } CharIndex; CharIndex pos_idx[256]; // Forward positions (0, 1, 2, ...) CharIndex neg_idx[256]; // Reverse positions (-1, -2, -3, ...) ``` **Example:** For strings ["cat", "car", "dog"]: ``` pos_idx['c'][0] = {0, 1} // 'c' at position 0 in records 0, 1 pos_idx['a'][1] = {0, 1} // 'a' at position 1 in records 0, 1 pos_idx['t'][2] = {0} // 't' at position 2 in record 0 neg_idx['t'][-1] = {0} // 't' at last position in record 0 ``` **Access Complexity:** Binary search over positions: O(log P), P ≤ 256 #### 2. Character Existence Cache **Purpose:** O(1) lookup for "contains character" queries **Structure:** ```c RoaringBitmap *char_cache[256]; // Union of all position bitmaps per char ``` **Construction:** ``` char_cache['a'] = pos_idx['a'][0] ∪ pos_idx['a'][1] ∪ ... ∪ pos_idx['a'][n] ``` **Use Case:** Pattern `%a%` directly queries `char_cache['a']` #### 3. Length Index **Purpose:** Filter records by string length **Structure:** ```c typedef struct { RoaringBitmap **length_bitmaps; // Array indexed by length int max_length; // Maximum observed length } LengthIndex; ``` **Example:** ``` length_bitmaps[3] = {0, 1, 2} // Records 0, 1, 2 have length 3 length_bitmaps[5] = {3, 4} // Records 3, 4 have length 5 ``` **Use Case:** Pattern `___` queries `length_bitmaps[3]` directly #### 4. Primary Key Hash Table **Purpose:** O(1) mapping from primary key to internal index **Implementation:** ```c typedef struct { char pk_str[NAMEDATALEN]; // Fixed-size key (max 64 bytes) uint32_t idx; // Internal record index } PKMapEntry; HTAB *pk_to_index; // PostgreSQL hash table ``` **Operations:** - Insert: `hash_search(HASH_ENTER)` - O(1) - Lookup: `hash_search(HASH_FIND)` - O(1) - Delete: `hash_search(HASH_REMOVE)` - O(1) **Memory:** ~96 bytes per entry (key + value + overhead) #### 5. Lazy Deletion System **Purpose:** Defer expensive cleanup operations **Structure:** ```c typedef struct { RoaringBitmap *tombstones; // Bitmap of deleted indices uint32_t *deleted_indices; // Queue for cleanup int tombstone_count; // Current tombstone count int tombstone_threshold; // Cleanup trigger threshold int64 total_cleanups; // Lifetime cleanup count int64 items_cleaned; // Total items cleaned } LazyDeletion; ``` **Workflow:** 1. **Delete**: Mark bit in tombstone bitmap (O(1)) 2. **Query**: Filter results using `ANDNOT tombstones` (O(log n)) 3. **Cleanup**: Batch remove when threshold reached (O(n × 256)) **Adaptive Threshold:** Adjusts based on query/delete ratio ### Memory Management #### Allocation Strategy **Memory Context:** Dedicated `AllocSetContext` for index structures - Lifetime: Until explicit index rebuild or server restart - Isolation: Independent of query contexts - Cleanup: Automatic on context deletion **Size Estimation:** ``` Total Memory = Base + Positional + Cache + Length + Hash + Strings Base = ~1 MB (structs and metadata) Positional = ~256 chars × P positions × 64 bytes/bitmap ≈ 16P KB Cache = ~256 chars × 64 bytes/bitmap ≈ 16 KB Length = L × 64 bytes/bitmap (L = max length) Hash = N × 96 bytes (N = record count) Strings = N × avg_length × 2 (data + PK) Example (1M records, avg 50 chars, max 100 positions): = 1 MB + 1.6 MB + 16 KB + 6.4 KB + 96 MB + 100 MB ≈ 200 MB ``` **Compression Benefits:** Roaring Bitmaps achieve 10-50× compression for sparse data #### Garbage Collection **Automatic:** - Tombstone cleanup when threshold reached - Free list management for slot reuse **Manual:** ```sql SELECT biscuit_cleanup(); -- Force tombstone cleanup ``` **Memory Leak Prevention:** - All allocations tracked in index context - Single deallocation point on rebuild - No external references retained ### Concurrency Model **Index Access:** Read-optimized, single-writer model - Queries: Concurrent read access (no locking) - Modifications: Serialized through PostgreSQL trigger system - Consistency: PostgreSQL's MVCC handles transaction isolation **Limitation:** High-concurrency write workloads may experience contention at trigger level (PostgreSQL limitation, not BISCUIT-specific) **Scalability:** Read queries scale linearly across cores --- ## Advanced Features ### Incremental Updates For same-length strings with <20% character changes, BISCUIT uses incremental updates: ```sql -- Original: "hello world" -- Updated: "hello wurld" -- Only updates changed positions (saves ~80% of work) ``` ### Lazy Deletion with Smart Cleanup Deletes are instant (O(1)) using tombstone bitmaps. Cleanup triggers automatically when: - Tombstone count reaches threshold (default: 1000) - Query/Delete ratio suggests overhead - Manual trigger: `SELECT biscuit_cleanup();` ### Pattern Optimization BISCUIT automatically detects and optimizes: - Wildcard-only patterns: `%%%`, `___` - Single-character positioned: `%a%`, `a%`, `%a` - Prefix patterns: `abc%` - Suffix patterns: `%xyz` - Exact length: `___` (3 chars) --- ## API Reference ### Index Management Functions #### biscuit_setup() **Complete initialization function combining index build, function creation, and trigger setup.** ```sql biscuit_setup( p_table_name TEXT, p_column_name TEXT, p_pk_column_name TEXT DEFAULT 'id' ) RETURNS TEXT ``` **Parameters:** - `p_table_name`: Name of table to index (required) - `p_column_name`: Column containing text data to index (required) - `p_pk_column_name`: Primary key column name (default: 'id') **Returns:** Status message with setup details **Example:** ```sql SELECT biscuit_setup('customers', 'email', 'customer_id'); ``` **Output:** ``` Biscuit index built successfully. Created biscuit_match() and biscuit_match_rows() functions for table: customers Columns: customer_id integer, email text Successfully created trigger on table: customers The index will now automatically update on INSERT, UPDATE, and DELETE operations. ``` **Errors:** - Primary key column not found - Indexed column not found - Insufficient permissions - Table does not exist --- #### biscuit_build_index() **Constructs bitmap index structure from table data.** ```sql biscuit_build_index( table_name TEXT, column_name TEXT, pk_column_name TEXT DEFAULT 'id' ) RETURNS BOOLEAN ``` **Parameters:** - `table_name`: Target table name - `column_name`: Column to index - `pk_column_name`: Primary key column (default: 'id') **Returns:** TRUE on success **Behavior:** - Drops existing index if present - Scans entire table sequentially - Builds all bitmap structures - Allocates shared memory context **Performance:** O(n × L), n=rows, L=avg string length **Example:** ```sql SELECT biscuit_build_index('products', 'name', 'id'); ``` --- #### biscuit_create_match_function() **Generates strongly-typed wrapper functions for queries.** ```sql biscuit_create_match_function() RETURNS TEXT ``` **Returns:** Confirmation message with function signatures **Generated Functions:** 1. `biscuit_match(pattern TEXT) RETURNS SETOF table_type` 2. `biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk type, value TEXT)` **Prerequisite:** Index must be built first **Example:** ```sql SELECT biscuit_create_match_function(); ``` **Output:** ``` Created biscuit_match() and biscuit_match_rows() functions for table: products Columns: id integer, name text, category text, price numeric ``` --- #### biscuit_enable_triggers() **Activates automatic index maintenance triggers.** ```sql biscuit_enable_triggers() RETURNS TEXT ``` **Returns:** Status message **Behavior:** - Creates AFTER INSERT OR UPDATE OR DELETE trigger - Trigger name: `biscuit_auto_update` - Fires FOR EACH ROW - Calls `biscuit_trigger()` function **Example:** ```sql SELECT biscuit_enable_triggers(); ``` --- #### biscuit_disable_triggers() **Deactivates automatic index maintenance.** ```sql biscuit_disable_triggers() RETURNS TEXT ``` **Returns:** Status message **Use Cases:** - Bulk data operations - Table maintenance - Performance testing - Data migration **Example:** ```sql SELECT biscuit_disable_triggers(); -- Perform bulk operations SELECT biscuit_enable_triggers(); ``` --- ### Query Functions #### biscuit_match() **Primary query interface returning complete table rows.** ```sql biscuit_match(pattern TEXT) RETURNS SETOF table_type ``` **Parameters:** - `pattern`: Wildcard pattern (SQL LIKE syntax) **Returns:** Complete rows matching pattern **Characteristics:** - Strongly-typed return - Full SQL compatibility - Automatic tombstone filtering - Optimal for production use **Example:** ```sql SELECT * FROM biscuit_match('%@example.com'); SELECT id, name FROM biscuit_match('Product%') WHERE price > 100; ``` --- #### biscuit_match_rows() **Alternative query interface returning key-value pairs.** ```sql biscuit_match_rows(pattern TEXT) RETURNS TABLE(pk primary_key_type, value TEXT) ``` **Parameters:** - `pattern`: Wildcard pattern **Returns:** Table of (primary_key, indexed_value) pairs **Example:** ```sql SELECT * FROM biscuit_match_rows('%error%'); ``` --- #### biscuit_match_count() **Efficient count-only query.** ```sql biscuit_match_count(pattern TEXT) RETURNS INTEGER ``` **Parameters:** - `pattern`: Wildcard pattern **Returns:** Count of matching records **Performance:** Fastest query method (no tuple materialization) **Example:** ```sql SELECT biscuit_match_count('%gmail.com%'); ``` --- #### biscuit_match_keys() **Low-level interface returning primary keys and values as text.** ```sql biscuit_match_keys(pattern TEXT) RETURNS TABLE(pk TEXT, value TEXT) ``` **Parameters:** - `pattern`: Wildcard pattern **Returns:** Table of (pk_text, value_text) pairs **Note:** Used internally by higher-level functions **Example:** ```sql SELECT pk::INTEGER, value FROM biscuit_match_keys('%pattern%'); ``` --- ### Status and Diagnostic Functions #### biscuit_index_status() **Comprehensive index status report.** ```sql biscuit_index_status() RETURNS TEXT ``` **Returns:** Formatted status report including: - Table and column information - Record counts (active, total, free, tombstoned) - Memory usage - CRUD statistics - Lazy deletion metrics - Optimization indicators **Example:** ```sql SELECT biscuit_index_status(); ``` **Sample Output:** ``` ======================================== Biscuit Index v3 - FIXED ======================================== Table: products Column: name Primary Key: id Active Records: 998567 Total Slots: 1000000 Free Slots: 1433 Tombstoned Slots: 856 Max length: 128 Memory: 18.3 MB ---------------------------------------- CRUD Statistics: Inserts: 1000000 Deletes: 1433 Updates: 125678 (incr: 89234, 71.0%) Queries: 45892 ---------------------------------------- Lazy Deletion Status: Pending tombstones: 856 (59.7% of deletes) Cleanup threshold: 1000 Total cleanups: 1 Items cleaned: 577 Query/Delete ratio: 32.02 ---------------------------------------- ``` --- #### biscuit_get_active_count() **Returns active record count.** ```sql biscuit_get_active_count() RETURNS INTEGER ``` **Returns:** Count of non-deleted records **Example:** ```sql SELECT biscuit_get_active_count(); -- Output: 998567 ``` --- #### biscuit_get_free_slots() **Returns available slot count for reuse.** ```sql biscuit_get_free_slots() RETURNS INTEGER ``` **Returns:** Number of slots available for new inserts **Example:** ```sql SELECT biscuit_get_free_slots(); -- Output: 1433 ``` --- #### biscuit_get_tombstone_count() **Returns pending tombstone count.** ```sql biscuit_get_tombstone_count() RETURNS INTEGER ``` **Returns:** Number of deleted records awaiting cleanup **Use Case:** Monitor cleanup necessity **Example:** ```sql SELECT biscuit_get_tombstone_count(); -- Output: 856 ``` --- #### biscuit_version() **Returns extension version.** ```sql biscuit_version() RETURNS TEXT ``` **Returns:** Version string **Example:** ```sql SELECT biscuit_version(); -- Output: 1.0.0-Biscuit ``` --- ### Maintenance Functions #### biscuit_cleanup() **Manual tombstone cleanup.** ```sql biscuit_cleanup() RETURNS TEXT ``` **Returns:** Cleanup summary **Behavior:** - Removes tombstoned indices from all bitmaps - Frees associated memory - Resets tombstone counter - Updates cleanup statistics **Performance:** O(n × 256), n=tombstone_count **Example:** ```sql SELECT biscuit_cleanup(); ``` **Output:** ``` Tombstone cleanup complete: Cleaned: 856 tombstones Remaining: 0 Total cleanups: 2 ``` **When to Use:** - Before performance-critical operations - After bulk deletions - When tombstone count exceeds threshold - During maintenance windows --- ### Internal Functions (Advanced) #### biscuit_trigger() **Internal trigger function (do not call directly).** ```sql biscuit_trigger() RETURNS TRIGGER ``` **Purpose:** Maintains index consistency on DML operations **Called By:** PostgreSQL trigger system **Handles:** - INSERT: Adds new records to index - UPDATE: Applies incremental or full updates - DELETE: Marks tombstones --- #### biscuit_match_tuples() **Generic tuple retrieval (legacy).** ```sql biscuit_match_tuples(pattern TEXT) RETURNS SETOF RECORD ``` **Note:** Use `biscuit_match()` instead for type-safe queries --- ### Function Error Handling All functions implement comprehensive error handling: **Common Errors:** - `ERROR: Index not built` - Call `biscuit_build_index()` first - `ERROR: NULL primary key` - Primary key cannot be NULL - `ERROR: Column not found` - Verify table/column names - `ERROR: Insufficient permissions` - Requires table owner or superuser **Example Error:** ```sql SELECT * FROM biscuit_match('%pattern%'); -- ERROR: Index not built. Call biscuit_build_index() first. ``` --- ### Return Type Specifications | Function | Return Type | Nullable | Notes | |----------|-------------|----------|-------| | biscuit_match() | SETOF table | No | Type matches indexed table | | biscuit_match_rows() | TABLE(pk, TEXT) | No | PK type preserved | | biscuit_match_count() | INTEGER | No | Always returns count ≥ 0 | | biscuit_match_keys() | TABLE(TEXT, TEXT) | No | Both fields as text | | biscuit_index_status() | TEXT | No | Always returns report | | biscuit_get_*_count() | INTEGER | No | Returns 0 if no index | | biscuit_version() | TEXT | No | Version string | --- ## Monitoring and Diagnostics ### Key Performance Indicators #### 1. Index Health Metrics **Active Record Count** ```sql SELECT biscuit_get_active_count(); ``` **Interpretation:** - Should match table row count (excluding deleted) - Discrepancy indicates synchronization issues - Monitor trend for capacity planning **Free Slot Count** ```sql SELECT biscuit_get_free_slots(); ``` **Interpretation:** - Available slots for insert reuse - High count after bulk deletes is normal - Prevents capacity expansion overhead **Tombstone Count** ```sql SELECT biscuit_get_tombstone_count(); ``` **Interpretation:** - Pending deleted records - **Critical Threshold:** >1000 (default cleanup trigger) - **Warning Threshold:** >5000 (performance degradation) - **Action Required:** >10000 (manual cleanup recommended) #### 2. Operational Statistics **Comprehensive Status** ```sql SELECT biscuit_index_status(); ``` **Key Sections:** **A. Capacity Metrics** ``` Active Records: 998567 ← Currently queryable records Total Slots: 1000000 ← Allocated capacity Free Slots: 1433 ← Available for reuse Tombstoned Slots: 856 ← Pending cleanup ``` **B. CRUD Performance** ``` Inserts: 1000000 Deletes: 1433 Updates: 125678 (incr: 89234, 71.0%) ← Incremental update rate Queries: 45892 ``` **Incremental Update Rate:** - **>70%**: Excellent (similar updates) - **50-70%**: Good (moderate similarity) - **<50%**: Normal (diverse updates) - **<20%**: Review update patterns **C. Lazy Deletion Health** ``` Pending tombstones: 856 (59.7% of deletes) Cleanup threshold: 1000 Total cleanups: 1 Items cleaned: 577 Query/Delete ratio: 32.02 ``` **Query/Delete Ratio:** - **>20**: Healthy (reads dominate, lazy deletion beneficial) - **10-20**: Balanced (cleanup overhead acceptable) - **<10**: Write-heavy (consider more aggressive cleanup) #### 3. Memory Utilization ```sql SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+) MB'))[1]::NUMERIC AS memory_mb; ``` **Guidelines:** ``` Memory Growth = f(record_count, avg_length, char_diversity) Expected Ranges: 100K records: 2-5 MB 1M records: 15-25 MB 10M records: 150-250 MB 100M records: 1.5-2.5 GB Anomalies: >30% of expected: Check for memory leaks or fragmentation <50% of expected: Excellent compression (sparse data) ``` ### Monitoring Queries #### Real-Time Health Check ```sql -- Comprehensive health dashboard WITH metrics AS ( SELECT biscuit_get_active_count() AS active, biscuit_get_free_slots() AS free, biscuit_get_tombstone_count() AS tombstones, (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER AS capacity, (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS memory_mb ) SELECT active, free, tombstones, capacity, memory_mb, ROUND(100.0 * active / capacity, 2) AS utilization_pct, ROUND(100.0 * tombstones / NULLIF(capacity - active, 0), 2) AS tombstone_ratio, CASE WHEN tombstones > 10000 THEN 'CRITICAL - Cleanup Required' WHEN tombstones > 5000 THEN 'WARNING - Monitor Closely' WHEN tombstones > 1000 THEN 'NOTICE - Cleanup Soon' ELSE 'HEALTHY' END AS health_status FROM metrics; ``` **Sample Output:** ``` active | free | tombstones | capacity | memory_mb | utilization_pct | tombstone_ratio | health_status ---------+------+------------+----------+-----------+-----------------+-----------------+--------------- 998567 | 1433 | 856 | 1000000 | 18.30 | 99.86 | 59.70 | HEALTHY ``` #### Performance Degradation Detection ```sql -- Detect query performance issues WITH baseline AS ( SELECT 2.1 AS expected_ms -- Baseline query time ), current AS ( SELECT biscuit_get_tombstone_count() AS tombstones, -- Simulate query time (in production, measure actual queries) 2.1 + (biscuit_get_tombstone_count()::NUMERIC / 1000) * 0.2 AS estimated_ms ) SELECT tombstones, estimated_ms, ROUND(((estimated_ms - expected_ms) / expected_ms) * 100, 2) AS degradation_pct, CASE WHEN estimated_ms > expected_ms * 2 THEN 'ACTION REQUIRED' WHEN estimated_ms > expected_ms * 1.5 THEN 'DEGRADED' WHEN estimated_ms > expected_ms * 1.2 THEN 'ELEVATED' ELSE 'NORMAL' END AS performance_status FROM current, baseline; ``` #### Capacity Planning ```sql -- Project capacity needs WITH stats AS ( SELECT biscuit_get_active_count() AS current_active, (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER AS capacity, (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT AS total_inserts, (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT AS total_deletes ), growth AS ( SELECT current_active, capacity, total_inserts - total_deletes AS net_growth, CASE WHEN total_deletes > 0 THEN total_inserts::NUMERIC / total_deletes ELSE NULL END AS insert_delete_ratio FROM stats ) SELECT current_active, capacity, capacity - current_active AS available, ROUND(100.0 * current_active / capacity, 2) AS utilization_pct, net_growth, insert_delete_ratio, CASE WHEN current_active::NUMERIC / capacity > 0.9 THEN 'EXPAND SOON' WHEN current_active::NUMERIC / capacity > 0.8 THEN 'MONITOR' ELSE 'ADEQUATE' END AS capacity_status FROM growth; ``` ### Alerting Thresholds #### Critical Alerts ```sql -- Immediate action required DO $ DECLARE tombstone_count INTEGER; active_count INTEGER; capacity INTEGER; BEGIN tombstone_count := biscuit_get_tombstone_count(); active_count := biscuit_get_active_count(); capacity := (regexp_matches(biscuit_index_status(), 'Total Slots: ([0-9]+)'))[1]::INTEGER; IF tombstone_count > 10000 THEN RAISE WARNING 'CRITICAL: Tombstone count % exceeds 10,000. Run biscuit_cleanup() immediately.', tombstone_count; END IF; IF active_count::NUMERIC / capacity > 0.95 THEN RAISE WARNING 'CRITICAL: Index utilization %.2f%% exceeds 95%%. Capacity expansion imminent.', 100.0 * active_count / capacity; END IF; END $; ``` #### Warning Alerts ```sql -- Monitor closely, plan maintenance DO $ DECLARE tombstone_count INTEGER; memory_mb NUMERIC; BEGIN tombstone_count := biscuit_get_tombstone_count(); memory_mb := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC; IF tombstone_count BETWEEN 5000 AND 10000 THEN RAISE NOTICE 'WARNING: Tombstone count % in elevated range. Schedule cleanup.', tombstone_count; END IF; -- Assuming 1M records should use ~20MB IF memory_mb > 30 THEN RAISE NOTICE 'WARNING: Memory usage %.2f MB higher than expected. Check for leaks.', memory_mb; END IF; END $; ``` ### Diagnostic Procedures #### Issue: Slow Query Performance **Diagnosis:** ```sql -- Check tombstone overhead SELECT biscuit_get_tombstone_count(); -- Check memory usage SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC; -- Review pattern complexity EXPLAIN ANALYZE SELECT * FROM biscuit_match('%complex%pattern%'); ``` **Resolution:** ```sql -- If tombstones > 5000 SELECT biscuit_cleanup(); -- If memory excessive SELECT biscuit_build_index('table', 'column', 'pk'); -- Rebuild ``` #### Issue: Index Out of Sync **Diagnosis:** ```sql -- Compare counts SELECT COUNT(*) AS table_count, biscuit_get_active_count() AS index_count FROM your_table; -- Check trigger status SELECT tgname, tgenabled FROM pg_trigger WHERE tgname = 'biscuit_auto_update'; ``` **Resolution:** ```sql -- Rebuild index SELECT biscuit_build_index('table', 'column', 'pk'); SELECT biscuit_enable_triggers(); ``` #### Issue: High Memory Usage **Diagnosis:** ```sql -- Detailed memory breakdown SELECT biscuit_index_status(); -- Check for long strings SELECT MAX(LENGTH(column)), AVG(LENGTH(column)) FROM table; ``` **Resolution:** ```sql -- Rebuild to compact SELECT biscuit_build_index('table', 'column', 'pk'); -- Consider limiting indexed length (future enhancement) ``` ### Logging and Audit Trail #### Enable PostgreSQL Logging ```sql -- Log slow queries SET log_min_duration_statement = 100; -- Log queries >100ms -- Log BISCUIT operations SET client_min_messages = NOTICE; -- View logs SELECT * FROM pg_stat_statements WHERE query LIKE '%biscuit%' ORDER BY mean_exec_time DESC; ``` #### Custom Monitoring Table ```sql -- Create monitoring log CREATE TABLE biscuit_metrics ( logged_at TIMESTAMPTZ DEFAULT NOW(), active_count INTEGER, free_slots INTEGER, tombstone_count INTEGER, memory_mb NUMERIC, query_count BIGINT, insert_count BIGINT, update_count BIGINT, delete_count BIGINT ); -- Periodic logging (via cron or pg_cron) INSERT INTO biscuit_metrics SELECT NOW(), biscuit_get_active_count(), biscuit_get_free_slots(), biscuit_get_tombstone_count(), (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC, (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT, (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT, (regexp_matches(biscuit_index_status(), 'Updates: ([0-9]+)'))[1]::BIGINT, (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT; -- Analyze trends SELECT DATE_TRUNC('hour', logged_at) AS hour, AVG(tombstone_count) AS avg_tombstones, MAX(tombstone_count) AS max_tombstones, AVG(memory_mb) AS avg_memory FROM biscuit_metrics WHERE logged_at > NOW() - INTERVAL '7 days' GROUP BY 1 ORDER BY 1 DESC; ``` --- ## Optimization Guidelines ### Query Optimization #### 1. Pattern Design Best Practices **Optimal Patterns:** ```sql -- ✓ GOOD: Specific prefix SELECT * FROM biscuit_match('user_admin%'); -- ✓ GOOD: Specific suffix SELECT * FROM biscuit_match('%@company.com'); -- ✓ GOOD: Single containment SELECT * FROM biscuit_match('%error%'); -- ⚠ ACCEPTABLE: Two-part pattern SELECT * FROM biscuit_match('%user%admin%'); -- ⚠ SLOWER: Three-part pattern SELECT * FROM biscuit_match('%a%b%c%'); -- ✗ AVOID: Excessive parts (4+) SELECT * FROM biscuit_match('%a%b%c%d%e%'); ``` **Pattern Complexity Impact:** ``` Parts Query Time Recommendation ----- ---------- -------------- 1 ~2ms Optimal 2 ~5ms Excellent 3 ~12ms Good 4 ~25ms Acceptable 5+ ~50ms+ Avoid if possible ``` #### 2. Query Structure Optimization **Use Count for Existence Checks:** ```sql -- ✗ AVOID: Materializing entire result set SELECT COUNT(*) FROM biscuit_match('%pattern%'); -- ✓ BETTER: Direct count SELECT biscuit_match_count('%pattern%'); ``` **Apply Filters After Pattern Match:** ```sql -- ✓ OPTIMAL: Pattern match first, then filter SELECT * FROM biscuit_match('%@gmail.com%') WHERE created_at > '2024-01-01' AND status = 'active'; -- ✗ SUBOPTIMAL: Complex joins before pattern match SELECT * FROM complex_view v WHERE v.email LIKE '%@gmail.com%'; -- Falls back to sequential scan ``` **Leverage Result Reuse:** ```sql -- ✓ OPTIMAL: Use CTE for multiple operations on same pattern WITH matches AS ( SELECT * FROM biscuit_match('%search%') ) SELECT (SELECT COUNT(*) FROM matches) AS total, (SELECT COUNT(*) FROM matches WHERE category = 'A') AS category_a, (SELECT AVG(price) FROM matches) AS avg_price; ``` #### 3. Index Selectivity Considerations **High Selectivity (Few Matches):** ```sql -- Excellent: ~0.1% selectivity SELECT * FROM biscuit_match('unique_prefix_%'); ``` **Low Selectivity (Many Matches):** ```sql -- Consider adding filters or reformulating query SELECT * FROM biscuit_match('%') -- Returns everything WHERE additional_criteria; ``` **Selectivity Guidelines:** ``` Match % Query Performance Recommendation ------- ----------------- -------------- <1% Excellent Ideal use case 1-10% Very Good Well-suited 10-30% Good Acceptable 30-50% Fair Add filters >50% Poor Reconsider approach ``` ### CRUD Optimization #### 1. Bulk Insert Strategy **Small Batches (<1000 rows):** ```sql -- Keep triggers enabled INSERT INTO products (id, name) VALUES (1, 'Product 1'), (2, 'Product 2'), ... (1000, 'Product 1000'); -- Index updates happen automatically ``` **Large Batches (>10,000 rows):** ```sql -- Disable triggers, bulk insert, rebuild SELECT biscuit_disable_triggers(); COPY products FROM '/data/products.csv' CSV; -- or INSERT INTO products SELECT * FROM staging_table; SELECT biscuit_build_index('products', 'name', 'id'); SELECT biscuit_enable_triggers(); ``` **Performance Comparison:** ``` 10,000 records: With triggers: ~2 seconds Disable/rebuild: ~1 second (50% faster) 100,000 records: With triggers: ~20 seconds Disable/rebuild: ~8 seconds (60% faster) 1,000,000 records: With triggers: ~220 seconds Disable/rebuild: ~85 seconds (61% faster) ``` #### 2. Update Optimization **Maximize Incremental Updates:** ```sql -- ✓ GOOD: Same-length updates (incremental) UPDATE users SET email = 'newemail@example.com' -- 21 chars WHERE email = 'oldemail@example.com'; -- 21 chars -- ⚠ ACCEPTABLE: Different-length (full reindex) UPDATE users SET email = 'verylongemail@example.com' -- 27 chars WHERE email = 'short@ex.co'; -- 11 chars ``` **Batch Updates:** ```sql -- For large update batches SELECT biscuit_disable_triggers(); UPDATE products SET name = UPPER(name); SELECT biscuit_build_index('products', 'name', 'id'); SELECT biscuit_enable_triggers(); ``` #### 3. Delete Strategy **Individual Deletes:** ```sql -- Lazy deletion is O(1), no optimization needed DELETE FROM products WHERE id = 123; ``` **Bulk Deletes:** ```sql -- Option 1: Let lazy deletion handle (recommended for <10K deletes) DELETE FROM products WHERE category = 'obsolete'; -- Option 2: Disable/rebuild (for >100K deletes) SELECT biscuit_disable_triggers(); DELETE FROM products WHERE created_at < '2020-01-01'; SELECT biscuit_build_index('products', 'name', 'id'); SELECT biscuit_enable_triggers(); ``` **Cleanup Scheduling:** ```sql -- Schedule periodic cleanup during low-traffic hours -- Example: Every night at 2 AM via pg_cron SELECT cron.schedule('biscuit-cleanup', '0 2 * * *', 'SELECT biscuit_cleanup()'); ``` ### Memory Optimization #### 1. Monitor Memory Growth ```sql -- Track memory over time SELECT NOW() AS checked_at, biscuit_get_active_count() AS records, (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb, ROUND((regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC / NULLIF(biscuit_get_active_count(), 0) * 1024, 2) AS kb_per_record; ``` #### 2. Memory Reduction Strategies **Rebuild Compaction:** ```sql -- Periodic rebuild to compact fragmentation SELECT biscuit_build_index('table', 'column', 'pk'); ``` **Character Diversity Reduction (Application Level):** ```sql -- Normalize data before indexing ALTER TABLE users ADD COLUMN email_normalized TEXT; UPDATE users SET email_normalized = LOWER(email); SELECT biscuit_setup('users', 'email_normalized', 'id'); ``` ### Maintenance Optimization #### 1. Tombstone Management **Adaptive Threshold Tuning:** ```sql -- Current threshold: 1000 (default) -- Adjust based on workload: -- Read-heavy (Q/D ratio > 20): Increase threshold -- Write-heavy (Q/D ratio < 10): Decrease threshold -- Balanced: Keep default -- Manual threshold adjustment (via C code recompilation) #define TOMBSTONE_CLEANUP_THRESHOLD 5000 // Adjust value ``` **Proactive Cleanup:** ```sql -- Before performance-critical operations SELECT biscuit_cleanup(); -- Before maintenance windows SELECT biscuit_cleanup(); ``` #### 2. Index Rebuild Strategy **When to Rebuild:** - After bulk operations (>10% of table size) - Memory usage exceeds expectations by >50% - After schema changes - Periodic maintenance (monthly/quarterly) **Rebuild Procedure:** ```sql BEGIN; SELECT biscuit_disable_triggers(); -- Perform any table maintenance SELECT biscuit_build_index('table', 'column', 'pk'); SELECT biscuit_enable_triggers(); COMMIT; ``` ### Concurrency Optimization #### 1. Read Concurrency **Queries are lock-free:** ```sql -- Multiple concurrent queries: No contention SELECT * FROM biscuit_match('%pattern1%'); -- Session 1 SELECT * FROM biscuit_match('%pattern2%'); -- Session 2 SELECT * FROM biscuit_match('%pattern3%'); -- Session 3 -- All execute concurrently without blocking ``` #### 2. Write Concurrency **Minimize trigger overhead:** ```sql -- ✓ OPTIMAL: Batch small inserts INSERT INTO products SELECT * FROM new_products_batch; -- ⚠ SUBOPTIMAL: Individual inserts in loop -- Each insert triggers index update ``` **Transaction Batching:** ```sql -- Group related operations BEGIN; INSERT INTO products VALUES (...); UPDATE products SET name = ... WHERE id = ...; DELETE FROM products WHERE id = ...; COMMIT; -- All index updates happen at commit ``` ### Performance Testing #### Benchmark Query Performance ```sql -- Timing template \timing on SELECT * FROM biscuit_match('%pattern%'); \timing off -- With EXPLAIN ANALYZE EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM biscuit_match('%pattern%'); ``` #### Compare with Traditional Methods ```sql -- BISCUIT \timing on SELECT COUNT(*) FROM biscuit_match('%search%'); \timing off -- Traditional LIKE \timing on SELECT COUNT(*) FROM table WHERE column LIKE '%search%'; \timing off -- GIN Index \timing on SELECT COUNT(*) FROM table WHERE column ILIKE '%search%'; \timing off ``` ```sql -- ✗ NOT SUPPORTED SELECT * FROM biscuit_match('[0-9]+'); SELECT * FROM biscuit_match('(abc|xyz)'); -- ✓ EQUIVALENT WORKAROUNDS SELECT * FROM biscuit_match('abc%') UNION SELECT * FROM biscuit_match('xyz%'); ``` ### Architectural Constraints #### 1. Memory Residency **Index Location:** Shared memory (non-persistent) **Implications:** - Index lost on server restart - Must rebuild after restart - No automatic persistence **Startup Procedure:** ```sql -- Add to database initialization script \c your_database SELECT biscuit_setup('table1', 'column1', 'pk1'); SELECT biscuit_setup('table2', 'column2', 'pk2'); ``` #### 2. Single Index Instance **Limitation:** Only one BISCUIT index per PostgreSQL instance **Implications:** ```sql -- ✗ Cannot maintain multiple indexes simultaneously SELECT biscuit_setup('table1', 'col1', 'id'); SELECT biscuit_setup('table2', 'col2', 'id'); -- Replaces table1 index -- ✓ Workaround: Separate databases CREATE DATABASE db1; \c db1 SELECT biscuit_setup('table', 'column', 'id'); CREATE DATABASE db2; \c db2 SELECT biscuit_setup('table', 'column', 'id'); ``` #### 3. Transaction Isolation **Current Behavior:** Index updates commit with transaction **Limitation:** No snapshot isolation for queries ```sql BEGIN; INSERT INTO products VALUES (1, 'New Product'); -- Index updated in current transaction SELECT * FROM biscuit_match('%New%'); -- Sees new product ROLLBACK; -- Index reverted (product removed) SELECT * FROM biscuit_match('%New%'); -- No longer sees product ``` **Note:** This is expected behavior and consistent with PostgreSQL's MVCC model at the trigger level. #### 4. Single Column Per Index **Current Design:** One index per table, one column per index ```sql -- ✗ Cannot index multiple columns simultaneously -- Must choose one column -- ✓ Workaround: Concatenated column ALTER TABLE products ADD COLUMN search_text TEXT GENERATED ALWAYS AS (name || ' ' || description || ' ' || sku) STORED; SELECT biscuit_setup('products', 'search_text', 'id'); ``` #### 5. NULL Handling **Behavior:** NULL values treated as empty strings ```sql INSERT INTO products (id, name) VALUES (1, NULL); -- NULL treated as '' SELECT * FROM biscuit_match('%'); -- Includes row 1 SELECT * FROM biscuit_match('%%'); -- Includes row 1 SELECT biscuit_match_count(''); -- Counts row 1 if name is NULL ``` ### Performance Limitations #### 1. Pattern Complexity Scaling **Complexity:** O(k × m), k = pattern parts, m = search window ```sql -- Linear growth in query time with pattern parts SELECT biscuit_match('%a%'); -- ~2ms SELECT biscuit_match('%a%b%'); -- ~5ms SELECT biscuit_match('%a%b%c%'); -- ~12ms SELECT biscuit_match('%a%b%c%d%'); -- ~25ms SELECT biscuit_match('%a%b%c%d%e%'); -- ~52ms ``` **Recommendation:** Limit patterns to 3-4 parts for optimal performance #### 2. Write Concurrency **Limitation:** Trigger-based updates serialize at row level **Impact:** ``` High-concurrency writes (>10K/sec): May experience contention Trigger overhead accumulates Solution: Batch writes Disable triggers for bulk operations Consider application-level queue ``` #### 3. Memory Growth **Behavior:** Memory grows with: - Record count - String length diversity - Character diversity - Tombstone accumulation ```sql -- Memory bounds Minimum: 10% of indexed data size Maximum: 50% of indexed data size Typical: 20-25% of indexed data size ``` **Mitigation:** ```sql -- Periodic rebuild for compaction SELECT biscuit_build_index('table', 'column', 'pk'); -- Cleanup tombstones SELECT biscuit_cleanup(); ``` ### Compatibility Constraints #### 1. PostgreSQL Version **Minimum:** PostgreSQL 11.0 **Tested Versions:** - PostgreSQL 11.x: ✓ Supported - PostgreSQL 12.x: ✓ Supported - PostgreSQL 13.x: ✓ Supported - PostgreSQL 14.x: ✓ Supported - PostgreSQL 15.x: ✓ Supported - PostgreSQL 16.x: ✓ Supported (expected) #### 2. Data Type Support **Primary Key Types:** - Fully Supported: INT2, INT4, INT8, TEXT, VARCHAR, CHAR, UUID - Generic Support: All types with output functions - Not Supported: Composite types, arrays (without custom casting) **Indexed Column Types:** - Required: TEXT, VARCHAR, CHAR, or castable to TEXT - Not Supported: Binary types (BYTEA), JSON, arrays without casting #### 3. Extension Dependencies **Required:** - PostgreSQL server-side C API - Standard PostgreSQL libraries **Optional:** - CRoaring library (for optimized bitmaps) - Falls back to built-in implementation if unavailable ### Security Considerations #### 1. Permissions **Required Privileges:** ```sql -- Index creation requires GRANT CREATE ON DATABASE TO user; -- For extension GRANT ALL ON TABLE target_table TO user; -- For triggers -- Query execution requires GRANT SELECT ON TABLE target_table TO user; ``` #### 2. SQL Injection **Safe Query Patterns:** ```sql -- ✓ SAFE: Parameterized queries PREPARE search_stmt AS SELECT * FROM biscuit_match($1); EXECUTE search_stmt('%user_input%'); -- ✓ SAFE: Application sanitization pattern = sanitize_input(user_input) query = f"SELECT * FROM biscuit_match('{pattern}')" ``` **Unsafe Patterns:** ```sql -- ✗ UNSAFE: Direct string concatenation SELECT * FROM biscuit_match('%' || user_input || '%'); ``` #### 3. Resource Exhaustion **Potential Vectors:** - Complex patterns (>5 parts): High CPU usage - Very long patterns: Memory allocation - Rapid index rebuilds: Memory fragmentation **Mitigation:** ```sql -- Rate limit index rebuilds -- Validate pattern complexity before execution -- Set statement timeout SET statement_timeout = '5s'; ``` ### Known Issues #### 1. Index Persistence **Issue:** Index not persistent across restarts **Status:** Design limitation (in-memory structure) **Workaround:** Rebuild on startup via initialization script #### 2. Multiple Index Support **Issue:** Only one index per PostgreSQL instance **Status:** Architectural limitation **Workaround:** Use multiple databases or implement queuing #### 3. Incremental Update Threshold **Issue:** Updates with >20% character changes trigger full reindex **Status:** Configurable threshold (requires C code modification) **Workaround:** Design application to minimize large updates ### Future Enhancements (Roadmap) **Planned:** - Persistent index storage - Multiple concurrent indexes - Case-insensitive mode option - Configurable position limit (>256 chars) - Pattern query optimizer - Index statistics collection **Under Consideration:** - Fuzzy matching support - Phonetic matching - Unicode normalization - Multi-column composite indexes - Parallel query execution - Incremental checkpoint/restore --- ## Troubleshooting ### Common Issues and Solutions #### Issue 1: "Index not built" Error **Symptom:** ```sql SELECT * FROM biscuit_match('%pattern%'); ERROR: Index not built. Call biscuit_build_index() first. ``` **Cause:** Index not initialized or server restarted **Solution:** ```sql -- Option 1: Quick setup SELECT biscuit_setup('table_name', 'column_name', 'pk_column'); -- Option 2: Manual build SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); SELECT biscuit_create_match_function(); SELECT biscuit_enable_triggers(); ``` **Prevention:** ```sql -- Add to database startup script -- File: /etc/postgresql/14/main/init_biscuit.sql \c your_database SELECT biscuit_setup('table1', 'column1', 'pk1'); ``` --- #### Issue 2: Query Returns No Results **Symptom:** ```sql SELECT biscuit_match_count('%known_value%'); -- Returns: 0 (expected >0) ``` **Diagnosis:** ```sql -- Check if value exists in table SELECT COUNT(*) FROM table_name WHERE column_name LIKE '%known_value%'; -- Check index status SELECT biscuit_index_status(); -- Check trigger status SELECT tgname, tgenabled FROM pg_trigger WHERE tgname = 'biscuit_auto_update'; ``` **Possible Causes and Solutions:** **A. Index Out of Sync** ```sql -- Rebuild index SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); ``` **B. Case Sensitivity** ```sql -- Pattern is case-sensitive SELECT * FROM biscuit_match('%VALUE%'); -- Won't match "value" -- Solution: Normalize SELECT * FROM biscuit_match(LOWER('%VALUE%')); ``` **C. Pattern Beyond 256 Characters** ```sql -- Match occurs after position 256 SELECT LENGTH(column_name) FROM table_name WHERE id = 123; -- Returns: 280 -- Solution: Pattern must be within first 256 chars ``` --- #### Issue 3: Slow Query Performance **Symptom:** ```sql \timing on SELECT * FROM biscuit_match('%pattern%'); Time: 850.234 ms -- Expected: <10ms ``` **Diagnosis:** ```sql -- Check tombstone count SELECT biscuit_get_tombstone_count(); -- If >5000: Cleanup needed -- Check pattern complexity -- Count % symbols SELECT regexp_count('%a%b%c%d%', '%'); -- If >4: Pattern too complex -- Check memory usage SELECT biscuit_index_status(); ``` **Solutions:** **A. High Tombstone Count** ```sql SELECT biscuit_cleanup(); -- Should restore normal performance ``` **B. Complex Pattern** ```sql -- Simplify pattern -- From: SELECT * FROM biscuit_match('%a%b%c%d%e%'); -- To: SELECT * FROM biscuit_match('%abc%de%'); ``` **C. Large Result Set** ```sql -- Check selectivity SELECT biscuit_match_count('%pattern%'); -- If >50% of table: Not ideal for index -- Solution: Add additional filters SELECT * FROM biscuit_match('%pattern%') WHERE created_at > '2024-01-01'; ``` --- #### Issue 4: High Memory Usage **Symptom:** ```sql SELECT biscuit_index_status(); -- Memory: 500 MB (expected ~20 MB for dataset) ``` **Diagnosis:** ```sql -- Check record count SELECT biscuit_get_active_count(); -- Check string characteristics SELECT COUNT(*) AS records, AVG(LENGTH(column_name)) AS avg_length, MAX(LENGTH(column_name)) AS max_length, COUNT(DISTINCT column_name) AS unique_values FROM table_name; -- Check tombstone accumulation SELECT biscuit_get_tombstone_count(); ``` **Solutions:** **A. Tombstone Accumulation** ```sql SELECT biscuit_cleanup(); -- Then check memory again ``` **B. Memory Fragmentation** ```sql -- Rebuild to compact SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); ``` **C. Unexpected Data Characteristics** ```sql -- Very long strings or high diversity -- Consider normalizing data UPDATE table_name SET column_name = LEFT(column_name, 200); SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); ``` --- #### Issue 5: Trigger Not Firing **Symptom:** ```sql INSERT INTO table_name (id, column_name) VALUES (999, 'test'); SELECT * FROM biscuit_match('%test%'); -- Returns: no rows (expected 1 row) ``` **Diagnosis:** ```sql -- Check trigger existence SELECT tgname, tgenabled, tgisinternal FROM pg_trigger WHERE tgrelid = 'table_name'::regclass AND tgname = 'biscuit_auto_update'; -- Check for errors in logs SHOW log_destination; -- Review PostgreSQL logs for trigger errors ``` **Solutions:** **A. Trigger Disabled** ```sql SELECT biscuit_enable_triggers(); ``` **B. Trigger Missing** ```sql SELECT biscuit_setup('table_name', 'column_name', 'pk_column'); -- Or manually SELECT biscuit_enable_triggers(); ``` **C. Trigger Error** ```sql -- Check PostgreSQL logs -- Common causes: -- - NULL primary key -- - Type conversion error -- - Memory allocation failure -- Rebuild to reset SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); SELECT biscuit_enable_triggers(); ``` --- #### Issue 6: "NULL primary key" Error **Symptom:** ```sql INSERT INTO table_name (column_name) VALUES ('test'); ERROR: NULL primary key in INSERT ``` **Cause:** Primary key column contains NULL **Solution:** **A. Use Serial/Identity Column** ```sql ALTER TABLE table_name ALTER COLUMN id ADD GENERATED ALWAYS AS IDENTITY; ``` **B. Set Default Value** ```sql ALTER TABLE table_name ALTER COLUMN id SET DEFAULT nextval('table_name_id_seq'); ``` **C. Ensure Non-NULL Constraint** ```sql ALTER TABLE table_name ALTER COLUMN id SET NOT NULL; ``` --- #### Issue 7: Out of Memory **Symptom:** ```sql SELECT biscuit_build_index('large_table', 'text_column', 'id'); ERROR: out of memory ``` **Diagnosis:** ```sql -- Check table size SELECT pg_size_pretty(pg_total_relation_size('large_table')) AS table_size, COUNT(*) AS row_count, AVG(LENGTH(text_column)) AS avg_length FROM large_table; -- Check available memory SHOW shared_buffers; SHOW work_mem; ``` **Solutions:** **A. Increase Memory Allocation** ```sql -- In postgresql.conf shared_buffers = 2GB -- Increase from default work_mem = 256MB -- Increase from default -- Restart PostgreSQL sudo systemctl restart postgresql ``` **B. Partition Table** ```sql -- Create partitioned table CREATE TABLE large_table_partitioned ( id SERIAL PRIMARY KEY, text_column TEXT, created_at DATE ) PARTITION BY RANGE (created_at); -- Create partitions CREATE TABLE large_table_2023 PARTITION OF large_table_partitioned FOR VALUES FROM ('2023-01-01') TO ('2024-01-01'); -- Index each partition separately SELECT biscuit_setup('large_table_2023', 'text_column', 'id'); ``` **C. Reduce String Diversity** ```sql -- Normalize before indexing UPDATE large_table SET text_column = LOWER(text_column); SELECT biscuit_build_index('large_table', 'text_column', 'id'); ``` --- #### Issue 8: Index-Table Mismatch **Symptom:** ```sql SELECT COUNT(*) FROM table_name; -- Returns: 100000 SELECT biscuit_get_active_count(); -- Returns: 95000 (mismatch!) ``` **Diagnosis:** ```sql -- Check trigger status SELECT tgenabled FROM pg_trigger WHERE tgname = 'biscuit_auto_update'; -- Check for errors SELECT biscuit_index_status(); -- Manual comparison WITH table_data AS ( SELECT COUNT(*) AS cnt FROM table_name ), index_data AS ( SELECT biscuit_get_active_count() AS cnt ) SELECT t.cnt AS table_count, i.cnt AS index_count, t.cnt - i.cnt AS discrepancy FROM table_data t, index_data i; ``` **Solutions:** **A. Rebuild Index** ```sql SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); SELECT biscuit_enable_triggers(); ``` **B. Check for Concurrent Modifications** ```sql -- Ensure exclusive access during rebuild BEGIN; LOCK TABLE table_name IN EXCLUSIVE MODE; SELECT biscuit_build_index('table_name', 'column_name', 'pk_column'); COMMIT; ``` --- ### Diagnostic Queries #### Comprehensive Health Check ```sql DO $ DECLARE table_count BIGINT; index_count INTEGER; tombstone_count INTEGER; free_slots INTEGER; memory_mb NUMERIC; BEGIN EXECUTE 'SELECT COUNT(*) FROM your_table' INTO table_count; index_count := biscuit_get_active_count(); tombstone_count := biscuit_get_tombstone_count(); free_slots := biscuit_get_free_slots(); memory_mb := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC; RAISE NOTICE 'BISCUIT Health Check:'; RAISE NOTICE ' Table rows: %', table_count; RAISE NOTICE ' Index active: %', index_count; RAISE NOTICE ' Discrepancy: %', table_count - index_count; RAISE NOTICE ' Tombstones: %', tombstone_count; RAISE NOTICE ' Free slots: %', free_slots; RAISE NOTICE ' Memory: % MB', memory_mb; IF table_count != index_count THEN RAISE WARNING 'Index out of sync! Rebuild recommended.'; END IF; IF tombstone_count > 5000 THEN RAISE WARNING 'High tombstone count! Cleanup recommended.'; END IF; IF memory_mb > (index_count::NUMERIC / 1000000 * 30) THEN RAISE WARNING 'High memory usage! Check for issues.'; END IF; END $; ``` #### Performance Baseline ```sql -- Establish performance baseline \timing on SELECT biscuit_match_count('%'); -- Baseline: all records SELECT biscuit_match_count('%common_char%'); -- Baseline: common pattern SELECT biscuit_match_count('prefix%'); -- Baseline: prefix SELECT biscuit_match_count('%suffix'); -- Baseline: suffix \timing off -- Save baselines and compare periodically ``` #### Memory Leak Detection ```sql -- Monitor memory over operations SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb_before; -- Perform operations INSERT INTO table_name SELECT generate_series(1, 10000), 'test' || generate_series(1, 10000); DELETE FROM table_name WHERE id > 50000; SELECT biscuit_cleanup(); -- Check memory after SELECT (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC AS mb_after; -- Memory should be stable or decrease after cleanup ``` ### Error Code Reference | Error Message | Cause | Solution | |--------------|-------|----------| | Index not built | No index initialized | Run `biscuit_setup()` | | NULL primary key | PK value is NULL | Ensure PK constraint | | Column not found | Invalid column name | Verify table schema | | Insufficient permissions | User lacks privileges | Grant table permissions | | Out of memory | Insufficient RAM | Increase shared_buffers | | Table not found | Invalid table name | Check table existence | | Index already exists | Duplicate index | Drop and rebuild | | Trigger already exists | Duplicate trigger | Use `biscuit_disable_triggers()` first | ### Getting Help **Before Seeking Support:** 1. Run comprehensive health check (above) 2. Review PostgreSQL logs for errors 3. Check extension version: `SELECT biscuit_version()` 4. Verify PostgreSQL version: `SELECT version()` 5. Document reproduction steps **Support Channels:** - GitHub Issues: [Project Repository] - Mailing List: [Contact Information] - Documentation: [Online Docs URL] **Information to Provide:** - Extension version - PostgreSQL version - Operating system - Table schema - Reproduction steps - Error messages - Output of `SELECT biscuit_index_status()` --- ## Examples and Use Cases ### Example 1: E-Commerce Product Search **Scenario:** Product catalog with 5M products, need fast SKU and name searches **Schema:** ```sql CREATE TABLE products ( id SERIAL PRIMARY KEY, sku VARCHAR(50) NOT NULL, name TEXT NOT NULL, description TEXT, category VARCHAR(100), price NUMERIC(10, 2), created_at TIMESTAMPTZ DEFAULT NOW() ); -- Sample data INSERT INTO products (sku, name, category, price) VALUES ('ELEC-LAP-001', 'Dell XPS 15 Laptop', 'Electronics', 1299.99), ('ELEC-LAP-002', 'MacBook Pro 16"', 'Electronics', 2499.99), ('HOME-FURN-001', 'Modern Office Chair', 'Furniture', 349.99), ('HOME-FURN-002', 'Standing Desk', 'Furniture', 599.99); ``` **BISCUIT Setup:** ```sql -- Index product names SELECT biscuit_setup('products', 'name', 'id'); ``` **Query Examples:** ```sql -- Search by product name SELECT * FROM biscuit_match('%Laptop%'); -- Case-insensitive search (with normalized column) ALTER TABLE products ADD COLUMN name_lower TEXT GENERATED ALWAYS AS (LOWER(name)) STORED; SELECT biscuit_setup('products', 'name_lower', 'id'); SELECT * FROM biscuit_match('%laptop%'); -- Complex search SELECT * FROM biscuit_match('%office%chair%') WHERE price BETWEEN 200 AND 500; -- Count matches SELECT biscuit_match_count('%macbook%'); -- Top sellers matching pattern SELECT p.*, s.sales_count FROM biscuit_match('%laptop%') p JOIN sales_summary s ON p.id = s.product_id ORDER BY s.sales_count DESC LIMIT 10; ``` **Performance Comparison:** ```sql -- Traditional LIKE (sequential scan) \timing on SELECT COUNT(*) FROM products WHERE name LIKE '%Laptop%'; Time: 1850 ms -- BISCUIT SELECT biscuit_match_count('%Laptop%'); Time: 3.2 ms -- Speedup: 578× ``` --- ### Example 2: Customer Email Domain Analysis **Scenario:** Analyze customer email patterns, identify providers, find similar domains **Schema:** ```sql CREATE TABLE customers ( customer_id UUID PRIMARY KEY DEFAULT gen_random_uuid(), email TEXT NOT NULL, first_name TEXT, last_name TEXT, signup_date DATE DEFAULT CURRENT_DATE, status VARCHAR(20) DEFAULT 'active' ); -- Sample data INSERT INTO customers (email, first_name, last_name) VALUES ('john.doe@gmail.com', 'John', 'Doe'), ('jane.smith@yahoo.com', 'Jane', 'Smith'), ('bob.wilson@company.com', 'Bob', 'Wilson'), ('alice.brown@gmail.com', 'Alice', 'Brown'); ``` **BISCUIT Setup:** ```sql -- Note: UUID primary key fully supported SELECT biscuit_setup('customers', 'email', 'customer_id'); ``` **Query Examples:** ```sql -- Find all Gmail users SELECT * FROM biscuit_match('%@gmail.com'); -- Find all corporate emails (non-freemail) SELECT * FROM biscuit_match('%@%.com') WHERE email NOT LIKE '%@gmail.com' AND email NOT LIKE '%@yahoo.com' AND email NOT LIKE '%@hotmail.com'; -- Domain distribution analysis WITH email_domains AS ( SELECT customer_id, email, SUBSTRING(email FROM '@(.*)) AS domain FROM biscuit_match('%') ) SELECT domain, COUNT(*) AS customer_count, ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage FROM email_domains GROUP BY domain ORDER BY customer_count DESC LIMIT 10; -- Find emails with specific patterns SELECT * FROM biscuit_match('%.doe@%'); -- All .doe emails SELECT * FROM biscuit_match('admin%@%'); -- Starts with admin SELECT * FROM biscuit_match('%+%@%'); -- Contains plus addressing ``` **Business Intelligence Query:** ```sql -- Signup trends by email provider WITH gmail_users AS ( SELECT signup_date FROM biscuit_match('%@gmail.com') ), yahoo_users AS ( SELECT signup_date FROM biscuit_match('%@yahoo.com') ) SELECT DATE_TRUNC('month', signup_date) AS month, 'Gmail' AS provider, COUNT(*) AS signups FROM gmail_users GROUP BY 1 UNION ALL SELECT DATE_TRUNC('month', signup_date) AS month, 'Yahoo' AS provider, COUNT(*) AS signups FROM yahoo_users GROUP BY 1 ORDER BY month DESC, provider; ``` --- ### Example 3: Log Analysis and Error Detection **Scenario:** 50M log entries, need to quickly find error patterns and anomalies **Schema:** ```sql CREATE TABLE application_logs ( log_id BIGSERIAL PRIMARY KEY, timestamp TIMESTAMPTZ DEFAULT NOW(), level VARCHAR(10), message TEXT, service VARCHAR(50), user_id INTEGER ); -- Sample data INSERT INTO application_logs (level, message, service, user_id) VALUES ('INFO', 'User login successful', 'auth', 12345), ('ERROR', 'Database connection timeout', 'api', 67890), ('WARN', 'Rate limit approaching', 'api', 12345), ('ERROR', 'Failed to connect to payment gateway', 'payments', 11111); ``` **BISCUIT Setup:** ```sql SELECT biscuit_setup('application_logs', 'message', 'log_id'); ``` **Query Examples:** ```sql -- Find all connection errors SELECT * FROM biscuit_match('%connection%') WHERE level = 'ERROR' ORDER BY timestamp DESC LIMIT 100; -- Find timeout issues across all services SELECT service, COUNT(*) AS timeout_count, MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen FROM biscuit_match('%timeout%') WHERE timestamp > NOW() - INTERVAL '24 hours' GROUP BY service ORDER BY timeout_count DESC; -- Complex error pattern SELECT * FROM biscuit_match('%failed%connect%') WHERE level IN ('ERROR', 'CRITICAL') AND timestamp > NOW() - INTERVAL '1 hour' ORDER BY timestamp DESC; -- Error correlation WITH error_patterns AS ( SELECT log_id, message, timestamp, CASE WHEN message LIKE '%timeout%' THEN 'timeout' WHEN message LIKE '%connection%' THEN 'connection' WHEN message LIKE '%memory%' THEN 'memory' ELSE 'other' END AS error_category FROM biscuit_match('%error%') WHERE timestamp > NOW() - INTERVAL '24 hours' ) SELECT error_category, COUNT(*) AS occurrences, COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage FROM error_patterns GROUP BY error_category ORDER BY occurrences DESC; ``` **Real-Time Monitoring:** ```sql -- Create monitoring view CREATE OR REPLACE VIEW recent_errors AS SELECT log_id, timestamp, level, message, service, user_id FROM biscuit_match('%error%') WHERE timestamp > NOW() - INTERVAL '5 minutes' ORDER BY timestamp DESC; -- Query view SELECT * FROM recent_errors; ``` --- ### Example 4: Genetic Sequence Matching **Scenario:** Bioinformatics database with DNA sequences, need pattern matching **Schema:** ```sql CREATE TABLE dna_sequences ( sequence_id VARCHAR(50) PRIMARY KEY, organism TEXT, sequence TEXT, -- ATCG sequence length INTEGER, discovered_date DATE ); -- Sample data INSERT INTO dna_sequences VALUES ('SEQ001', 'E. coli', 'ATCGATCGATCG', 12, '2024-01-15'), ('SEQ002', 'Human', 'GCTAGCTAGCTA', 12, '2024-01-16'), ('SEQ003', 'E. coli', 'ATCGGGCCCATCG', 13, '2024-01-17'); ``` **BISCUIT Setup:** ```sql SELECT biscuit_setup('dna_sequences', 'sequence', 'sequence_id'); ``` **Query Examples:** ```sql -- Find sequences with specific motif SELECT * FROM biscuit_match('%ATCG%'); -- Find palindromic sequences (conceptual) SELECT * FROM biscuit_match('%GATC%'); -- Start codon -- Complex pattern matching SELECT * FROM biscuit_match('%ATC%GGG%') WHERE organism = 'E. coli'; -- Count sequences by pattern SELECT 'Contains ATCG' AS pattern, biscuit_match_count('%ATCG%') AS count UNION ALL SELECT 'Starts with GC', biscuit_match_count('GC%') UNION ALL SELECT 'Ends with TA', biscuit_match_count('%TA'); ``` --- ### Example 5: Social Media Username Search **Scenario:** Social platform with 100M users, need fast username search **Schema:** ```sql CREATE TABLE users ( user_id BIGSERIAL PRIMARY KEY, username VARCHAR(50) UNIQUE NOT NULL, display_name TEXT, bio TEXT, created_at TIMESTAMPTZ DEFAULT NOW(), follower_count INTEGER DEFAULT 0 ); -- Sample data INSERT INTO users (username, display_name, bio, follower_count) VALUES ('john_doe_123', 'John Doe', 'Software engineer', 1250), ('jane_smith', 'Jane Smith', 'Data scientist', 3400), ('tech_guru_99', 'Tech Guru', 'Tech enthusiast', 15000); ``` **BISCUIT Setup:** ```sql -- Index lowercase usernames for case-insensitive search ALTER TABLE users ADD COLUMN username_lower TEXT GENERATED ALWAYS AS (LOWER(username)) STORED; SELECT biscuit_setup('users', 'username_lower', 'user_id'); ``` **Query Examples:** ```sql -- Search usernames SELECT user_id, username, display_name, follower_count FROM biscuit_match('%tech%') ORDER BY follower_count DESC; -- Find available similar usernames WITH desired_username AS ( SELECT 'john_doe' AS base ), existing_matches AS ( SELECT username FROM biscuit_match('john_doe%') ) SELECT base || '_' || generate_series(1, 10) AS suggestion FROM desired_username WHERE base || '_' || generate_series(1, 10) NOT IN ( SELECT username FROM existing_matches ) LIMIT 5; -- Find influencers matching pattern SELECT username, display_name, follower_count FROM biscuit_match('%guru%') WHERE follower_count > 10000 ORDER BY follower_count DESC; -- Username analytics WITH username_patterns AS ( SELECT CASE WHEN username LIKE '%\_\_\_%' THEN 'has_underscores' WHEN username LIKE '%[0-9]%' THEN 'has_numbers' ELSE 'alphanumeric' END AS pattern_type, follower_count FROM biscuit_match('%') ) SELECT pattern_type, COUNT(*) AS user_count, AVG(follower_count) AS avg_followers FROM username_patterns GROUP BY pattern_type; ``` --- ### Example 6: Document Management System **Scenario:** Legal document repository, need full-text-like search on titles **Schema:** ```sql CREATE TABLE documents ( doc_id SERIAL PRIMARY KEY, title TEXT NOT NULL, document_type VARCHAR(50), file_path TEXT, uploaded_by INTEGER, uploaded_at TIMESTAMPTZ DEFAULT NOW(), case_number VARCHAR(50) ); -- Sample data INSERT INTO documents (title, document_type, case_number) VALUES ('Contract Agreement - Smith vs Johnson', 'Contract', 'CASE-2024-001'), ('Patent Application for AI Technology', 'Patent', 'PAT-2024-042'), ('Lease Agreement - Commercial Property', 'Lease', 'CASE-2024-002'); ``` **BISCUIT Setup:** ```sql -- Create searchable column with normalized text ALTER TABLE documents ADD COLUMN title_search TEXT GENERATED ALWAYS AS (LOWER(REGEXP_REPLACE(title, '[^a-z0-9 ]', '', 'gi'))) STORED; SELECT biscuit_setup('documents', 'title_search', 'doc_id'); ``` **Query Examples:** ```sql -- Find contract documents SELECT * FROM biscuit_match('%contract%'); -- Multi-term search (all terms must appear) SELECT * FROM biscuit_match('%lease%commercial%') WHERE document_type = 'Lease'; -- Case document search SELECT doc_id, title, case_number, uploaded_at FROM biscuit_match('%smith%johnson%') ORDER BY uploaded_at DESC; -- Document type distribution WITH doc_matches AS ( SELECT document_type, COUNT(*) AS cnt FROM biscuit_match('%agreement%') GROUP BY document_type ) SELECT document_type, cnt, ROUND(100.0 * cnt / SUM(cnt) OVER (), 2) AS percentage FROM doc_matches ORDER BY cnt DESC; ``` --- ### Example 7: Bulk Data Migration with BISCUIT **Scenario:** Migrating 10M records from legacy system **Migration Script:** ```sql -- Step 1: Create target table CREATE TABLE products_new ( id INTEGER PRIMARY KEY, sku TEXT, name TEXT, description TEXT ); -- Step 2: Prepare BISCUIT (without triggers for bulk) SELECT biscuit_build_index('products_new', 'name', 'id'); -- Step 3: Bulk load data \timing on SELECT biscuit_disable_triggers(); COPY products_new FROM '/data/products.csv' CSV HEADER; -- Alternative: INSERT SELECT from staging Time: 45,000 ms -- Step 4: Rebuild index SELECT biscuit_build_index('products_new', 'name', 'id'); Time: 38,000 ms -- Step 5: Enable triggers for future updates SELECT biscuit_enable_triggers(); \timing off -- Step 6: Verify SELECT COUNT(*) AS table_count, biscuit_get_active_count() AS index_count, biscuit_get_free_slots() AS free_slots FROM products_new; -- Step 7: Test queries SELECT biscuit_match_count('%'); -- Should match table count ``` --- ### Example 8: Multi-Pattern Search Interface **Scenario:** Build API endpoint that accepts multiple search patterns **Application Code (Python + psycopg2):** ```python import psycopg2 from typing import List, Dict def multi_pattern_search(patterns: List[str], table: str) -> List[Dict]: """ Search for records matching any of the provided patterns """ conn = psycopg2.connect("dbname=mydb user=myuser") cur = conn.cursor() # Build UNION query for multiple patterns union_queries = [] for pattern in patterns: union_queries.append( f"SELECT * FROM biscuit_match(%s)" ) query = " UNION ".join(union_queries) cur.execute(query, patterns) results = cur.fetchall() cur.close() conn.close() return results # Usage results = multi_pattern_search( patterns=['%laptop%', '%macbook%', '%computer%'], table='products' ) ``` **SQL Equivalent:** ```sql -- Search multiple patterns SELECT * FROM biscuit_match('%laptop%') UNION SELECT * FROM biscuit_match('%macbook%') UNION SELECT * FROM biscuit_match('%computer%') ORDER BY id; -- Or with CTE for clarity WITH pattern1 AS (SELECT * FROM biscuit_match('%laptop%')), pattern2 AS (SELECT * FROM biscuit_match('%macbook%')), pattern3 AS (SELECT * FROM biscuit_match('%computer%')) SELECT * FROM pattern1 UNION SELECT * FROM pattern2 UNION SELECT * FROM pattern3; ``` --- ### Example 9: Performance Monitoring Dashboard **Scenario:** Real-time monitoring of BISCUIT performance **Monitoring Queries:** ```sql -- Create monitoring function CREATE OR REPLACE FUNCTION biscuit_performance_metrics() RETURNS TABLE ( metric TEXT, value NUMERIC, unit TEXT, status TEXT ) AS $ DECLARE active_ct INTEGER; tombstone_ct INTEGER; free_ct INTEGER; memory_val NUMERIC; query_ct BIGINT; insert_ct BIGINT; update_ct BIGINT; delete_ct BIGINT; BEGIN active_ct := biscuit_get_active_count(); tombstone_ct := biscuit_get_tombstone_count(); free_ct := biscuit_get_free_slots(); memory_val := (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC; query_ct := (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT; insert_ct := (regexp_matches(biscuit_index_status(), 'Inserts: ([0-9]+)'))[1]::BIGINT; update_ct := (regexp_matches(biscuit_index_status(), 'Updates: ([0-9]+)'))[1]::BIGINT; delete_ct := (regexp_matches(biscuit_index_status(), 'Deletes: ([0-9]+)'))[1]::BIGINT; RETURN QUERY SELECT 'Active Records'::TEXT, active_ct::NUMERIC, 'records'::TEXT, CASE WHEN active_ct > 0 THEN 'OK' ELSE 'WARNING' END; RETURN QUERY SELECT 'Tombstones', tombstone_ct::NUMERIC, 'records', CASE WHEN tombstone_ct < 5000 THEN 'OK' WHEN tombstone_ct < 10000 THEN 'WARNING' ELSE 'CRITICAL' END; RETURN QUERY SELECT 'Free Slots', free_ct::NUMERIC, 'slots', 'INFO'; RETURN QUERY SELECT 'Memory Usage', memory_val, 'MB', CASE WHEN memory_val < 100 THEN 'OK' WHEN memory_val < 500 THEN 'WARNING' ELSE 'HIGH' END; RETURN QUERY SELECT 'Total Queries', query_ct::NUMERIC, 'operations', 'INFO'; RETURN QUERY SELECT 'Total Inserts', insert_ct::NUMERIC, 'operations', 'INFO'; RETURN QUERY SELECT 'Total Updates', update_ct::NUMERIC, 'operations', 'INFO'; RETURN QUERY SELECT 'Total Deletes', delete_ct::NUMERIC, 'operations', 'INFO'; END; $ LANGUAGE plpgsql; -- Use in dashboard SELECT * FROM biscuit_performance_metrics(); ``` **Output:** ``` metric | value | unit | status --------------------+-------+------------+-------- Active Records | 998567| records | OK Tombstones | 856 | records | OK Free Slots | 1433 | slots | INFO Memory Usage | 18.3 | MB | OK Total Queries | 45892 | operations | INFO Total Inserts | 1000000| operations| INFO Total Updates | 125678| operations | INFO Total Deletes | 1433 | operations | INFO ``` --- ### Complete Production Example **Full Setup for Production Environment:** ```sql -- 1. Create schema CREATE SCHEMA production; SET search_path TO production; -- 2. Create table CREATE TABLE products ( id SERIAL PRIMARY KEY, sku VARCHAR(100) UNIQUE NOT NULL, name TEXT NOT NULL, description TEXT, category VARCHAR(100), price NUMERIC(12, 2), stock_level INTEGER DEFAULT 0, created_at TIMESTAMPTZ DEFAULT NOW(), updated_at TIMESTAMPTZ DEFAULT NOW() ); -- 3. Add indexes CREATE INDEX idx_products_category ON products(category); CREATE INDEX idx_products_price ON products(price); -- 4. Setup BISCUIT SELECT biscuit_setup('products', 'name', 'id'); -- 5. Verify setup SELECT biscuit_index_status(); -- 6. Insert sample data INSERT INTO products (sku, name, category, price, stock_level) SELECT 'SKU-' || generate_series(1, 100000), 'Product ' || generate_series(1, 100000), (ARRAY['Electronics', 'Furniture', 'Clothing'])[floor(random() * 3 + 1)], (random() * 1000)::NUMERIC(12, 2), floor(random() * 100)::INTEGER; -- 7. Test queries \timing on SELECT * FROM biscuit_match('%Product 1234%'); SELECT biscuit_match_count('%Electronics%'); \timing off -- 8. Setup monitoring CREATE TABLE biscuit_metrics_log ( logged_at TIMESTAMPTZ DEFAULT NOW(), active_count INTEGER, tombstone_count INTEGER, memory_mb NUMERIC, query_count BIGINT ); -- 9. Schedule monitoring (via pg_cron) SELECT cron.schedule( 'biscuit-metrics', '*/5 * * * *', -- Every 5 minutes $INSERT INTO production.biscuit_metrics_log SELECT NOW(), biscuit_get_active_count(), biscuit_get_tombstone_count(), (regexp_matches(biscuit_index_status(), 'Memory: ([0-9.]+)'))[1]::NUMERIC, (regexp_matches(biscuit_index_status(), 'Queries: ([0-9]+)'))[1]::BIGINT$ ); -- 10. Create API views CREATE VIEW product_search AS SELECT id, sku, name, category, price FROM products WHERE id IN (SELECT id FROM biscuit_match('%')); -- Template view ``` --- ## License PostgreSQL License (similar to MIT/BSD) Copyright (c) 2024 BISCUIT Contributors Permission to use, copy, modify, and distribute this software and its documentation for any purpose, without fee, and without a written agreement is hereby granted, provided that the above copyright notice and this paragraph and the following two paragraphs appear in all copies. --- ## Version History **v1.0.0** (Current) - Initial release - Roaring Bitmap integration - Lazy deletion with tombstones - Incremental updates - O(1) hash table PK lookup - Automatic trigger-based CRUD - Comprehensive monitoring --- ## Contributors BISCUIT is developed and maintained by [Sivaprasad Murali](https://linkedin.com/in/sivaprasad-murali) . --- ## Support and Contact **Issues:** https://github.com/crystallinecore/biscuit/issues **Discussions:** https://github.com/crystallinecore/biscuit/discussions --- **When pg_trgm feels half-baked, grab a BISCUIT 🍪** ---