# Biscuit - High-Performance Pattern Matching Index for PostgreSQL

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PostgreSQL: 12+](https://img.shields.io/badge/PostgreSQL-16%2B-blue.svg)](https://www.postgresql.org/)

**Biscuit** is a specialized PostgreSQL index access method (IAM) designed for blazing-fast pattern matching on `LIKE`  queries, with native support for multi-column searches. It eliminates the recheck overhead of trigram indexes while delivering significant performance improvements on wildcard-heavy queries. It stands for _Bitmap Indexed Searching with Comprehensive Union and Intersection Techniques_.

---

**What's new?**

**v2.0.1 (2024-12-06)**

* **Fixed incorrect results** when using multiple `LIKE` / `NOT LIKE` predicates on the same column.
* Root cause: global (not per-predicate) bitmap inversion.
* Fix: correct per-predicate bitmap handling; all multi-filter patterns now return accurate results.
* Added full, efficient support for the `NOT LIKE` operator.
* Backward-compatible; recommended to update and re-verify critical queries.

---

##  **Installation**

### **Requirements**
- Build tools: `gcc`, `make`, `pg_config`
- Optional: CRoaring library for enhanced performance

### **From Source**

```bash
# Clone repository
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Build and install
make
sudo make install

# Enable in PostgreSQL
psql -d your_database -c "CREATE EXTENSION biscuit;"
```

### **From PGXN**

```bash
pgxn install biscuit
psql -d your_database -c "CREATE EXTENSION biscuit;"
```

---

##  **Quick Start**

### **Basic Usage**

```sql
-- Create a Biscuit index
CREATE INDEX idx_users_name ON users USING biscuit(name);

-- Query with wildcard patterns
SELECT * FROM users WHERE name LIKE '%john%';
SELECT * FROM users WHERE name NOT LIKE 'a%b%c';
SELECT COUNT(*) FROM users WHERE name LIKE '%test%';
```

### **Multi-Column Indexes**

```sql
-- Create multi-column index
CREATE INDEX idx_products_search 
ON products USING biscuit(name, description, category);

-- Multi-column query (optimized automatically)
SELECT * FROM products 
WHERE name LIKE '%widget%' 
  AND description LIKE '%blue%'
  AND category LIKE 'electronics%'
ORDER BY score DESC 
LIMIT 10;
```

### **Supported Data Types**

Biscuit automatically converts various types to searchable text:

```sql
-- Text types (native)
CREATE INDEX ON logs USING biscuit(message);

-- Numeric types (converted to sortable strings)
CREATE INDEX ON events USING biscuit(user_id, event_code);

-- Date/Time types (converted to sortable timestamps)
CREATE INDEX ON orders USING biscuit(order_date, customer_name);

-- Boolean (converted to 't'/'f')
CREATE INDEX ON flags USING biscuit(is_active, status);
```

---

##  **How It Works**

### **Core Concept: Bitmap Position Indices**

Biscuit builds **two types of character-position bitmaps** for every string:

#### **1. Positive Indices (Forward)**
Tracks which records have character `c` at position `p`:

```
String: "hello"
Bitmaps:
  h@0 → {record_ids...}
  e@1 → {record_ids...}
  l@2 → {record_ids...}
  l@3 → {record_ids...}
  o@4 → {record_ids...}
```

#### **2. Negative Indices (Backward)**
Tracks which records have character `c` at position `-p` from the end:

```
String: "hello"
Bitmaps:
  o@-1 → {record_ids...}  (last char)
  l@-2 → {record_ids...}  (second to last)
  l@-3 → {record_ids...}
  e@-4 → {record_ids...}
  h@-5 → {record_ids...}
```

#### **3. Length Bitmaps**
Two types for fast length filtering:
- **Exact length**: `length[5]` → all 5-character strings
- **Minimum length**: `length_ge[3]` → all strings ≥ 3 characters

---

### **Pattern Matching Algorithm**

#### **Example: `LIKE '%abc%def'`**

**Step 1: Parse pattern into parts**
```
Parts: ["abc", "def"]
Starts with %: YES
Ends with %: NO
```

**Step 2: Match first part as prefix**
```sql
-- "abc" must start at position 0
Candidates = pos[a@0] ∩ pos[b@1] ∩ pos[c@2]
```

**Step 3: Match last part at end (negative indexing)**
```sql
-- "def" must end at string end
Candidates = Candidates ∩ neg[f@-1] ∩ neg[e@-2] ∩ neg[d@-3]
```

**Step 4: Apply length constraint**
```sql
-- String must be at least 6 chars (abc + def)
Candidates = Candidates ∩ length_ge[6]
```

**Result: Exact matches, zero false positives**

---

### **Why It's Fast**

#### **1. Pure Bitmap Operations**
```c
// Traditional approach (pg_trgm)
for each trigram in pattern:
    candidates = scan_trigram_index(trigram)
    for each candidate:
        if !heap_fetch_and_recheck(candidate):  // SLOW: Random I/O
            remove candidate

// Biscuit approach
for each character at position:
    candidates &= bitmap[char][pos]  // FAST: In-memory AND
// No recheck needed!
```

#### **2. Roaring Bitmaps**
Compressed bitmap representation:
- Sparse data: array of integers
- Dense data: bitset
- Automatic conversion for optimal memory

#### **3. Negative Indexing Optimization**
```sql
-- Pattern: '%xyz'
-- Traditional: Scan all strings, check suffix
-- Biscuit: Direct lookup in neg[z@-1] ∩ neg[y@-2] ∩ neg[x@-3]
```

---

##  **12 Performance Optimizations**

### **1. Skip Wildcard Intersections**
```c
// Pattern: "a_c" (underscore = any char)
// OLD: Intersect all 256 chars at position 1
// NEW: Skip position 1 entirely, only check a@0 and c@2
```

### **2. Early Termination on Empty**
```c
result = bitmap[a][0];
result &= bitmap[b][1];
if (result.empty()) return empty;  // Don't process remaining chars
```

### **3. Avoid Redundant Bitmap Copies**
```c
// OLD: Copy bitmap for every operation
// NEW: Operate in-place, copy only when branching
```

### **4. Optimized Single-Part Patterns**
Fast paths for common cases:
- **Exact**: `'abc'` → Check position 0-2 and length = 3
- **Prefix**: `'abc%'` → Check position 0-2 and length ≥ 3
- **Suffix**: `'%xyz'` → Check negative positions -3 to -1
- **Substring**: `'%abc%'` → Check all positions, OR results

### **5. Skip Unnecessary Length Operations**
```c
// Pure wildcard patterns
if (pattern == "%%%___%%")  // 3 underscores
    return length_ge[3];     // No character checks needed!
```

### **6. TID Sorting for Sequential Heap Access**
```c
// Sort TIDs by (block_number, offset) before returning
// Converts random I/O into sequential I/O
// Uses radix sort for >5000 TIDs, quicksort for smaller sets
```

### **7. Batch TID Insertion**
```c
// For bitmap scans, insert TIDs in chunks
for (i = 0; i < num_results; i += 10000) {
    tbm_add_tuples(tbm, &tids[i], batch_size, false);
}
```

### **8. Direct Roaring Iteration**
```c
// OLD: Convert bitmap to array, then iterate
// NEW: Direct iterator, no intermediate allocation
roaring_uint32_iterator_t *iter = roaring_create_iterator(bitmap);
while (iter->has_value) {
    process(iter->current_value);
    roaring_advance_uint32_iterator(iter);
}
```


### **9. Batch Cleanup on Threshold**
```c
// After 1000 deletes, clean tombstones from all bitmaps
if (tombstone_count >= 1000) {
    for each bitmap:
        bitmap &= ~tombstones;  // Batch operation
    tombstones.clear();
}
```

### **10. Aggregate Query Detection**
```c
// COUNT(*), EXISTS, etc. don't need sorted TIDs
if (!scan->xs_want_itup) {
    skip_sorting = true;  // Save sorting time
}
```

### **11. LIMIT-Aware TID Collection**
```c
// If LIMIT 10 in query, don't collect more than needed
if (limit_hint > 0 && collected >= limit_hint)
    break;  // Early termination
```

### **12. Multi-Column Query Optimization**

#### **Predicate Reordering**
Analyzes each column's pattern and executes in order of selectivity:

```sql
-- Query:
WHERE name LIKE '%common%'           -- Low selectivity
  AND sku LIKE 'PROD-2024-%'         -- High selectivity (prefix)
  AND description LIKE '%rare_word%' -- Medium selectivity

-- Execution order (Biscuit automatically reorders):
1. sku LIKE 'PROD-2024-%'         (PREFIX, priority=20, selectivity=0.02)
2. description LIKE '%rare_word%' (SUBSTRING, priority=35, selectivity=0.15)
3. name LIKE '%common%'           (SUBSTRING, priority=55, selectivity=0.60)
```

**Selectivity scoring formula:**
```
score = 1.0 / (concrete_chars + 1)
      - (underscore_count × 0.05)
      + (partition_count × 0.15)
      - (anchor_strength / 200)
```

**Priority tiers:**
1. **0-10**: Exact matches, many underscores
2. **10-20**: Non-% patterns with underscores
3. **20-30**: Strong anchored patterns (prefix/suffix)
4. **30-40**: Weak anchored patterns
5. **40-50**: Multi-partition patterns
6. **50-60**: Substring patterns (lowest priority)

---

##  **Benchmarking**

### **Setup Test Data**

```sql
-- Create 1M row test table
CREATE TABLE benchmark (
    id SERIAL PRIMARY KEY,
    name TEXT,
    description TEXT,
    category TEXT,
    score FLOAT
);

INSERT INTO benchmark (name, description, category, score)
SELECT 
    'Name_' || md5(random()::text),
    'Description_' || md5(random()::text),
    'Category_' || (random() * 100)::int,
    random() * 1000
FROM generate_series(1, 1000000);

-- Create indexes
CREATE INDEX idx_trgm ON benchmark 
    USING gin(name gin_trgm_ops, description gin_trgm_ops);

CREATE INDEX idx_biscuit ON benchmark 
    USING biscuit(name, description, category);

ANALYZE benchmark;
```

### **Run Benchmarks**

```sql
-- Single column, simple pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark WHERE name LIKE '%abc%' LIMIT 100;

-- Multi-column, complex pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE name LIKE '%a%b' 
  AND description LIKE '%bc%cd%'
ORDER BY score DESC 
LIMIT 10;

-- Aggregate query (COUNT)
EXPLAIN ANALYZE
SELECT COUNT(*) FROM benchmark 
WHERE name LIKE 'a%l%' 
  AND category LIKE 'f%d';

-- Complex multi-part pattern
EXPLAIN ANALYZE
SELECT * FROM benchmark 
WHERE description LIKE 'u%dc%x'
LIMIT 50;
```

### **View Index Statistics**

```sql
-- Show internal statistics
SELECT biscuit_index_stats('idx_biscuit'::regclass);
```

**Output:**
```
----------------------------------------------------
 Biscuit Index Statistics (FULLY OPTIMIZED)        +
 ==========================================        +
 Index: idx_biscuit                                +
 Active records: 1000002                           +
 Total slots: 1000002                              +
 Free slots: 0                                     +
 Tombstones: 0                                     +
 Max length: 44                                    +
 ------------------------                          +
 CRUD Statistics:                                  +
   Inserts: 0                                      +
   Updates: 0                                      +
   Deletes: 0                                      +
 ------------------------                          +
 Active Optimizations:                             +
   ✓ 1. Skip wildcard intersections                +
   ✓ 2. Early termination on empty                 +
   ✓ 3. Avoid redundant copies                     +
   ✓ 4. Optimized single-part patterns             +
   ✓ 5. Skip unnecessary length ops                +
   ✓ 6. TID sorting for sequential I/O             +
   ✓ 7. Batch TID insertion                        +
   ✓ 8. Direct bitmap iteration                    +
   ✓ 9. Parallel bitmap scan support               +
   ✓ 10. Batch cleanup on threshold                +
   ✓ 11. Skip sorting for bitmap scans (aggregates)+
   ✓ 12. LIMIT-aware TID collection                +
```

### Results

**> Build time**

| Index       | Command                        | Build Time        |
| ----------- | ------------------------------ | ----------------- |
| **pg_trgm** | `CREATE INDEX idx_trgm ...`    | **20,358.655 ms** |
| **biscuit** | `CREATE INDEX idx_biscuit ...` | **2,734.310 ms**  |

---

##  **Use Cases**

### **1. Full-Text Search Applications**
```sql
-- E-commerce product search
CREATE INDEX idx_products ON products 
    USING biscuit(name, brand, description);

SELECT * FROM products 
WHERE name LIKE '%laptop%' 
  AND brand LIKE 'ABC%'
  AND description LIKE '%gaming%'
ORDER BY price DESC 
LIMIT 20;
```

### **2. Log Analysis**
```sql
-- Search error logs
CREATE INDEX idx_logs ON logs 
    USING biscuit(message, source, level);

SELECT * FROM logs 
WHERE message LIKE '%ERROR%connection%timeout%'
  AND source LIKE 'api.%'
  AND timestamp > NOW() - INTERVAL '1 hour'
LIMIT 100;
```

### **3. Customer Support / CRM**
```sql
-- Search tickets by multiple fields
CREATE INDEX idx_tickets ON tickets 
    USING biscuit(subject, description, customer_name);

SELECT * FROM tickets 
WHERE subject LIKE '%refund%'
  AND customer_name LIKE 'John%'
  AND status = 'open'
ORDER BY created_at DESC;
```

### **4. Code Search / Documentation**
```sql
-- Search code repositories
CREATE INDEX idx_files ON code_files 
    USING biscuit(filename, content, author);

SELECT * FROM code_files 
WHERE filename LIKE '%.py'
  AND content LIKE '%def%parse%json%'
  AND author LIKE 'team-%';
```

### **5. Analytics with Aggregates**
```sql
-- Fast COUNT queries (no sorting overhead)
CREATE INDEX idx_events ON events 
    USING biscuit(event_type, user_agent, referrer);

SELECT COUNT(*) FROM events 
WHERE event_type LIKE 'click%'
  AND user_agent LIKE '%Mobile%'
  AND referrer LIKE '%google%';
```

---

##  **Configuration**

### **Build Options**

Enable CRoaring for better performance:


### **Index Options**

Currently, Biscuit doesn't expose tunable options. All optimizations are automatic.

---

##  **Limitations and Trade-offs**

### **What Biscuit Does NOT Support**

1. **Regular expressions** - Only `LIKE` / `ILIKE` patterns with `%` and `_`
2. **Case-insensitive by default** - Use  PostgreSQL's `ILIKE`  for case-insensitive
3. **Locale-specific collations** - String comparisons are byte-based
4. **Amcanorder = false** - Cannot provide ordered scans directly (but see below)

### **ORDER BY + LIMIT Behavior**

Biscuit doesn't support ordered index scans (`amcanorder = false`), BUT:

**PostgreSQL's planner handles this efficiently:**
```sql
SELECT * FROM table WHERE col LIKE '%pattern%' ORDER BY score LIMIT 10;
```

**Execution plan:**
```
Limit
  -> Sort (cheap, small result set)
    -> Biscuit Index Scan (fast filtering)
```

**Why this works:**
- Biscuit filters candidates extremely fast 
- Result set is small after filtering
- Sorting 100-1000 rows in memory is negligible (<1ms)
- **Net result**: Still much faster than pg_trgm with recheck overhead

### **Memory Usage**

Biscuit stores bitmaps in memory:
- Use `REINDEX` to rebuild if index grows too large

### **Write Performance**

- **INSERT**: Similar to B-tree (must update bitmaps)
- **UPDATE**: Two operations (remove old, insert new)
- **DELETE**: Marks as tombstone, batch cleanup at threshold

---

##  **Comparison with pg_trgm**

| Feature                  | Biscuit               | pg_trgm (GIN)        |
|--------------------------|-----------------------|----------------------|
| **Wildcard patterns**    | ✔ Native, exact       | ✔ Approximate       |
| **Recheck overhead**     | ✔ None (deterministic)  | ✗ Always required   |
| **Multi-column**         | ✔ Optimized           | ⚠️ Via btree_gist    |
| **Aggregate queries**    | ✔ Optimized           | ✗ Same cost         |
| **ORDER BY + LIMIT**     | ✔ Works well          | ✔ Ordered scans     |
| **Memory usage**         | ⚠️ Higher              | ✔ Lower             |
| **Regex support**        | ✗ No                  | ✔ Yes               |
| **Similarity search**    | ✗ No                  | ✔ Yes               |

**When to use Biscuit:**
- ✔ Wildcard-heavy `LIKE` queries (`%`, `_`)
- ✔ Multi-column pattern matching
- ✔ Need exact results (no false positives)
- ✔ `COUNT(*)` / aggregate queries
- ✔ High query volume, can afford memory

**When to use pg_trgm:**
- ✔ Fuzzy/similarity search (`word <-> pattern`)
- ✔ Regular expressions
- ✔ Memory-constrained environments
- ✔ Write-heavy workloads

---

## **Development**

### **Build from Source**

```bash
git clone https://github.com/Crystallinecore/biscuit.git
cd biscuit

# Development build with debug symbols
make clean
CFLAGS="-g -O0 -DDEBUG" make

# Run tests
make installcheck

# Install
sudo make install
```

### **Testing**

```bash
# Unit tests
make installcheck

# Manual testing
psql -d testdb

CREATE EXTENSION biscuit;

-- Create test table
CREATE TABLE test (id SERIAL, name TEXT);
INSERT INTO test (name) VALUES ('hello'), ('world'), ('test');

-- Create index
CREATE INDEX idx_test ON test USING biscuit(name);

-- Test queries
EXPLAIN ANALYZE SELECT * FROM test WHERE name LIKE '%ell%';
```

### **Debugging**

Enable PostgreSQL debug logging:

```sql
SET client_min_messages = DEBUG1;
SET log_min_messages = DEBUG1;

-- Now run queries to see Biscuit's internal logs
SELECT * FROM test WHERE name LIKE '%pattern%';
```

---

##  **Architecture Details**

### **Index Structure**

```
BiscuitIndex
├── num_columns: int
├── column_indices[]: ColumnIndex[]
│   ├── pos_idx[256]: CharIndex    // Forward position bitmaps
│   │   └── entries[]: PosEntry[]
│   │       ├── pos: int
│   │       └── bitmap: RoaringBitmap
│   ├── neg_idx[256]: CharIndex    // Backward position bitmaps
│   ├── char_cache[256]: RoaringBitmap  // Character existence
│   ├── length_bitmaps[]: RoaringBitmap[]  // Exact lengths
│   └── length_ge_bitmaps[]: RoaringBitmap[]  // Min lengths
├── tids[]: ItemPointerData[]      // Record TIDs
├── column_data_cache[][]: char**  // Cached string data
└── tombstones: RoaringBitmap      // Deleted records
```

### **Query Execution Flow**

```
1. biscuit_rescan()
   ├─> Parse LIKE pattern into parts
   ├─> Analyze pattern selectivity (multi-column)
   ├─> Reorder predicates by priority
   └─> For each predicate:
       ├─> biscuit_query_column_pattern()
       │   ├─> Check fast paths (empty, %, pure wildcards)
       │   ├─> Match pattern parts using bitmaps
       │   └─> Return candidate bitmap
       └─> Intersect with previous candidates

2. biscuit_collect_tids_optimized()
   ├─> Detect aggregate vs. regular query
   ├─> Estimate LIMIT hint
   ├─> Collect TIDs from final bitmap
   ├─> Sort if needed (skip for aggregates)
   └─> Apply LIMIT early termination

3. biscuit_gettuple() or biscuit_getbitmap()
   └─> Return results to PostgreSQL executor
```


---

##  **Contributing**

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Make your changes with tests
4. Submit a pull request

### **Areas for Contribution**

- [ ] Add `ILIKE` support (case-insensitive)
- [ ] Implement `amcanorder` for native sorted scans
- [ ] Add statistics collection for better cost estimation
- [ ] Support for more data types (JSON, arrays)
- [ ] Parallel index build
- [ ] Index compression options

---

##  **License**

MIT License - See LICENSE file for details.

---

## **Author**

Sivaprasad Murali
- Email: sivaprasad.off@gmail.com
- GitHub: [@Crystallinecore](https://github.com/Crystallinecore)

---

## **Acknowledgments**

- PostgreSQL community for the extensible index AM framework
- CRoaring library for efficient bitmap operations
- Inspired by the need for faster LIKE query performance in production systems

---

## **Support**

- **Issues**: [GitHub Issues](https://github.com/Crystallinecore/biscuit/issues)
- **Discussions**: [GitHub Discussions](https://github.com/Crystallinecore/biscuit/discussions)

---

**Happy pattern matching! Grab a biscuit 🍪 when pg_trgm feels half-baked!**

---