Compares `re2` throughput against PostgreSQL's built-in [POSIX regex (ARE)](https://www.postgresql.org/docs/current/functions-matching.html).

![Benchmark](graph.png)

| Category       | re2                       | builtin                       |
| -------------- | ------------------------- | ----------------------------- |
| match          | `re2match`                | `regexp_like`                 |
| extract        | `re2extract`              | `regexp_substr`               |
| extract all    | `re2extractall`           | `regexp_matches(…, 'g')`      |
| replace one    | `re2replaceregexpone`     | `regexp_replace`              |
| replace all    | `re2replaceregexpall`     | `regexp_replace(…, 'g')`      |
| count matches  | `re2countmatches`         | `regexp_count`                |

Patterns span literal, character class, alternation, nested quantifier, IP /
email validation, deep alternation, and a ReDoS-shaped `(e?){10}e{10}` case.
Both RE2 (automaton) and PG (ARE) handle the last one without catastrophic
backtracking.
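
To make the shape of that last pattern concrete, here is a minimal sketch that builds it for a given `n` together with the subject string that makes a backtracking engine work hardest. The helper name `redos_case` is hypothetical and not part of this repository; Python's `re` is used only because it is a backtracking engine, and `n` is kept at 10 so the match still completes quickly.

```python
import re


def redos_case(n):
    """Return (pattern, subject) for the (e?){n}e{n} shape."""
    # n optional 'e's followed by n mandatory 'e's, matched against exactly n 'e's:
    # a backtracking engine must retry the optional part before the tail can match.
    pattern = f"(e?){{{n}}}e{{{n}}}"
    subject = "e" * n
    return pattern, subject


pattern, subject = redos_case(10)
print(pattern)
print(re.fullmatch(pattern, subject) is not None)
```

In a backtracking engine the work for this shape grows exponentially with `n`; an automaton-based engine processes the same input in time linear in its length.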

Data is 10000 rows of:

- `email` ~40 chars
- `logline` ~200 chars
- `longtext` ~2000 chars (400 words)

Index scans
-----------

`re2` also speeds up `re2match` through two index mechanisms (see [Index
Support]). These queries compare each against the equivalent PostgreSQL index
scan over a separate 100000-row table.

| Mechanism           | re2                          | postgres                     |
| ------------------- | ---------------------------- | ---------------------------- |
| b-tree prefix range | `re2match(col, '^lit')`      | `col ~ '^lit'`               |
| GIN trigram         | `col @~ pat` (`gin_re2_ops`) | `col ~ pat` (`gin_trgm_ops`) |
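
The idea behind a b-tree prefix range can be sketched as follows: an anchored pattern with a literal prefix is turned into a half-open range `[low, high)` that the index can scan. This is an illustration only, not the code of `re2` or PostgreSQL; the helper names are hypothetical, and the sketch assumes ASCII literals and byte-wise ordering (as with `text_pattern_ops`).

```python
from typing import Optional, Tuple

# Characters that end the literal prefix in this simplified view.
META = set(".[]()*+?{}|\\^$")


def literal_prefix(pattern: str) -> Optional[str]:
    """Return the literal prefix of an anchored pattern, or None if there is none."""
    # Conservative: give up on unanchored patterns and on any alternation.
    if not pattern.startswith("^") or "|" in pattern:
        return None
    prefix = []
    for ch in pattern[1:]:
        if ch in META:
            # A quantifier that allows zero repeats makes the previous char optional.
            if ch in "*?{" and prefix:
                prefix.pop()
            break
        prefix.append(ch)
    return "".join(prefix) or None


def prefix_range(pattern: str) -> Optional[Tuple[str, str]]:
    """Return (low, high) with low <= value < high for every possible match."""
    prefix = literal_prefix(pattern)
    if prefix is None:
        return None
    # Upper bound: bump the last character (assumes it is not the maximum code point).
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high


print(prefix_range("^user5"))        # ('user5', 'user6')
print(prefix_range("^user12[0-9]"))  # ('user12', 'user13')
```

The range only narrows the candidate rows; the full pattern still has to be rechecked on each candidate, since `^user12[0-9]` does not match the bare value `user12`.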

![Index benchmark](graph_index.png)

| Category | Pattern                       | rows  | re2     | postgres | re2 vs postgres |
| -------- | ----------------------------- | ----- | ------- | -------- | --------------- |
| btree    | `^user5`                      | 11111 | 1.8 ms  | 3.5 ms   | 1.9x faster     |
| btree    | `^user12[0-9]`                | 1110  | 0.21 ms | 0.43 ms  | 2.0x faster     |
| gin      | `error_code=123`              | 100   | 3.3 ms  | 3.6 ms   | 1.1x faster     |
| gin      | `error_code=(100\|200\|300)`  | 301   | 3.5 ms  | 4.9 ms   | 1.4x faster     |

The two GIN opclasses extract keys differently. `pg_trgm` builds trigrams from
alphanumeric words only (never spanning `_`, `=`, …) and prunes extracted
trigrams under a fixed penalty budget tuned for natural-language text;
`gin_re2_ops` keeps every byte trigram of each literal atom that RE2's
`FilteredRE2` requires. On punctuated machine-text patterns (e.g.
`error_code=42[0-9]` over loglines where `error_code=` appears in every row),
pruning can leave `pg_trgm` with only ubiquitous trigrams, so its scan
degenerates to a full-index scan, while `gin_re2_ops` stays selective and runs
an order of magnitude faster. On plain-word patterns both extract similar keys
and `pg_trgm`'s cheaper consistent check can close the gap or win (see
`error_code=123` above, where the two are nearly tied).
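
The difference in key extraction can be illustrated with a deliberately simplified sketch. It is not the algorithm of either opclass: real `pg_trgm` also pads and lowercases words and applies its penalty budget, and `gin_re2_ops` works from the atoms reported by `FilteredRE2`. The function names are hypothetical; the literal is the fixed part of the `error_code=42[0-9]` example.

```python
import re


def word_trigrams(literal):
    """Trigrams taken inside alphanumeric runs only (simplified word-based view)."""
    grams = set()
    for word in re.findall(r"[A-Za-z0-9]+", literal):
        # Runs shorter than three characters contribute nothing in this sketch.
        grams.update(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def byte_trigrams(literal):
    """Every consecutive 3-byte window of the literal, punctuation included."""
    data = literal.encode("utf-8")
    return {data[i:i + 3].decode("utf-8", "replace") for i in range(len(data) - 2)}


literal = "error_code=42"
print(sorted(word_trigrams(literal)))
print(sorted(byte_trigrams(literal)))
```

In this simplified view the word-based set contains only trigrams from `error` and `code`, which every row shares, while the byte-based set also contains `e=4` and `=42`, the keys that actually distinguish matching rows.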

Methodology
-----------

- JIT and query parallelism are disabled so the comparison reflects single-thread engine throughput
- `gen_graph.py` takes the median time per (pattern, engine) across all iterations
- Index scans use a `text_pattern_ops` b-tree and two GIN indexes on one table;
  `enable_seqscan` is off there so both engines are measured on their index
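
The median aggregation described above can be sketched as follows. This is not the actual `gen_graph.py`; the function name, the tuple layout, and the sample timings are hypothetical and serve only to show the grouping.

```python
from collections import defaultdict
from statistics import median


def median_times(samples):
    """Group (pattern, engine, milliseconds) samples and take the median of each group."""
    grouped = defaultdict(list)
    for pattern, engine, ms in samples:
        grouped[(pattern, engine)].append(ms)
    return {key: median(values) for key, values in grouped.items()}


# Hypothetical timings from five iterations of one pattern (not measured data).
samples = [
    ("^user5", "re2", 1.9), ("^user5", "re2", 1.8), ("^user5", "re2", 2.4),
    ("^user5", "re2", 1.7), ("^user5", "re2", 1.8),
    ("^user5", "postgres", 3.5), ("^user5", "postgres", 3.4), ("^user5", "postgres", 5.1),
    ("^user5", "postgres", 3.6), ("^user5", "postgres", 3.5),
]
for (pattern, engine), ms in sorted(median_times(samples).items()):
    print(pattern, engine, ms)
```

The median is less sensitive than the mean to a single slow iteration, such as one that hits a cold cache.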

Running
-------

Requires `re2` (see [README]) and PostgreSQL 15+ for the built-in comparisons (`regexp_like`, `regexp_substr`, and `regexp_count` were added in PostgreSQL 15).
The index-scan section additionally needs the `pg_trgm` contrib extension;
`setup.sql` creates it.

Connection uses libpq environment variables; override the `psql` binary with
`PSQL`:

``` sh
PGDATABASE=mydb ./run_bench.sh        # 5 iterations (default)
PGDATABASE=mydb ./run_bench.sh 10     # 10 iterations
./gen_graph.py                        # regenerate graph.png & graph_index.png
```

  [README]: ../README.md
  [Index Support]: ../doc/re2.md#index-support