# storage_engine A PostgreSQL extension providing two high-performance Table Access Methods designed for analytical and HTAP workloads. - **`colcompress`** — column-oriented compressed storage with vectorized execution and parallel scan - **`rowcompress`** — row-oriented batch-compressed storage with parallel scan Both AMs coexist alongside standard heap tables in the same database. All catalog objects are isolated in the `engine` schema, making the extension safe to install alongside `citus_columnar` or any other columnar extension (all exported C symbols carry the `se_` prefix to avoid linker conflicts). > **Lineage:** `storage_engine` is a fork of [Hydra Columnar](https://github.com/hydradatabase/hydra) (itself derived from [citus_columnar](https://github.com/citusdata/citus)), extended with `rowcompress`, full DELETE/UPDATE support, stripe-level min/max pruning, and a redesigned parallel scan. The MergeTree-style `orderby` option and zone-map pruning are directly inspired by [ClickHouse](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree). --- ## Table of Contents - [Quick Start](#quick-start) - [colcompress AM](#colcompress-am) - [How It Works](#how-it-works) - [Column-Level Caching](#column-level-caching) - [Vectorized Execution](#vectorized-execution) - [Parallel Scan](#parallel-scan) - [Stripe-Level and Chunk-Level Min/Max Pruning](#stripe-level-and-chunk-level-minmax-pruning) - [Index-Backed Scan](#index-backed-scan) - [DELETE and UPDATE Support](#delete-and-update-support) - [ON CONFLICT / Upserts](#on-conflict--upserts) - [MergeTree-Like Ordering and colcompress_merge](#mergetree-like-ordering-and-colcompress_merge) - [Compression Options](#compression-options) - [Configuration GUCs](#configuration-gucs) - [Per-Table Options](#per-table-options) - [rowcompress AM](#rowcompress-am) - [How It Works](#how-it-works-1) - [Parallel Scan](#parallel-scan-1) - [Per-Table Options](#per-table-options-1) - [Management Functions](#management-functions) - [Catalog Views](#catalog-views) - [Installation](#installation) - [PostgreSQL Version Compatibility](#postgresql-version-compatibility) - [Known Limitations](#known-limitations) - [Benchmarks](#benchmarks) - [Attribution](#attribution) --- ## Quick Start ```sql CREATE EXTENSION storage_engine; -- Column-oriented analytics table CREATE TABLE events ( ts timestamptz NOT NULL, user_id bigint, event_type text, value float8 ) USING colcompress; -- Row-oriented compressed table (good for append-heavy workloads) CREATE TABLE logs ( id bigserial, logged_at timestamptz NOT NULL, message text ) USING rowcompress; -- Insert normally INSERT INTO events SELECT now(), g, 'click', random() FROM generate_series(1, 1000000) g; -- Query normally — column projection, vectorized execution, -- and parallel scan are transparent and automatic. SELECT event_type, count(*), avg(value) FROM events WHERE ts > now() - interval '1 day' GROUP BY 1; ``` --- ## colcompress AM ### How It Works Data is stored **column by column** on disk. Each column is split into *stripes* (default 150,000 rows each), and each stripe is further divided into *chunk groups* (default 10,000 rows). Every chunk records the minimum and maximum value of its column, enabling zone-map pruning at query time. ``` Table file ├── Stripe 1 (rows 1 – 150,000) │ ├── Chunk group 0 (rows 1 – 10,000) │ │ ├── Column A [min, max, compressed values…] │ │ ├── Column B [min, max, compressed values…] │ │ └── … │ └── Chunk group 1 (rows 10,001 – 20,000) … └── Stripe 2 (rows 150,001 – 300,000) … ``` A scan only reads the columns referenced by the query, skipping all others entirely. This dramatically reduces I/O for wide tables with selective column access patterns. ### Column-Level Caching The AM maintains an in-memory **column cache** that stores decompressed column chunks across executor iterations. When the same stripe region is accessed more than once (nested loops, repeated plans, self-joins), the decompressed data is served from cache without re-reading or re-decompressing the file. ```sql SET storage_engine.enable_column_cache = on; -- default: on ``` ### Vectorized Execution `colcompress` ships a **vectorized expression evaluation engine** that processes WHERE clauses and aggregates in column-oriented batches of up to 10,000 values per call, instead of row-at-a-time evaluation. This maps naturally onto column chunks and eliminates per-row interpreter overhead. Supported vectorized operations: | Category | Types | |---|---| | Comparison operators (`=`, `<>`, `<`, `<=`, `>`, `>=`) | `int2`, `int4`, `int8`, `float4`, `float8`, `date`, `timestamp`, `timestamptz`, `char`, `bpchar`, `text`, `varchar`, `name`, `bool`, `oid` | | Aggregates (`count`, `sum`, `avg`, `max`, `min`) | `int2`, `int4`, `int8`, `float8`, `date`, `timestamptz` | ```sql SET storage_engine.enable_vectorization = on; -- default: on ``` ### Parallel Scan The AM implements the full PostgreSQL parallel Table AM protocol using **Dynamic Shared Memory (DSM)**. The coordinator divides the stripe range across workers; each worker independently reads and decompresses its assigned stripes, then feeds results back through the Gather node. Parallel scan stacks on top of vectorized execution — each worker runs its own vectorized evaluation pipeline independently. ```sql SET storage_engine.enable_parallel_execution = on; -- default: on SET storage_engine.min_parallel_processes = 8; -- minimum workers (default: 8) -- Standard PostgreSQL parallel knobs also apply: SET max_parallel_workers_per_gather = 4; SET parallel_setup_cost = 0; SET parallel_tuple_cost = 0; ``` ### Stripe-Level and Chunk-Level Min/Max Pruning `colcompress` implements **two layers** of zone-map pruning using the `minimum_value` / `maximum_value` statistics stored per chunk in `engine.chunk`. **Stripe-level pruning (coarse)** — Before reading any data, the scan aggregates the min/max across all chunks of each stripe and tests the resulting stripe-wide range against the query's WHERE predicates using PostgreSQL's `predicate_refuted_by`. Any stripe whose range is provably disjoint from the predicate is skipped entirely — no decompression, no I/O. The number of stripes removed this way is reported in `EXPLAIN`: ```sql EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM events WHERE ts > '2025-01-01'; -- Custom Scan (ColumnarScan) on events ... -- Engine Stripes Removed by Pruning: 41 -- Engine Stripes Read: 12 ``` Stripe pruning is most effective after `engine.colcompress_merge()` has established a global sort order, but it also works on any partially sorted data. **Chunk-level pruning (fine)** — Within each stripe that survives the coarse filter, the custom scan evaluates each individual chunk group's min/max range against the same predicates. Chunk groups whose range cannot satisfy the predicate are skipped. The two layers compose: a query on a large, well-sorted table typically eliminates entire stripes before touching them, then further prunes chunk groups within the remaining stripes, resulting in very small I/O amplification even without an index. Pruning effectiveness scales directly with data sortedness. Use the `orderby` option combined with `engine.colcompress_merge()` to establish global sort order and maximize pruning at both levels. ### Index-Backed Scan A custom index scan path allows B-tree and other indexes to drive lookups into a `colcompress` table, decompressing only the rows matched by the index. ```sql -- Session GUC (applies to all colcompress tables in the session) SET storage_engine.enable_columnar_index_scan = on; -- default: off -- Per-table override (persisted in engine.col_options, survives reconnect) SELECT engine.alter_colcompress_table_set('documents'::regclass, index_scan => true); ``` The right default depends on the access pattern: | Workload | Recommendation | |---|---| | Analytical (aggregates, range scans, high column count) | Keep `off` — the sequential vectorized path + chunk pruning wins | | Document repository (XML, PDF, JSON blobs stored for compression, fetched by primary key or unique identifier) | Set `on` — you want columnar compression with point-lookup speed, not full-table scan | Index scan is enabled for a given query if **either** the session GUC is `on` **or** the table's `index_scan` option is `true`. This lets you keep the GUC `off` globally while enabling it selectively per table: ```sql -- OLAP tables: keep default (index_scan = false) -- Document table: enable permanently SELECT engine.alter_colcompress_table_set('adm.documents'::regclass, index_scan => true); -- Inspect SELECT table_name, index_scan FROM engine.colcompress_options WHERE table_name = 'adm.documents'; -- table_name | index_scan -- ---------------+----------- -- adm.documents | t ``` For document repositories the combination is compelling: `colcompress` with `zstd` can compress large binary/text documents 3–10×, and with the index scan enabled, retrievals like `WHERE id = $1` or `WHERE document_key = $1` decompress only the matching rows without touching the rest of the stripe. > **When to use `index_scan = true` vs `colcompress_merge()`** > > These are two distinct use cases, not competing strategies: > > | Use case | Strategy | > |---|---| > | **File/document storage** (XML, PDF, JSON blobs — fetched by primary key or unique key) | `index_scan = true`. You want columnar compression for storage savings and point-lookup speed without full-stripe decompression. Sort order is irrelevant; every fetch targets a specific row. | > | **Analytics** (aggregations, date ranges, GROUP BY, pattern scans over millions of rows) | `index_scan = false` + `colcompress_merge()`. Sort the data by the query's filter column (`orderby = 'event_date ASC'`), then merge to produce globally ordered stripes. Stripe-level min/max pruning skips irrelevant stripes entirely before any decompression occurs. | > > Mixing both on the same table is possible but not ideal: a B-tree index on the `orderby` column will cause the planner to prefer `IndexScan` for range queries, disabling stripe pruning (see [Known Limitations](#b-tree-indexes-on-colcompress-disable-stripe-pruning)). If you need occasional point lookups on an analytical table, rely on the GUC `SET storage_engine.enable_columnar_index_scan = on` at session level rather than creating a B-tree index. ### DELETE and UPDATE Support `colcompress` fully supports `DELETE` and `UPDATE` via a **row mask** stored in `engine.row_mask`. Each deleted row is marked as a bit in a per-chunk-group bitmask; the scan engine skips masked rows without rewriting the stripe. `UPDATE` is implemented as a delete-then-insert. ```sql SET storage_engine.enable_dml = on; -- default: on DELETE FROM events WHERE ts < now() - interval '1 year'; UPDATE events SET value = value * 1.1 WHERE event_type = 'purchase'; ``` Deleted rows are reclaimed during `VACUUM`, which rewrites affected stripes and clears the row mask. ### ON CONFLICT / Upserts Standard `INSERT … ON CONFLICT` is fully supported, including `DO NOTHING` and `DO UPDATE SET …`. Requires a unique index on the conflict target column(s). ```sql INSERT INTO events (ts, user_id, event_type, value) VALUES (now(), 42, 'click', 1.0) ON CONFLICT (user_id, event_type) DO UPDATE SET value = EXCLUDED.value; ``` ### MergeTree-Like Ordering and colcompress_merge Inspired by ClickHouse's MergeTree engine, `colcompress` supports a **global sort key** per table. When set, every new stripe written to the table is sorted by the key before compression. The `engine.colcompress_merge()` function rewrites the entire table in a single globally sorted pass, making stripe-level and chunk-level min/max pruning maximally effective across all stripes. ```sql -- Assign a sort key to an existing table SELECT engine.alter_colcompress_table_set( 'events'::regclass, orderby => 'ts ASC, user_id ASC' ); -- After a large INSERT or initial load, compact and globally sort: SELECT engine.colcompress_merge('events'); -- Now WHERE ts BETWEEN ... skips almost all chunks. ``` `colcompress_merge` internally: 1. Copies all live (non-deleted) rows to a temporary heap table 2. Truncates the target table 3. Re-inserts rows in the order defined by `orderby`, writing fresh globally-ordered stripes ### Compression Options Each chunk is compressed independently. Available algorithms: | Name | Description | |---|---| | `none` | No compression | | `lz4` | Fast compression/decompression, moderate ratio | | `zstd` | High ratio, configurable level 1–19 (default level: 3) | | `pglz` | PostgreSQL's built-in LZ variant | ```sql SELECT engine.alter_colcompress_table_set('events', compression => 'zstd', compression_level => 9); ``` ### Configuration GUCs All parameters can be set per-session or globally in `postgresql.conf`. > **GUC availability:** The `storage_engine.*` GUCs are registered when the `storage_engine` shared library is loaded. This happens automatically the first time a `colcompress` or `rowcompress` table is accessed in a session. To make GUCs available immediately in every session (including before any table access), add the extension to `shared_preload_libraries`: > ``` > shared_preload_libraries = 'storage_engine' > ``` > GUC names use the `storage_engine.` prefix and do not overlap with `citus_columnar` (which uses the `columnar.` prefix), so both extensions can be loaded simultaneously without conflict. | Parameter | Type | Default | Description | |---|---|---|---| | `storage_engine.enable_parallel_execution` | bool | `on` | Enable parallel custom scan via DSM | | `storage_engine.min_parallel_processes` | int | `8` | Minimum worker count to launch for parallel scan | | `storage_engine.enable_vectorization` | bool | `on` | Enable vectorized WHERE/aggregate evaluation | | `storage_engine.enable_custom_scan` | bool | `on` | Enable projection and qual-pushdown custom scan | | `storage_engine.enable_column_cache` | bool | `on` | Enable in-memory column chunk cache | | `storage_engine.enable_columnar_index_scan` | bool | `off` | Enable index-driven columnar scan path (recommended `on` for document/point-lookup repositories) | | `storage_engine.enable_dml` | bool | `on` | Allow DELETE and UPDATE on colcompress tables | | `storage_engine.stripe_row_limit` | int | `150000` | Default rows per stripe (1,000 – 100,000,000) | | `storage_engine.chunk_group_row_limit` | int | `10000` | Default rows per chunk group (1,000 – 100,000,000) | | `storage_engine.compression_level` | int | `3` | Default compression level for zstd (1 – 19) | ### Per-Table Options Per-table options override the session GUCs for a specific table: ```sql SELECT engine.alter_colcompress_table_set( 'events'::regclass, chunk_group_row_limit => 10000, -- NULL = leave unchanged stripe_row_limit => 150000, compression => 'zstd', compression_level => 9, orderby => 'ts ASC, user_id ASC', index_scan => false -- true = skip cost penalty for index scans on this table ); -- Reset individual options to system defaults SELECT engine.alter_colcompress_table_reset( 'events'::regclass, compression => true, compression_level => true, index_scan => true -- resets to false ); -- Inspect current options SELECT * FROM engine.colcompress_options WHERE table_name = 'events'; ``` --- ## rowcompress AM ### How It Works `rowcompress` stores rows in fixed-size **batches** (default 10,000 rows per batch). Each batch is serialized using heap tuple format and then compressed as a single unit. Batch metadata (file offset, byte size, first row number, row count) is stored in `engine.row_batch`. ``` Table file (ColumnarStorage layout) ├── Batch 1: [header | row offsets | compressed heap tuple data] ├── Batch 2: [header | row offsets | compressed heap tuple data] └── … ``` This AM suits **append-heavy workloads where compression matters but column projection is not needed** — event logs, audit trails, time-series with many columns all queried together. Typical storage savings of 2–10× with zstd. Compared to `colcompress`: - Reads full rows (no column projection) - Lower write latency per row (no columnar transposition) - No vectorized execution or chunk-level pruning - Supports parallel reads and index scans - Supports multiple compression algorithms ### Parallel Scan `rowcompress` implements the PostgreSQL parallel scan protocol via **atomic batch claiming**. Each parallel worker atomically increments a shared counter to claim the next unprocessed batch, decompresses it, and repeats. There is no coordinator or work distribution step — workers self-schedule in a work-stealing fashion with zero contention on most paths. ```sql -- Standard PostgreSQL parallel knobs apply SET max_parallel_workers_per_gather = 4; ``` ### Per-Table Options ```sql SELECT engine.alter_rowcompress_table_set( 'logs'::regclass, batch_size => 10000, -- rows per compressed batch (default: 10000) compression => 'zstd', compression_level => 5 ); -- Reset to defaults SELECT engine.alter_rowcompress_table_reset('logs'::regclass, compression => true); -- Rewrite all batches with current options (e.g. after changing compression) SELECT engine.rowcompress_repack('logs'); -- Inspect SELECT * FROM engine.rowcompress_options WHERE table_name = 'logs'; SELECT * FROM engine.rowcompress_batches LIMIT 10; ``` --- ## Management Functions | Function | Description | |---|---| | `engine.alter_colcompress_table_set(regclass, ...)` | Set one or more options on a colcompress table | | `engine.alter_colcompress_table_reset(regclass, ...)` | Reset colcompress options to system defaults | | `engine.colcompress_merge(regclass)` | Rewrite and globally sort a colcompress table by its `orderby` key | | `engine.colcompress_repack(regclass)` | Alias for `colcompress_merge`; drop-in replacement for `pg_repack` on colcompress tables | | `engine.alter_rowcompress_table_set(regclass, ...)` | Set one or more options on a rowcompress table | | `engine.alter_rowcompress_table_reset(regclass, ...)` | Reset rowcompress options to system defaults | | `engine.rowcompress_repack(regclass)` | Rewrite all batches of a rowcompress table with current options | --- ## Catalog Views | View | Description | |---|---| | `engine.colcompress_options` | Per-table options for all colcompress tables | | `engine.colcompress_stripes` | Stripe-level metadata (offset, size, row range) per table | | `engine.rowcompress_options` | Per-table options for all rowcompress tables | | `engine.rowcompress_batches` | Batch-level metadata for all rowcompress tables | All views grant `SELECT` to `PUBLIC`. --- ## Installation ### Build from source Requires PostgreSQL server headers and `pg_config` in `PATH`. ```bash cd dist/ sudo make -j$(nproc) install ``` Add to `postgresql.conf`: ``` shared_preload_libraries = 'storage_engine' ``` #### Loading with citus If `citus` or `pg_cron` is also in `shared_preload_libraries`, the load order matters: ``` shared_preload_libraries = 'pg_cron,citus,storage_engine' ``` `citus` **must appear before** `storage_engine`. PostgreSQL registers planner hooks in load order; citus expects to be the outermost hook in the chain. Reversing the order causes PostgreSQL to refuse to start. Restart PostgreSQL and load the extension: ```sql CREATE EXTENSION storage_engine; ``` ### Docker ```bash docker compose up psql postgres://postgres:password@127.0.0.1:5432 ``` The `engine` schema and both AMs are created automatically on `CREATE EXTENSION`. --- ## Known Limitations ### No AFTER ROW triggers (and no pg_repack) `colcompress` and `rowcompress` tables do not support **AFTER ROW triggers** or **foreign keys**. This is a fundamental architectural constraint: columnar storage does not maintain heap tuples in a form that row-level trigger machinery can inspect. This means **`pg_repack` cannot be used** — it relies internally on an AFTER ROW trigger to capture concurrent changes during its online copy phase: ``` ERROR: Foreign keys and AFTER ROW triggers are not supported for columnar tables DETAIL: Tools such as pg_repack use AFTER ROW triggers internally and cannot be used with colcompress or rowcompress tables. HINT: Use engine.colcompress_repack(table) as a drop-in replacement for pg_repack on colcompress tables. ``` Use `engine.colcompress_repack()` instead: ```sql -- Drop-in replacement for: pg_repack -t mytable SELECT engine.colcompress_repack('adm.documents'::regclass); -- Repack all colcompress tables in the database: DO $$ DECLARE r record; BEGIN FOR r IN SELECT n.nspname, c.relname FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace JOIN pg_am a ON c.relam = a.oid WHERE a.amname = 'colcompress' AND c.relkind = 'r' ORDER BY pg_total_relation_size(c.oid) DESC LOOP RAISE NOTICE 'Repacking %.%...', r.nspname, r.relname; PERFORM engine.colcompress_repack( (quote_ident(r.nspname) || '.' || quote_ident(r.relname))::regclass); END LOOP; END; $$; ``` > **Limitation vs. pg_repack:** `colcompress_repack` acquires `AccessExclusiveLock` for the duration of the operation (reads and writes are blocked). There is no online/concurrent mode. Schedule during a maintenance window for large tables. ### AFTER STATEMENT triggers are supported Only row-level (`FOR EACH ROW`) AFTER triggers are blocked. Statement-level (`FOR EACH STATEMENT`) AFTER triggers work fine. ### sort-on-write is disabled when indexes exist When a `colcompress` table has both an `orderby` option and one or more B-tree indexes, the sort-on-write path is automatically disabled during writes. This avoids a TID placeholder problem that would corrupt index entries. The table's stripes are still read in their natural write order; run `engine.colcompress_merge()` (or `colcompress_repack()`) to produce globally sorted stripes after loading data. ### B-tree indexes on colcompress disable stripe pruning When a B-tree index exists on a `colcompress` column, the PostgreSQL planner may choose `IndexScan` even for queries that would benefit from a full sequential scan with stripe pruning. `IndexScan` opens the table with `randomAccess=true`, which bypasses the stripe pruning code path — causing date-range and similar ordered queries to read all stripes instead of only the relevant one. **For analytical tables: do not create B-tree indexes on columns covered by the `orderby` key.** Use GIN indexes for JSONB and array columns, and rely on stripe pruning for range predicates on the sort key. If point-lookup access is occasionally needed on an analytical colcompress table, set `index_scan = false` explicitly and use `engine.colcompress_merge()` to keep the data globally sorted. Examples of problematic and safe index patterns: ```sql -- BAD: B-tree index on the ordery column defeats stripe pruning on range queries CREATE INDEX ON events_col (event_date); -- do NOT do this -- GOOD: GIN indexes for JSONB / array columns are fine CREATE INDEX ON events_col USING gin (metadata jsonb_path_ops); CREATE INDEX ON events_col USING gin (tags); -- Ensure the planner does not choose IndexScan SELECT engine.alter_colcompress_table_set('events_col'::regclass, index_scan => false); ``` ### No CLUSTER support `CLUSTER` (index-ordered physical rewrite) is not implemented for columnar tables. Use `engine.colcompress_merge()` with an `orderby` option to achieve equivalent physical ordering. ### No VACUUM FULL / table rewrite `VACUUM FULL` triggers a table rewrite that is not implemented for colcompress or rowcompress tables. Use `engine.colcompress_repack()` / `engine.rowcompress_repack()` instead, which perform the same compaction without the PostgreSQL rewrite machinery. ### No unlogged tables `CREATE UNLOGGED TABLE ... USING colcompress` is not supported (`ERRCODE_FEATURE_NOT_SUPPORTED`). Use the default `LOGGED` persistence. ### No speculative insertion (INSERT … ON CONFLICT on non-unique predicates) Speculative insertion (`INSERT … ON CONFLICT`) requires a unique index on the conflict target. Conflict detection on arbitrary predicates or without an index is not supported. --- ## Benchmarks Benchmark suite: 1 000 000 rows, PostgreSQL 18.3, AMD Ryzen 7 5800H (8-core), 40 GB RAM, `shared_buffers=10GB`. `colcompress` configured with `lz4` compression and `orderby = 'event_date ASC'` (globally sorted via `colcompress_merge`). Two scenarios are measured: | Scenario | Settings | |---|---| | **Serial** (storage baseline) | `JIT=off`, `max_parallel_workers_per_gather=0` | | **Parallel** (real-world simulation) | `JIT=on`, `max_parallel_workers_per_gather=16` | ### Serial results — median of 3 runs | Query | heap | colcompress | rowcompress | citus_columnar | |---|---|---|---|---| | Q1 `count(*)` | 38.6ms | 43.7ms | 305ms | 36.9ms | | Q2 `SUM/AVG` numeric+double | 182.3ms | 118.3ms | 356ms | 121.4ms | | Q3 `GROUP BY` country (10 vals) | 214.4ms | 162.3ms | 382ms | 141.4ms | | Q4 `GROUP BY` event_type + p95 | 538.2ms | 452.5ms | 680ms | 469.9ms | | **Q5 date range 1 month** | 21.1ms | **23.5ms** | 60.0ms | 21.1ms | | Q6 JSONB `@>` GIN | 121.7ms | 371.4ms | 322ms | 236.9ms | | Q7 JSONB key + GROUP BY | 386.2ms | 309.0ms | 537ms | 354.4ms | | Q8 array `@>` GIN | 61.3ms | 329.2ms | 272ms | 143.8ms | | Q9 LIKE text scan | 147.0ms | 88.3ms | 333ms | 90.3ms | | Q10 heavy multi-agg | 1908ms | 1902ms | 2067ms | 1914ms | Q5 on `colcompress` achieves heap-equivalent performance (23.5ms vs 21.1ms heap) because stripe pruning skips 6 of 7 stripes — data is physically sorted by `event_date` via the `orderby` option. lz4 decompression adds negligible overhead over zstd for this stripe-pruned workload while reducing merge time. ### Parallel results — median of 3 runs (JIT on, 16 workers) | Query | heap | colcompress | rowcompress | citus_columnar | |---|---|---|---|---| | Q1 `count(*)` | 17.8ms | 16.3ms | 144ms | 37.0ms | | Q2 `SUM/AVG` numeric+double | 50.1ms | 30.9ms | 142ms | 121.7ms | | Q3 `GROUP BY` country (10 vals) | 57.6ms | 171ms | 151ms | 138ms | | Q4 `GROUP BY` event_type + p95 | 539ms | 329ms | 686ms | 473ms | | Q5 date range 1 month | 21.2ms | 242ms | 69.5ms | 21.0ms | | Q6 JSONB `@>` GIN | 84.5ms | 42.8ms | 465ms | 235ms | | Q7 JSONB key + GROUP BY | 391ms | 87.7ms | 692ms | 349ms | | Q8 array `@>` GIN | 61.7ms | 33.3ms | 275ms | 147ms | | Q9 LIKE text scan | 48.7ms | 26.8ms | 140ms | 91.0ms | | Q10 heavy multi-agg | 1951ms | **691ms** | 2085ms | 1958ms | Note: Q5 shows higher latency for `colcompress` in parallel mode. Each parallel worker scans its assigned block range independently without a global stripe-pruning pass, so all stripes are read. Stripe pruning is effective only in the single-process sequential scan path. ### Reproducing the results The full benchmark kit is in `tests/bench/`: ```bash createdb bench_am psql -d bench_am -f tests/bench/setup.sql # Serial run bash tests/bench/run.sh 3 python3 tests/bench/chart.py # Parallel run bash tests/bench/run_parallel.sh 3 python3 tests/bench/chart_parallel.py ``` See [tests/README.md](tests/README.md) for full environment description and step-by-step instructions. > Benchmark results above correspond to version 1.0.4 with `lz4` compression and globally sorted stripes. --- ## PostgreSQL Version Compatibility | PostgreSQL | Status | |---|---| | 13 | Supported | | 14 | Supported | | 15 | Supported | | 16 | Supported | | 17 | Supported | | 18 | Supported (current development target) | --- ## Attribution `storage_engine` is a fork of **[Hydra Columnar](https://github.com/hydradatabase/hydra)**, which is itself derived from **[citus_columnar](https://github.com/citusdata/citus)** — the columnar extension originally built by Citus Data (now part of Microsoft). The original work is copyright Citus Data / Hydra and licensed under the AGPL-3.0. Key ideas borrowed or adapted from other projects: | Inspiration | Feature | |---|---| | [ClickHouse MergeTree](https://clickhouse.com/docs/engines/table-engines/mergetree-family/mergetree) | Per-table `orderby` sort key; zone-map (min/max) pruning at stripe and chunk level | | [Apache Parquet](https://parquet.apache.org/) | Row-group statistics; column projection; dictionary encoding | | [DuckDB](https://duckdb.org/) | Vectorized expression evaluation for columnar batches | We are grateful to the Hydra and Citus teams for making their work open source.