[← Back to Blog Index](https://trickle-labs.github.io/pg-trickle/blog/) | [Documentation](https://trickle-labs.github.io/pg-trickle/)

# Why pg_trickle + DuckLake Is the Missing Piece for Lakehouse IVM

## DuckLake's own roadmap lists "materialized views and incremental maintenance" as a future feature. pg_trickle already delivers it — from inside the same database.

---

The data lakehouse movement promised to solve a real problem: the cost and
complexity of maintaining separate OLTP and OLAP systems. Formats like Apache
Iceberg and Delta Lake got the storage and cataloging right. But they left one
major gap: **incremental view maintenance** (IVM). The ability to keep a
pre-computed aggregation fresh as the underlying data changes — without
recomputing from scratch every time.

DuckLake v1.0, released in April 2026, makes a bold architectural bet: the
catalog lives in SQL. Not in JSON manifests on S3. Not in a proprietary REST
service. In a standard SQL database — most often PostgreSQL. That single
architectural choice changes what is possible for IVM in a way that no previous
lakehouse format has.

This post explains why, and why pg_trickle is the natural answer.

---

## The IVM Gap in Lakehouse Formats

Every modern analytics workload needs aggregated results:

- Monthly revenue by product category
- Customer lifetime value, segmented by cohort
- Rolling 7-day click-through rates
- Per-tenant storage consumption

The traditional answer is "run a batch job once a day and cache the results."
The modern answer is "use a materialized view and refresh it on a schedule."
Both answers share the same flaw: **they recompute from the full dataset on
every refresh cycle.**

A table with 100 million rows that receives 10,000 new rows per minute does not
need its aggregates recomputed from all 100 million rows every minute. It needs
the aggregates updated to reflect the 10,000 new rows — the delta. IVM is the
technique that does this correctly and efficiently.

Doing IVM correctly is hard. You need to:
1. Detect exactly which rows changed (CDC).
2. Determine which downstream aggregations are affected by those changes.
3. Apply the algebraic inverse of the aggregate function to remove old
   contributions and add new ones — without reading the rows that didn't change.
4. Handle JOINs (which can produce phantom rows when joined tables change).
5. Handle HAVING clauses, subqueries, and complex SQL correctly.

Every major streaming SQL engine — Flink, Materialize, RisingWave, Ksql — has
spent years getting these rules right. They are all correct. They are all also
**separate, external systems** that you must deploy, operate, and connect to
your lake via a CDC pipeline.

---

## Why DuckLake Changes the Equation

Apache Iceberg and Delta Lake store their catalog in files on S3 — JSON
manifests, Avro metadata, snapshot pointers scattered across thousands of small
files. The catalog is not SQL-accessible. To build IVM on top of Iceberg, you
need to:

1. Poll the S3 catalog for new snapshot pointers.
2. Read the new metadata files from S3.
3. Decode which data files are new.
4. Read those Parquet files from S3.
5. Apply the delta.

Steps 1–4 happen before you even see the changed rows. This is why every IVM
system built on Iceberg uses Kafka as a mediator — you emit change events from
step 3 into a Kafka topic and let the stream processor handle step 5.

DuckLake eliminates steps 1–4. When DuckLake commits a snapshot, the catalog
update is a few SQL `INSERT` statements into regular PostgreSQL tables. The
changed rows are already visible in the same PostgreSQL database. The snapshot
pointer is a PostgreSQL row. The data file list is a PostgreSQL table.

If the IVM engine lives in the same PostgreSQL instance — and pg_trickle does —
there is no poll cycle, no S3 read, no Kafka topic. The IVM engine can watch
the DuckLake catalog tables directly, using PostgreSQL's own trigger mechanism
or WAL-based CDC, and maintain aggregations in the same transaction.

---

## How pg_trickle Fills the Gap

pg_trickle is a PostgreSQL 18 extension that maintains SQL query results
incrementally. You define a query once:

```sql
SELECT pgtrickle.create_stream_table(
    name         => 'revenue_by_product',
    query        => $$
        SELECT product_id, SUM(amount) AS total_revenue
        FROM events_bridge
        GROUP BY product_id
    $$,
    schedule     => '5s',
    refresh_mode => 'DIFFERENTIAL'
);
```

From that point on, `revenue_by_product` is a regular PostgreSQL table that
stays fresh. When new rows arrive in `events_bridge`, pg_trickle processes only
those rows — not the entire table. The differential dataflow engine applies the
delta rules for each SQL operator (SUM, COUNT, JOIN, GROUP BY) and updates the
result table in place.

The DuckLake integration requires no new pg_trickle features. DuckLake writes
rows into PostgreSQL tables (either as inlined data, or via its metadata
catalog). pg_trickle watches those tables and maintains aggregations. The entire
IVM pipeline runs inside PostgreSQL, with zero external infrastructure.

---

## What DuckLake's Own Roadmap Says

From the DuckLake public roadmap, under "Future Features":

> **Materialized views and incremental maintenance**
> Pre-computed views that are automatically kept in sync with the underlying
> data. *Currently looking for funding.*

DuckLake's team knows this feature is needed. They have not yet built it because
doing IVM correctly is genuinely hard and expensive to maintain. pg_trickle has
tackled exactly this problem across more than 60 incremental releases: a
differential dataflow engine with operator-by-operator differentiation rules,
frontier-based version tracking, a full CDC pipeline, and a DAG scheduler —
all battle-tested across more than 1,300 test cases and validated against the
TPC-H benchmark suite. The project is pre-1.0 and actively advancing toward its stable release, and
the IVM engine is substantially complete.

The natural conclusion: for PostgreSQL-backed DuckLake deployments, pg_trickle
is the IVM engine DuckLake's roadmap is pointing at.

---

## Comparison with Alternative Approaches

| Approach | Works with DuckLake? | External infrastructure? | IVM quality |
|----------|---------------------|--------------------------|-------------|
| Batch recompute (pg_cron) | Yes | No | None (full scan) |
| Materialize / RisingWave | Yes, with Kafka CDC | Yes (separate cluster) | Excellent |
| Apache Flink | Yes, with Kafka CDC | Yes (JVM cluster) | Excellent |
| Snowflake Dynamic Tables | No (different system) | Snowflake account | Good |
| **pg_trickle** | **Yes, natively** | **No** | **Excellent** |

The key differentiator: pg_trickle is the only IVM engine that runs inside the
same database that hosts the DuckLake catalog. No CDC pipeline, no message
broker, no separate cluster, no cross-network latency.

---

## The Scalability Argument

DuckLake's design principle is "the catalog is small, the data is large." A
catalog with 1 million snapshots and 10 million data files still fits in a few
gigabytes of PostgreSQL storage. The Parquet data files on S3 might be
petabytes.

pg_trickle's IVM runs on the catalog side. The delta processing touches only
the PostgreSQL metadata — specifically, only the rows that changed in the last
refresh cycle. The physical Parquet files on S3 are never read by the IVM
engine.

For a DuckLake table that receives 1,000 row insertions per minute, the IVM
cost is proportional to 1,000 rows — not the total table size. For a 100-billion-
row DuckLake table, the IVM cost is still proportional to the delta, not the
history. This is the fundamental O(Δ) scaling property that makes pg_trickle
viable at data lake scale.

---

## What Comes Next

This release (v0.64.0) establishes the narrative and demonstrates the
integration with tutorials, demos, and documentation. The current integration
uses the foreign-table polling path — pg_trickle watches a bridge table that
mirrors DuckLake writes.

The next phase (v0.65.0) will deliver a native DuckLake change-feed adapter
that consumes DuckLake's `table_changes()` function directly, eliminating the
polling overhead and enabling sub-second latency even for large DuckLake tables.
The Phase 2 adapter will turn pg_trickle into a true production-grade IVM
engine for DuckLake — no bridge table, no polling, no full scans.

---

## Summary

DuckLake made the lakehouse catalog SQL-accessible. That one architectural
decision creates the conditions for genuine incremental view maintenance, because
the catalog, the delta, and the IVM engine can all run in the same database
process. pg_trickle is that IVM engine.

DuckLake's roadmap says "looking for funding" for IVM. pg_trickle has already
built it. The two projects fit together as naturally as they do because they made
the same fundamental bet: SQL-first, PostgreSQL-inside, no external infrastructure
required.

If you run DuckLake on PostgreSQL, you already have access to the best IVM
engine available for your data lake. It is called pg_trickle, and it is free.

---

*See also:*
- [Tutorial: Real-Time Dashboards on Your Data Lake](ducklake-real-time-dashboards.md)
- [Tutorial: The Modern Data Stack in One Box](ducklake-modern-data-stack.md)
- [Blog: DuckLake's `table_changes()` Meets pg_trickle's DVM Engine](ducklake-table-changes-dvm.md)
- [Foreign Tables as Stream Table Sources](foreign-table-sources.md)