# Demo: OLTP-to-Lake Loop

*The full modern data stack in a single `docker compose up` — PostgreSQL writes to a DuckLake in MinIO, DuckDB reads it*

---

## What you'll build

You'll run a complete OLTP-to-data-lake pipeline that turns raw e-commerce orders
into a continuously-updated `revenue_by_region` analytics table queryable from
DuckDB — all on your laptop, with no cloud accounts, no Kafka, and no Spark.

A Python order generator inserts 10 orders per second into PostgreSQL. A
pg_trickle stream table aggregates those orders by region and minute,
differentially — touching only the rows that changed since the last cycle. Every
5 seconds, each delta is written as a Parquet file to MinIO (your local S3). A
JupyterLab notebook queries the growing history with DuckDB and plots revenue
trends.

**End-to-end latency from ORDER INSERT to DuckLake snapshot: typically under 2 seconds.**

---

## Background: The OLTP-to-lake problem

Most businesses run PostgreSQL (or another OLTP database) for transactional
workloads. But PostgreSQL is not designed for analytical queries that scan
millions of rows. The classic solution is to replicate data to a separate OLAP
system — but that requires CDC pipelines, stream processors, schema registries,
and careful consistency management.

pg_trickle collapses this stack. It sits inside PostgreSQL and maintains an
incrementally-refreshed aggregate (the stream table) that is both:

- **Queryable directly from PostgreSQL** in sub-millisecond reads (operational use)
- **Continuously published to a DuckLake** as Parquet deltas (analytical use)

There's one source of truth, one maintenance mechanism, and no consistency gap
between the operational and analytical views.

---

## Prerequisites

- **Docker Engine 24+** and **Docker Compose v2**
  Verify: `docker compose version`
- **4 GB free RAM** for the four containers
- No local installation of pg_trickle, DuckDB, or Python needed

---

## First startup: Image build

**⚠️ The first `docker compose up` will build the pg_trickle image from source.** This takes 2–3 minutes and requires an internet connection (Rust toolchain download).

**Subsequent restarts will use Docker cache and start in seconds.**

Once built, the image is cached locally.

---

## Architecture

```
Python order generator
  │  10 orders/second
  │  columns: order_id, region, amount, created_at
  ▼
PostgreSQL 18 + pg_trickle  (port 5432, db: postgres)
  │  orders table  ← trigger-based CDC on INSERT
  │
  └─ stream table: revenue_by_region
       query: SELECT region, date_trunc('minute', created_at) AS minute,
                     SUM(amount) AS total_revenue, COUNT(*) AS order_count
              FROM orders GROUP BY region, date_trunc('minute', created_at)
       5-second DIFFERENTIAL refresh
       DuckLake sink → pgtrickle.pgt_ducklake_provenance (provenance log)
       ▼
MinIO  (port 9000 / console 9001)
  s3://pg-trickle-demo/revenue_by_region/
  │  One Parquet delta file per refresh cycle
  │  Only the (region, minute) rows that changed in the last 5 seconds
  ▼
DuckLake catalog  (stored in PostgreSQL)
  ducklake_snapshot, ducklake_view, ducklake_data_file
  ▼
JupyterLab  (port 8888, token: demo)
  oltp_lake_analysis.ipynb
  DuckDB queries the Parquet history for trend analysis
```

---

## Step 1: Start the demo

```bash
cd demos/ducklake-oltp-lake
docker compose up
```

Four containers start:

| Container | Port | Purpose |
|-----------|------|---------|
| `postgres` | 5432 | PostgreSQL 18 + pg_trickle + DuckLake catalog |
| `minio` | 9000 / 9001 | S3-compatible object storage |
| `jupyter` | 8888 | JupyterLab with DuckDB analytics notebook |
| `generator` | — | Python script producing 10 orders/s |

Wait until all containers are healthy:
```
postgres  | LOG: database system is ready to accept connections
jupyter   | [JupyterLab] http://127.0.0.1:8888/lab?token=demo
```

---

## Step 2: Open JupyterLab

Go to **http://localhost:8888/lab?token=demo** and open `oltp_lake_analysis.ipynb`.

The notebook has three sections:

1. **Connect** — attaches DuckDB to the DuckLake catalog in PostgreSQL
2. **Query current revenue** — reads from the stream table directly for
   sub-millisecond results
3. **Analyse the Parquet history** — time-travel queries over the snapshot chain
   to plot how revenue by region evolved over the last N minutes

Run all cells top to bottom with **Run → Run All Cells**.

**Note:** The JupyterLab container includes DuckDB, psycopg2, and S3 tools pre-installed, so you can immediately query Parquet files and PostgreSQL without additional setup.

---

## Step 3: Connect to PostgreSQL and explore the pipeline

In a second terminal:

```bash
psql postgresql://postgres:postgres@localhost:5432/postgres
```

**What's happening behind the scenes:**

- The `orders` table is receiving inserts from the generator (10 orders/second)
- CDC triggers capture each insert into `pgtrickle_changes.changes_*`
- The pg_trickle scheduler runs every 5 seconds (configured schedule) and executes a DIFFERENTIAL refresh
- Each refresh applies the delta changes to `revenue_by_region` and publishes the delta to MinIO

The scheduler is **automatically enabled** via `shared_preload_libraries` in the PostgreSQL Docker configuration, so stream tables begin refreshing immediately at startup.

See the stream table status:

```sql
SELECT pgt_name, pgt_schema, refresh_mode, schedule, status, is_populated, last_refresh_at
FROM pgtrickle.pgt_stream_tables;
```

Expected output:
```
     pgt_name      | pgt_schema | refresh_mode | schedule | status | is_populated | last_refresh_at
-------------------+------------+--------------+----------+--------+--------------+------------------
 revenue_by_region | public     | DIFFERENTIAL | 5s       | ACTIVE | true         | 2026-05-21 09:42:36
```

Query the current revenue by region:

```sql
SELECT region, minute, total_revenue, order_count
FROM revenue_by_region
ORDER BY minute DESC, total_revenue DESC;
```

Watch the provenance trail — one row per Parquet file written to MinIO:

```sql
SELECT
    stream_table_name,
    ducklake_snapshot_id,
    delta_row_count,
    written_at,
    written_at - LAG(written_at) OVER (ORDER BY written_at) AS cycle_duration
FROM pgtrickle.pgt_ducklake_provenance
WHERE stream_table_name = 'revenue_by_region'
ORDER BY written_at DESC
LIMIT 10;
```

`delta_row_count` shows how many (region, minute) rows were updated in each 5-
second cycle. When the generator is running at steady state, this is typically
1–4 rows (only the current minute's region aggregates change), regardless of
how many total orders have been inserted.

---

## Step 4: Explore the Parquet files in MinIO

Open the MinIO console at **http://localhost:9001**.

- Login: `minioadmin` / `minioadmin`
- Navigate to `pg-trickle-demo` → `revenue_by_region/`

A new `.parquet` file appears roughly every 5 seconds (whenever orders have
arrived in the last cycle). Each file is a delta — not a full snapshot of
`revenue_by_region`. The DuckLake catalog in PostgreSQL tracks which files belong
to which snapshot, so DuckDB assembles the correct view when you query.

---

## Step 5: Understand the stream table SQL

The stream table is defined by this call (pre-loaded by `postgres/init.sql`):

```sql
SELECT pgtrickle.create_stream_table(
    'revenue_by_region',
    query => $$
        SELECT
            region,
            date_trunc('minute', created_at) AS minute,
            SUM(amount)                       AS total_revenue,
            COUNT(*)                          AS order_count
        FROM orders
        GROUP BY region, date_trunc('minute', created_at)
    $$,
    schedule           => '5s',
    refresh_mode       => 'DIFFERENTIAL',
    sink               => 'ducklake',
    ducklake_sink_path => 's3://pg-trickle-demo/revenue_by_region/'
);
```

**How DIFFERENTIAL refresh works for `SUM`:**
When new orders arrive, pg_trickle's DVM engine computes:
```
Δ_total_revenue(region, minute) = SUM(amount) over NEW_ORDERS
```
for each `(region, minute)` bucket that received new orders. It adds this delta
to the existing `revenue_by_region` values. Only the affected buckets are updated
— not a full `GROUP BY` recomputation over all orders.

---

## Troubleshooting

| Symptom | Likely cause | Fix |
|---------|-------------|-----|
| `revenue_by_region` is empty after 30 s | Generator not running | Check `docker compose logs generator` |
| Stream table shows `is_populated=true` but 0 rows | Scheduler not enabled (old Docker image) | Ensure `shared_preload_libraries=pg_trickle` is set; run `docker compose down && docker compose up` to rebuild |
| MinIO console shows empty bucket | DuckLake sink S3 write failed | Check `docker compose logs postgres` for S3 errors |
| JupyterLab shows "No such file or directory" on notebook | Jupyter started before volume was ready | Refresh the browser and re-open the notebook |
| `pgt_ducklake_provenance` has no rows | DuckLake sink not writing | Verify `sink = 'ducklake'` in `pgt_stream_tables` |
| DuckDB notebook gives "catalog not found" | pg_trickle catalog tables not initialised | Check `docker compose logs postgres` for init errors |
| Port 5432 conflict | Another PostgreSQL already running | Stop the local instance: `brew services stop postgresql` |
| Port 8888 conflict | Another Jupyter instance running | Stop it and re-run `docker compose up` |

---

## What you've built

- A **full OLTP-to-lake pipeline** in one `docker compose up`:
  write to PostgreSQL → aggregate differentially → publish Parquet deltas to
  MinIO → query history with DuckDB.
- **Sub-2-second end-to-end latency** from INSERT to DuckLake snapshot with no
  Kafka, no Flink, no Debezium.
- **Both operational and analytical access** from the same stream table: fast
  row reads from PostgreSQL, time-travel analytics from DuckDB.

---

## Stop and clean up

```bash
docker compose down -v
```

The `-v` flag removes the PostgreSQL and MinIO data volumes for a clean restart.

---

## Related resources

- [Tutorial 3: The Modern Data Stack in One Box](../../docs/tutorial-modern-data-stack-one-box.md)
- [Tutorial 4: Streaming PostgreSQL to a Data Lake](../../docs/tutorial-streaming-postgres-to-data-lake.md)
- [Demo: DuckLake Observability in a Box](../ducklake-observability/README.md)


A self-contained `docker compose up` demo of the full OLTP-to-data-lake loop:

1. Synthetic e-commerce orders land in a PostgreSQL `orders` table.
2. A pg_trickle stream table computes `revenue_by_region` incrementally.
3. The DuckLake sink publishes each delta to MinIO as a Parquet file.
4. A DuckDB notebook queries the historical Parquet files for trend analysis.

**End-to-end latency from order INSERT to DuckLake snapshot: typically under 2 seconds.**

---

## Quick Start

```bash
cd demos/ducklake-oltp-lake
docker compose up
```

- **JupyterLab**: http://localhost:8888 (token: `demo`) — open `oltp_lake_analysis.ipynb`
- **MinIO console**: http://localhost:9001 (minioadmin/minioadmin)

---

## Architecture

```
Order generator (Python)
  │  10 orders/second
  ▼
PostgreSQL 18 + pg_trickle
  │  orders table
  └─ stream table: revenue_by_region
       │  5-second DIFFERENTIAL refresh
       │  DuckLake sink (append mode)
       │  Provenance → pgtrickle.pgt_ducklake_provenance
       ▼
  MinIO (S3-compatible object store)
       │  Parquet delta files
       ▼
  DuckLake catalog
       │  ducklake_snapshot, ducklake_view
       ▼
  DuckDB (JupyterLab)
       └─ Trend analysis notebook
```

---

## Services

| Service | Port | Purpose |
|---------|------|---------|
| PostgreSQL 18 + pg_trickle | 5432 | OLTP source, stream table, DuckLake catalog |
| MinIO | 9000/9001 | S3-compatible object storage |
| JupyterLab | 8888 | DuckDB analytics notebook |
| Order generator | — | Python script producing 10 orders/s |

---

## The stream table

```sql
SELECT pgtrickle.create_stream_table(
    'revenue_by_region',
    query => $$
        SELECT
            region,
            date_trunc('minute', created_at) AS minute,
            SUM(amount)                       AS total_revenue,
            COUNT(*)                          AS order_count
        FROM orders
        GROUP BY region, date_trunc('minute', created_at)
    $$,
    schedule           => '5s',
    refresh_mode       => 'DIFFERENTIAL',
    sink               => 'ducklake',
    ducklake_sink_path => 's3://pg-trickle-demo/revenue_by_region/'
);
```

---

## Provenance query

```sql
SELECT
    stream_table_name,
    ducklake_snapshot_id,
    delta_row_count,
    written_at,
    written_at - LAG(written_at) OVER (ORDER BY written_at) AS cycle_duration
FROM pgtrickle.pgt_ducklake_provenance
WHERE stream_table_name = 'revenue_by_region'
ORDER BY written_at DESC
LIMIT 20;
```