[← Back to Blog Index](https://trickle-labs.github.io/pg-trickle/blog/) | [Documentation](https://trickle-labs.github.io/pg-trickle/) # The Modern Data Stack in One Box ## Replace five infrastructure systems with one PostgreSQL instance and an S3 bucket --- The modern data stack has a well-known problem: it requires too many systems. A typical company running analytics in 2025 maintains: 1. **An OLTP database** (PostgreSQL) — the application database 2. **A CDC pipeline** (Debezium or AWS DMS) — captures changes from the OLTP DB 3. **A message broker** (Kafka) — buffers the change stream 4. **A stream processor** (Flink or Spark Streaming) — computes rolling aggregations 5. **A data lake** (Apache Iceberg on S3) — stores historical data cheaply 6. **A catalog service** (Apache Polaris or Nessie) — makes the lake queryable 7. **An analytics engine** (DuckDB, Trino, or Spark) — ad-hoc queries 8. **A workflow orchestrator** (Airflow or Dagster) — wires everything together Each system has its own operations team, its own failure modes, and its own cloud bill. A rough industry estimate puts the total cost of running this stack for a mid-size company at $40,000–$200,000 per month. This tutorial shows how to replace most of that with one PostgreSQL instance and an S3 bucket — using pg_trickle for incremental aggregations and DuckLake for lakehouse storage. No Kafka. No Flink. No JVM. No separate orchestration. --- ## The Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ PostgreSQL 18 │ │ │ │ ┌─────────────────┐ ┌───────────────────────────────────┐ │ │ │ OLTP Tables │ │ pg_trickle │ │ │ │ (orders, │───▶│ Stream tables: │ │ │ │ products, │ │ • revenue_by_day │ │ │ │ customers) │ │ • inventory_forecast │ │ │ └─────────────────┘ │ • customer_lifetime_value │ │ │ │ • product_performance │ │ │ ┌─────────────────┐ └────────────────┬──────────────────┘ │ │ │ DuckLake │ │ │ │ │ Catalog │◀────────────────────┘ (results published) │ │ │ (~28 tables) │ │ │ └─────────────────┘ │ └──────────────────────────────┬──────────────────────────────────┘ │ S3 / MinIO (Parquet data files) │ ┌──────────────────┼──────────────────┐ │ │ │ DuckDB Apache Spark Trino (ad-hoc queries) (batch processing) (SQL queries) ``` The one architectural insight that makes this possible: **DuckLake stores its entire catalog in PostgreSQL.** The metadata tables for your data lake are just regular PostgreSQL tables. pg_trickle lives in that same database. There are no network hops, no serialisation costs, no authentication boundaries between the incremental compute engine and the lake catalog. --- ## What You'll Build This tutorial builds the OLTP-to-lake pipeline for a small SaaS company: - **OLTP tables:** `orders`, `products`, `customers` - **pg_trickle stream tables:** real-time revenue, inventory, and CLV aggregations - **DuckLake tables:** historical archive of the same data in Parquet on S3 - **DuckDB queries:** ad-hoc analytics over the full history plus the live aggregations --- ## Prerequisites - PostgreSQL 18 with pg_trickle installed - DuckDB CLI with the `ducklake` extension - MinIO or AWS S3 - Python 3.11+ (for the data generator script) --- ## Step 1: Create the OLTP Tables ```sql -- Standard OLTP schema CREATE TABLE customers ( id SERIAL PRIMARY KEY, email TEXT NOT NULL UNIQUE, name TEXT NOT NULL, plan TEXT NOT NULL DEFAULT 'free', -- 'free', 'pro', 'enterprise' created_at TIMESTAMPTZ NOT NULL DEFAULT now() ); CREATE TABLE products ( id SERIAL PRIMARY KEY, sku TEXT NOT NULL UNIQUE, name TEXT NOT NULL, category TEXT NOT NULL, price_usd NUMERIC(10,2) NOT NULL, stock_count INT NOT NULL DEFAULT 0 ); CREATE TABLE orders ( id BIGSERIAL PRIMARY KEY, customer_id INT NOT NULL REFERENCES customers(id), product_id INT NOT NULL REFERENCES products(id), quantity INT NOT NULL DEFAULT 1, total_usd NUMERIC(10,2) NOT NULL, status TEXT NOT NULL DEFAULT 'pending', -- 'pending', 'shipped', 'cancelled' ordered_at TIMESTAMPTZ NOT NULL DEFAULT now() ); ``` --- ## Step 2: Create pg_trickle Stream Tables ```sql CREATE EXTENSION IF NOT EXISTS pg_trickle; -- Revenue by day and product category SELECT pgtrickle.create_stream_table( name => 'revenue_by_day', query => $$ SELECT date_trunc('day', o.ordered_at) AS day, p.category, SUM(o.total_usd) AS revenue_usd, COUNT(*) AS order_count, COUNT(DISTINCT o.customer_id) AS unique_customers FROM orders o JOIN products p ON p.id = o.product_id WHERE o.status = 'shipped' GROUP BY date_trunc('day', o.ordered_at), p.category $$, schedule => '10s', refresh_mode => 'DIFFERENTIAL' ); -- Customer lifetime value SELECT pgtrickle.create_stream_table( name => 'customer_lifetime_value', query => $$ SELECT c.id, c.email, c.plan, SUM(o.total_usd) AS lifetime_value_usd, COUNT(*) AS total_orders, MAX(o.ordered_at) AS last_order_at, MIN(o.ordered_at) AS first_order_at FROM customers c LEFT JOIN orders o ON o.customer_id = c.id AND o.status = 'shipped' GROUP BY c.id, c.email, c.plan $$, schedule => '30s', refresh_mode => 'DIFFERENTIAL' ); -- Inventory alert: products with low stock relative to recent demand SELECT pgtrickle.create_stream_table( name => 'inventory_health', query => $$ SELECT p.id, p.sku, p.name, p.stock_count, COALESCE(recent.units_sold_7d, 0) AS units_sold_7d, CASE WHEN p.stock_count = 0 THEN 'OUT_OF_STOCK' WHEN p.stock_count < COALESCE(recent.units_sold_7d, 0) THEN 'LOW_STOCK' ELSE 'OK' END AS stock_status FROM products p LEFT JOIN ( SELECT product_id, SUM(quantity) AS units_sold_7d FROM orders WHERE status = 'shipped' AND ordered_at >= now() - interval '7 days' GROUP BY product_id ) recent ON recent.product_id = p.id $$, schedule => '1m', refresh_mode => 'DIFFERENTIAL' ); ``` --- ## Step 3: Set Up DuckLake for Historical Storage In DuckDB, set up the DuckLake catalog pointing at your PostgreSQL instance: ```sql -- DuckDB shell INSTALL ducklake; LOAD ducklake; -- Attach PostgreSQL as the catalog database ATTACH 'ducklake:postgres:host=localhost dbname=your_db user=postgres password=secret' AS company_lake (TYPE DUCKLAKE, DATA_PATH 's3://company-data-lake/'); USE company_lake; -- Create historical mirror tables in DuckLake CREATE TABLE orders_archive AS SELECT * FROM postgres_query( 'host=localhost dbname=your_db user=postgres password=secret', 'SELECT * FROM orders' ); CREATE TABLE revenue_archive AS SELECT * FROM postgres_query( 'host=localhost dbname=your_db user=postgres password=secret', 'SELECT * FROM revenue_by_day' ); ``` The `revenue_archive` table in DuckLake is a snapshot of the pg_trickle stream table results. You can query it from any DuckDB client, Spark, or Trino — or join it against the live `revenue_by_day` stream table via `postgres_scanner`. --- ## Step 4: Query Across Live and Historical Data With DuckDB's `postgres_scanner`, you can join DuckLake's historical Parquet data with pg_trickle's live stream tables in a single query: ```sql -- DuckDB shell: join historical archive with live stream table results WITH historical AS ( SELECT * FROM company_lake.revenue_archive WHERE day < current_date - 30 ), live AS ( SELECT * FROM postgres_scan( 'host=localhost dbname=your_db', 'public', 'revenue_by_day' ) WHERE day >= current_date - 30 ) SELECT category, SUM(revenue_usd) AS total_revenue_90d, COUNT(DISTINCT day) AS active_days FROM (SELECT * FROM historical UNION ALL SELECT * FROM live) GROUP BY category ORDER BY total_revenue_90d DESC; ``` This query seamlessly combines: - **Historical data**: old Parquet files in DuckLake (cheap, columnar, S3-stored) - **Live data**: pg_trickle stream tables in PostgreSQL (sub-second fresh) No ETL pipeline moves data between systems. No sync job runs at midnight. The two datasets live in different physical locations but are queryable as one via the shared PostgreSQL metadata layer that DuckLake provides. --- ## Step 5: Dashboard in Grafana Connect Grafana to your PostgreSQL instance and query the stream tables directly: ```sql -- Panel: Daily revenue trend (last 30 days) SELECT $__timeGroupAlias(day, '1d'), category, SUM(revenue_usd) AS "Revenue (USD)" FROM revenue_by_day WHERE $__timeFilter(day) GROUP BY 1, category ORDER BY 1; ``` ```sql -- Panel: Customer value distribution by plan SELECT plan, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY lifetime_value_usd) AS p50_ltv, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY lifetime_value_usd) AS p90_ltv, AVG(lifetime_value_usd) AS avg_ltv FROM customer_lifetime_value GROUP BY plan; ``` ```sql -- Panel: Inventory alerts SELECT sku, name, stock_count, units_sold_7d, stock_status FROM inventory_health WHERE stock_status != 'OK' ORDER BY stock_status DESC, units_sold_7d DESC; ``` --- ## System Comparison | Task | Classic Stack | This Stack | |------|--------------|------------| | OLTP storage | PostgreSQL | PostgreSQL | | CDC pipeline | Debezium + Kafka | pg_trickle (built in) | | Stream processing | Flink / Spark Streaming | pg_trickle | | Historical storage | Iceberg + catalog service | DuckLake on S3 | | Ad-hoc analytics | Trino / Spark | DuckDB | | Dashboards | Grafana → Kafka/Flink | Grafana → PostgreSQL | | Orchestration | Airflow / Dagster | Not needed | | Total systems | 7–9 | 2 | | Estimated monthly cost | $40k–$200k | $500–$5k | The cost difference is real. The complexity difference is visible in the architecture diagram above: the classic stack has seven nodes with seven sets of configuration, deployment pipelines, monitoring, and on-call runbooks. This stack has one node for compute (PostgreSQL) and one for storage (S3). --- ## What You Lose This simplified stack has genuine limitations: - **Sub-100ms latency at extreme scale:** If you need single-digit millisecond refresh latency for millions of rows per second, the dedicated stream processors (Flink, Materialize) will outperform pg_trickle's refresh cycle approach. - **Multi-database DAGs:** All stream tables must live in one PostgreSQL instance. Cross-database dependencies require federation via foreign tables. - **Non-SQL sources:** Kafka topics, IoT device streams, and webhook feeds require a bridge layer to land data in PostgreSQL before pg_trickle can watch them. - **Petabyte-scale DuckLake tables:** For enormous tables, the full-scan polling CDC path is too expensive. Use the DuckLake inlined-data trigger path or wait for the Phase 2 DuckLake change-feed adapter (planned for v0.65.0). For most analytics workloads at a company processing under 10M events/day, none of these limitations are binding constraints. --- ## Summary The "modern data stack" complexity problem is real, but it is also largely self-inflicted. DuckLake and pg_trickle combine two radical simplifications — "the lake catalog is just SQL" and "incremental aggregations are just SQL" — into a stack that fits in your head and on a single server. One PostgreSQL instance. One S3 bucket. Sub-second fresh aggregations. Historical queries over years of Parquet data. Live dashboards without a streaming cluster. The full story is now just boring infrastructure. --- *See also:* - [Tutorial: Real-Time Dashboards on Your Data Lake](ducklake-real-time-dashboards.md) - [Tutorial: Monitoring Your DuckLake with pg_trickle](ducklake-monitoring.md) - [Blog: Why pg_trickle + DuckLake Is the Missing Piece for Lakehouse IVM](ducklake-ivm-missing-piece.md) - [Foreign Tables as Stream Table Sources](foreign-table-sources.md)