[← Back to Blog Index](README.md)

# The Medallion Architecture Lives Inside PostgreSQL

## Bronze, Silver, Gold — without Spark, without Airflow, without leaving your database

---

The medallion architecture is a data engineering pattern. Raw data lands in a Bronze layer. It gets cleaned and deduplicated into Silver. Business-level aggregates live in Gold. Dashboards and applications read from Gold.

The pattern came from the Spark/Databricks world. Most implementations involve Spark jobs, Delta Lake tables, an Airflow DAG to orchestrate the pipeline, and a scheduler to run it all. The Bronze-to-Silver-to-Gold pipeline typically runs on a schedule — hourly, maybe every 15 minutes if you're aggressive.

pg_trickle implements the same architecture entirely inside PostgreSQL. No Spark. No Airflow. No external scheduler. Propagation time from Bronze to Gold: under 5 seconds.

---

## The Setup

Here's a concrete example: an e-commerce platform tracking orders, with fraud detection rules and executive-level KPIs.

### Bronze: Raw Ingest

Bronze is just your regular PostgreSQL tables. This is where your application writes.

```sql
CREATE TABLE orders (
    id          bigserial PRIMARY KEY,
    customer_id bigint NOT NULL,
    amount      numeric(12,2) NOT NULL,
    currency    text NOT NULL DEFAULT 'USD',
    status      text NOT NULL DEFAULT 'pending',
    created_at  timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE customers (
    id     bigint PRIMARY KEY,
    name   text NOT NULL,
    region text NOT NULL,
    tier   text NOT NULL DEFAULT 'standard',
    email  text
);
```

Nothing special here. No CDC pipelines, no event sourcing. Just tables.

### Silver: Cleaned, Enriched, Joined

The Silver layer is a stream table that joins, cleans, and enriches the raw data:

```sql
SELECT pgtrickle.create_stream_table(
    'silver_orders',
    $$SELECT
          o.id   AS order_id,
          o.customer_id,
          c.name AS customer_name,
          c.region,
          c.tier AS customer_tier,
          o.amount,
          o.currency,
          o.status,
          o.created_at,
          CASE
              WHEN o.amount > 10000 AND c.tier = 'standard' THEN true
              ELSE false
          END AS flagged_for_review
      FROM orders o
      JOIN customers c ON c.id = o.customer_id$$,
    schedule     => '2s',
    refresh_mode => 'DIFFERENTIAL'
);
```

`silver_orders` is a real table. You can index it, query it with arbitrary WHERE clauses, put a GIN index on it for full-text search if you want. It updates within 2 seconds of any change to `orders` or `customers`.

Notice the `flagged_for_review` column — Silver isn't just a raw copy, it's enriched with business logic at the SQL level.
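Because it's an ordinary table, you can add whatever indexes your dashboards and review queues need. A minimal sketch (the index names and column choices are illustrative, not part of pg_trickle's API):

```sql
-- Serve per-region, time-ordered dashboard queries from the Silver layer.
CREATE INDEX silver_orders_region_created_idx
    ON silver_orders (region, created_at DESC);

-- Keep the fraud-review queue fast: only flagged rows are indexed.
CREATE INDEX silver_orders_flagged_idx
    ON silver_orders (created_at DESC)
    WHERE flagged_for_review;
```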
### Gold: Business Aggregates

The Gold layer builds on Silver. It aggregates into the shape your dashboards and APIs actually need:

```sql
-- Regional revenue KPIs
SELECT pgtrickle.create_stream_table(
    'gold_revenue_by_region',
    $$SELECT
          region,
          date_trunc('day', created_at) AS day,
          COUNT(*)    AS order_count,
          SUM(amount) AS revenue,
          AVG(amount) AS avg_order_value,
          COUNT(*) FILTER (WHERE flagged_for_review) AS flagged_count
      FROM silver_orders
      WHERE status != 'cancelled'
      GROUP BY region, date_trunc('day', created_at)$$,
    schedule     => '3s',
    refresh_mode => 'DIFFERENTIAL'
);

-- Customer lifetime value
SELECT pgtrickle.create_stream_table(
    'gold_customer_ltv',
    $$SELECT
          customer_id,
          customer_name,
          region,
          customer_tier,
          SUM(amount)     AS lifetime_value,
          COUNT(*)        AS total_orders,
          MAX(created_at) AS last_order_at
      FROM silver_orders
      WHERE status != 'cancelled'
      GROUP BY customer_id, customer_name, region, customer_tier$$,
    schedule     => '3s',
    refresh_mode => 'DIFFERENTIAL'
);
```

---

## How the DAG Works

pg_trickle knows that `gold_revenue_by_region` depends on `silver_orders`, which depends on `orders` and `customers`. This forms a directed acyclic graph:

```
orders ────┐
           ├──→ silver_orders ──┬──→ gold_revenue_by_region
customers ─┘                    └──→ gold_customer_ltv
```

The scheduler respects this ordering automatically. It won't refresh `gold_revenue_by_region` until `silver_orders` is up to date. You don't need to coordinate schedules — set the schedule on each layer and pg_trickle handles the propagation.

You can inspect the DAG:

```sql
SELECT * FROM pgtrickle.dependency_tree('gold_revenue_by_region');
```

This returns the full dependency chain, including depth level and refresh ordering.

---

## Why This Is Better Than the Spark Version

### Latency

The Spark medallion pipeline runs on a schedule — typically every 15 minutes to an hour. Between runs, your Gold layer is stale.

pg_trickle's pipeline runs continuously. The end-to-end propagation time from an INSERT into `orders` to an updated row in `gold_revenue_by_region` is the sum of the schedules: 2 seconds (Silver) + 3 seconds (Gold) = 5 seconds worst case.

### Infrastructure

The Spark version requires:

- A Spark cluster (or Databricks workspace)
- Delta Lake or Iceberg for Bronze/Silver/Gold storage
- An Airflow/Dagster/Prefect DAG for orchestration
- A scheduler
- Monitoring for each step
- Permissions and credentials across systems

The pg_trickle version requires:

- PostgreSQL with the pg_trickle extension installed

That's it. The scheduler, CDC, DAG resolution, monitoring, and storage are all inside PostgreSQL.

### Consistency

In the Spark version, each layer is independently computed. If the Silver job fails halfway through, Gold may have partial data. Airflow retries help, but the state management is external.

In pg_trickle, each refresh cycle is a PostgreSQL transaction. It either fully commits or fully rolls back. There's no partial state. If a refresh fails, the stream table retains its previous consistent state and the scheduler retries on the next cycle.

### Cost

A dedicated Spark cluster costs real money — even on spot instances, the compute budget for a medallion pipeline is significant for small-to-medium teams.

pg_trickle runs on your existing PostgreSQL instance. The marginal cost is CPU time during refresh cycles, which for most workloads is negligible.
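A nice side effect of everything living in one database is that the latency claim above is trivial to check: insert a row into Bronze and poll Gold from the same session. A minimal sketch (the `customer_id` value is illustrative and must already exist in `customers`):

```sql
-- Write a single test order into the Bronze layer.
INSERT INTO orders (customer_id, amount, status)
VALUES (42, 125.00, 'paid');

-- Within roughly 5 seconds (2s Silver + 3s Gold), today's regional totals should move.
SELECT region, day, order_count, revenue
FROM gold_revenue_by_region
WHERE day = date_trunc('day', now())
ORDER BY region;
```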
---

## Adding More Layers

The DAG isn't limited to three levels. You can chain as many stream tables as your use case needs:

```sql
-- A "platinum" layer: top-10 customers per region, for the exec dashboard
SELECT pgtrickle.create_stream_table(
    'platinum_top_customers_by_region',
    $$SELECT region, customer_id, customer_name, lifetime_value
      FROM (
          SELECT region, customer_id, customer_name, lifetime_value,
                 row_number() OVER (PARTITION BY region
                                    ORDER BY lifetime_value DESC) AS rank_in_region
          FROM gold_customer_ltv
      ) ranked
      WHERE rank_in_region <= 10$$,
    schedule     => '5s',
    refresh_mode => 'DIFFERENTIAL'
);
```

pg_trickle supports arbitrary DAG depth. Each additional layer adds its schedule interval to the end-to-end propagation time, but the refresh cost per layer is proportional only to the delta at that layer — not the total data volume.

---

## Monitoring the Pipeline

Since every stream table has its own refresh stats, you can monitor the pipeline the same way you'd monitor any pg_trickle stream table:

```sql
-- Check freshness across all layers
SELECT
    st.pgt_name,
    st.refresh_mode,
    st.schedule,
    now() - st.data_timestamp AS staleness,
    st.consecutive_errors
FROM pgtrickle.pgt_status() st
WHERE st.pgt_name LIKE 'silver_%'
   OR st.pgt_name LIKE 'gold_%'
   OR st.pgt_name LIKE 'platinum_%'
ORDER BY st.pgt_name;
```

If you enable pg_trickle's self-monitoring (which itself uses stream tables), the extension watches its own refresh latency and alerts when a layer falls behind.

---

## When to Still Use Spark

pg_trickle's medallion architecture works inside a single PostgreSQL instance (or a Citus cluster). If your Bronze layer is petabytes of Parquet files in S3, Spark is the right tool. If your Bronze layer is a PostgreSQL table that your application writes to, the Spark pipeline was always overkill.

The dividing line is roughly: if your data fits in PostgreSQL, your medallion architecture should too.