[← Back to Blog Index](README.md)

# How pg_trickle Handles Diamond Dependencies

## The refresh ordering problem that nobody talks about until it causes double-counting

---

If you've used pg_trickle for more than a few stream tables, you've probably built a DAG. Stream table C depends on A and B. A and B each depend on the same source table. You refresh, and everything works.

But there's a subtle correctness problem hiding in that topology. It's called a diamond dependency, and if your IVM engine doesn't handle it, your aggregates will be wrong.

---

## The Diamond

Here's the shape:

```
                source_table
               /            \
              ▼              ▼
          st_A (agg1)    st_B (agg2)
              \              /
               ▼            ▼
              st_C (combines A and B)
```

`st_A` and `st_B` are both stream tables that depend on `source_table`. `st_C` is a stream table that JOINs or UNIONs the output of `st_A` and `st_B`.

The problem: when `source_table` changes, the delta needs to propagate through both A and B before C can safely refresh. If C refreshes after A but before B, it sees a partially-updated state. Depending on the query, this can cause double-counting, missing rows, or incorrect aggregates.

---

## A Concrete Example

Consider a sales analytics platform:

```sql
-- Source table
CREATE TABLE sales (
    id          bigserial PRIMARY KEY,
    region      text NOT NULL,
    product_id  bigint NOT NULL,
    amount      numeric(12,2) NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Branch A: revenue by region
SELECT pgtrickle.create_stream_table(
    'revenue_by_region',
    $$SELECT region, SUM(amount) AS revenue, COUNT(*) AS sale_count
      FROM sales GROUP BY region$$,
    schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);

-- Branch B: revenue by product
SELECT pgtrickle.create_stream_table(
    'revenue_by_product',
    $$SELECT product_id, SUM(amount) AS revenue, COUNT(*) AS sale_count
      FROM sales GROUP BY product_id$$,
    schedule => '2s', refresh_mode => 'DIFFERENTIAL'
);

-- Diamond tip: executive summary combining both
SELECT pgtrickle.create_stream_table(
    'exec_summary',
    $$SELECT
        'total' AS label,
        (SELECT SUM(revenue) FROM revenue_by_region) AS regional_total,
        (SELECT SUM(revenue) FROM revenue_by_product) AS product_total
    $$,
    schedule => '3s', refresh_mode => 'DIFFERENTIAL'
);
```

In a correct state, `regional_total` and `product_total` should always be equal — they're both `SUM(amount)` over the same source data, just grouped differently.

Without diamond-aware scheduling, here's what can happen:

1. A new sale for $1,000 is inserted into `sales`.
2. The scheduler refreshes `revenue_by_region` — it picks up the new sale.
3. The scheduler refreshes `exec_summary` — it sees the updated `revenue_by_region` but the old `revenue_by_product`.
4. `regional_total = $101,000`, `product_total = $100,000`. They disagree.
5. The scheduler refreshes `revenue_by_product` — now it catches up.
6. Next refresh of `exec_summary` fixes the inconsistency.

For a few seconds, your executive dashboard showed inconsistent numbers. In a financial system, this is a bug. In an audit, this is a finding.

---

## How pg_trickle Solves This

pg_trickle's DAG resolver identifies diamond dependencies at stream table creation time. When you create `exec_summary`, the engine discovers the diamond:

```
sales → revenue_by_region  → exec_summary
sales → revenue_by_product → exec_summary
```

Both paths originate from `sales` and converge at `exec_summary`. pg_trickle records this as a **diamond group**.

You can see it:

```sql
SELECT * FROM pgtrickle.diamond_groups();
```

This returns the set of stream tables that share a common ancestor and converge at a common descendant.

### The Scheduling Rule

The scheduler enforces a simple invariant: **all members of a diamond group must be refreshed to the same frontier before the convergence point is refreshed.**

In practice, this means:
1. `sales` changes.
2. The scheduler refreshes `revenue_by_region` from the change buffer.
3. The scheduler refreshes `revenue_by_product` from the same change buffer epoch.
4. Only after both have been refreshed to the same frontier does `exec_summary` become eligible for refresh.

Steps 2 and 3 can happen in any order, or even in parallel (if you have multiple worker slots). But step 4 is blocked until both are complete.

This is the **frontier tracker** at work. Each stream table has a frontier — a version vector that tracks which changes it has incorporated. The scheduler won't refresh a downstream table unless all its upstream dependencies have frontiers that are at least as advanced as the change epoch being processed.

---

## What This Costs

Diamond-aware scheduling adds a small amount of latency to the convergence point. Instead of refreshing `exec_summary` as soon as any upstream changes, it waits for all upstreams in the diamond group to catch up.

In practice, this wait is bounded by the slowest branch of the diamond — typically a few hundred milliseconds. For the correctness guarantee you get in return, this is a good trade.

If your topology doesn't have diamonds, there's zero overhead. The scheduler only applies the diamond constraint when `diamond_groups()` identifies a non-empty set of convergence points.

---

## Deeper Diamonds

Diamonds can be nested. Consider:

```
source_1 ──→ st_A ──→ st_C ──→ st_E
source_1 ──→ st_B ──→ st_C
source_2 ──→ st_B ──→ st_D ──→ st_E
source_2 ──→ ──────→ st_D
```

There are two diamonds here: one at `st_C` (from `source_1` via A and B) and one at `st_E` (from `source_2` via B/C and D). pg_trickle handles arbitrarily nested diamonds — the frontier tracker works at every level of the DAG.

---

## Detecting Diamonds in Existing Deployments

If you're adding a new stream table to an existing DAG and want to check whether it creates a diamond:

```sql
-- Before creating the new stream table, check the current DAG
SELECT * FROM pgtrickle.dependency_tree('your_new_stream_table');

-- After creating it, check for diamond groups
SELECT * FROM pgtrickle.diamond_groups();
```

pg_trickle also logs a notice when a diamond is detected at creation time:

```
NOTICE: stream table "exec_summary" creates a diamond dependency via
        "sales" → ["revenue_by_region", "revenue_by_product"].
        Refresh scheduling will ensure frontier consistency.
```

---

## Why Other Systems Get This Wrong

Most materialized view systems don't handle diamonds at all, because they don't have a concept of incremental refresh ordering. `REFRESH MATERIALIZED VIEW` is a full recomputation — there's no delta to propagate incorrectly, so there's no diamond problem. (There's also no performance, but that's a different post.)

External IVM systems that do handle incremental deltas often punt on diamonds by documenting it as a known limitation: "avoid creating cyclic or diamond dependency topologies." pg_trickle treats it as a core correctness requirement.