[← Back to Blog Index](README.md)

# DISTINCT That Doesn't Recount

## Reference counting for incremental deduplication

---

`SELECT DISTINCT` in a materialized view means a full table scan on every refresh. PostgreSQL has to see all the rows to determine which are unique. There's no shortcut — without knowing the full data set, you can't know if a row is duplicated.

pg_trickle has a shortcut: **reference counting.** Each distinct value gets a counter tracking how many source rows produce it. Insert a duplicate? Increment the counter. Delete a row? Decrement the counter. When the counter hits zero, the value is removed from the result.

No scan of existing data. O(delta) per refresh.

---

## The Problem

```sql
SELECT DISTINCT region, product_category
FROM orders
JOIN customers ON customers.id = orders.customer_id;
```

This query deduplicates (region, product_category) pairs. With 50 million orders across 200 unique pairs, a full refresh scans 50 million rows to produce 200 rows.

With IVM, 10 new orders come in. Do they create any new (region, product_category) pairs? Or are all 10 in existing pairs? To answer this without scanning, you need to know how many rows currently produce each pair.

---

## The __pgt_dup_count Column

pg_trickle maintains a hidden column `__pgt_dup_count` on stream tables that use DISTINCT. This column tracks the multiplicity of each row in the result — how many source rows produce it.

```sql
-- What the stream table actually stores (internal representation)
SELECT region, product_category, __pgt_dup_count
FROM pgtrickle.distinct_pairs;
```

```
  region    | product_category | __pgt_dup_count
------------+------------------+-----------------
 Northeast  | Electronics      | 12,847
 Northeast  | Clothing         | 8,432
 Southeast  | Electronics      | 15,291
 ...        | ...              | ...
```

The user-visible query (`SELECT * FROM distinct_pairs`) hides `__pgt_dup_count` — you just see the distinct values.

---

## Delta Rules

**INSERT a row that matches an existing distinct value:**
```
__pgt_dup_count += 1
```
No new row is added to the result. The value was already there.

**INSERT a row with a new distinct value:**
```
INSERT row with __pgt_dup_count = 1
```
New distinct value appears in the result.

**DELETE a row:**
```
__pgt_dup_count -= 1
If __pgt_dup_count = 0: DELETE the row from the result
```
The distinct value disappears only when all source rows producing it are gone.

**UPDATE a row (changes the distinct columns):**
```
Decrement old value's __pgt_dup_count (possibly remove it)
Increment new value's __pgt_dup_count (possibly insert it)
```

---

## Example

```sql
SELECT pgtrickle.create_stream_table(
  name  => 'active_regions',
  query => $$
    SELECT DISTINCT c.region
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'active'
  $$,
  schedule => '5s'
);
```

Initial state: 5 regions, each with thousands of active orders.

A customer in the "Pacific" region places their first order. Before this, "Pacific" had 0 active orders — it wasn't in the result. Now it has 1.

pg_trickle:
1. Sees the INSERT in the change buffer.
2. Joins with `customers` to get region = "Pacific".
3. Checks: does "Pacific" exist in the result? No → INSERT with `__pgt_dup_count = 1`.

Later, the only Pacific order is cancelled (status changes to 'cancelled', which is filtered out by `WHERE status = 'active'`). The effective delta is a DELETE of that row.

pg_trickle:
1. Decrements Pacific's `__pgt_dup_count` from 1 to 0.
2. Removes "Pacific" from the result.

All other regions are untouched. The refresh processes 1 row, not millions.

---

## DISTINCT ON

PostgreSQL's `DISTINCT ON` is a different feature from `DISTINCT`. It returns one row per group, ordered by a specified column:

```sql
SELECT DISTINCT ON (customer_id)
  customer_id, order_id, total, created_at
FROM orders
ORDER BY customer_id, created_at DESC;
```

This returns the most recent order per customer. It's a common pattern for "latest row per group."

pg_trickle handles `DISTINCT ON` with the same reference-counting approach, but the tie-breaking logic is more complex. The stream table maintains the winning row (based on the ORDER BY) and updates it when:

- A new row with a higher sort value is inserted (it becomes the new winner).
- The current winner is deleted (the next-best row becomes the winner).

This requires knowing what the "next-best" row is, which in turn requires a lookup against the source data. The cost is O(changed groups) — for each group that was affected by the delta, one query to find the new winner.

---

## DISTINCT with Expressions

DISTINCT can appear with computed columns:

```sql
SELECT DISTINCT date_trunc('month', created_at) AS month
FROM events;
```

The reference counting applies to the *computed* value, not the raw column. Two events on different days in the same month produce the same distinct value and share a `__pgt_dup_count`.

---

## Performance

The reference-count approach makes DISTINCT maintenance O(|ΔT|) — proportional to the number of changed rows, not the table size.

| Scenario | FULL refresh | DIFFERENTIAL refresh |
|----------|-------------|---------------------|
| 10 inserts, all in existing groups | ~500ms (full scan) | <1ms (10 counter increments) |
| 10 inserts, 2 new groups | ~500ms | <1ms (8 increments + 2 inserts) |
| 10 deletes, none empties a group | ~500ms | <1ms (10 counter decrements) |
| 10 deletes, 1 group drops to zero | ~500ms | <1ms (9 decrements + 1 delete) |

The only scenario where DIFFERENTIAL doesn't help is when the delta touches every group — but that's rare for DISTINCT queries, which by definition have fewer groups than rows.

---

## When Not to Use DISTINCT in Stream Tables

DISTINCT adds the `__pgt_dup_count` overhead to every row. If your query naturally produces unique rows (e.g., `GROUP BY` with a key that guarantees uniqueness), adding DISTINCT is redundant and wasteful.

Check with `EXPLAIN`:

```sql
EXPLAIN SELECT DISTINCT region, SUM(total)
FROM orders GROUP BY region;
```

If the `GROUP BY` already produces unique (region) rows, the DISTINCT is a no-op. Remove it — pg_trickle still maintains the stream table correctly, without the reference-counting overhead.

---

## Summary

DISTINCT in stream tables uses reference counting (`__pgt_dup_count`) to avoid full-scan deduplication. Insert increments, delete decrements, and rows are removed only when the count reaches zero.

The cost is O(delta), not O(table). For the common case — many source rows, few distinct values, small changes per cycle — this is orders of magnitude faster than recomputing.

DISTINCT ON works similarly but with tie-breaking logic. Remove DISTINCT if GROUP BY already ensures uniqueness. And don't worry about the hidden counter column — it's invisible to your queries.