# TPC-H Performance Comparison: ORCA vs PostgreSQL Native Planner

Test data from `test/results/tpch_orca.out` (pg_orca) vs `test/expected/tpch_pg.out` (stock PG
planner), both run on the same TPC-H SF=1 dataset with parallelism disabled.

## Execution Time

| Query | PG Exec (ms) | ORCA Exec (ms) | Ratio (ORCA/PG) | Winner |
|-------|-------------|----------------|-----------------|--------|
| Q1    | 10,394      | 10,781         | 1.04x           | ≈      |
| Q2    | 517         | 868            | 1.68x           | PG     |
| Q3    | 2,953       | 3,120          | 1.06x           | ≈      |
| Q4    | 1,184       | 4,330          | **3.66x**       | PG     |
| Q5    | 1,220       | 2,785          | **2.28x**       | PG     |
| Q6    | 2,238       | 2,209          | 0.99x           | ≈      |
| Q7    | 2,014       | 2,946          | 1.46x           | PG     |
| Q8    | 906         | 940            | 1.04x           | ≈      |
| Q9    | 8,650       | 4,355          | **0.50x**       | **ORCA 2x** |
| Q10   | 2,933       | 2,920          | 1.00x           | ≈      |
| Q11   | 548         | 672            | 1.23x ✓ (fixed) | ≈      |
| Q12   | 3,616       | 3,421          | 0.95x           | ≈      |
| Q13   | 1,696       | 1,766          | 1.04x           | ≈      |
| Q14   | 2,235       | 2,180          | 0.98x           | ≈      |
| Q15   | 2,232       | 2,219          | 0.99x           | ≈      |
| Q16   | 1,137       | 626            | **0.55x**       | **ORCA 1.8x** |
| Q17   | 1,843       | 2,168          | 1.18x ✓ (fixed) | ≈      |
| Q18   | 8,228       | 9,211          | 1.12x           | ≈      |
| Q19   | 151         | 158            | 1.04x           | ≈      |
| Q20   | 761         | 3,454          | **4.54x**       | PG     |
| Q21   | 3,829       | 9,670          | **2.53x**       | PG     |
| Q22   | 649         | 1,064          | 1.64x           | PG     |

**Total (22 queries):** PG ~59,935 ms vs ORCA ~71,863 ms — ORCA is **1.20x slower** overall.


## Key Findings

**ORCA wins:**
- Q9: 2x faster — better join order for the 8-way join with correlated subquery
- Q16: 1.8x faster — better plan for the `NOT IN` anti-join with part/supplier

**Remaining regressions:**
- Q20 (4.5x): subquery decorrelation gone wrong — see analysis below
- Q4 (3.7x): EXISTS subquery unnesting — see analysis below
- Q5 (2.3x): multi-table join chain, join order suboptimal
- Q21 (2.5x): complex multi-join with semi/anti-join
- Q2 (1.7x): join order / index usage

**Planning overhead:**
- ORCA planning is 50x–1150x slower than PG due to exhaustive search. On short-running
  queries this overhead is a significant fraction of total wall time.

---

## Regression Analysis

### Q11 — Fixed ✓

**Commit:** `46b96a6 cost: penalize index scan on non-leading composite key column`

**Root cause:** `partsupp_pkey` is `(ps_partkey, ps_suppkey)`. ORCA generated a Nested Loop
where the inner side was `Index Scan using partsupp_pkey` with condition
`ps_suppkey = supplier.s_suppkey` — using the *non-leading* column. Because the B-tree is
ordered by `ps_partkey` first, this required scanning almost the entire index (~2,196 pages)
per lookup. ORCA's cost model treated it identically to a leading-column scan, underestimating
the cost by ~50,000x.

**Fix:** In `CostIndexScan()` (`libgpdbcost/src/CCostModelGPDB.cpp`), when the predicate does
not reference the first key column of a composite index, scale `dIndexScanTupRandomFactor` by
`table_pages / rows_per_rebind`. This reflects that O(table_pages) index entries must be read
rather than a small range, making the seq-scan + hash-join alternative cheaper.

**Result:** Q11 execution time 11,281 ms → **672 ms** (16.8x improvement). The 2014 TODO
comment in `CostIndexScan` that requested exactly this logic was removed.

Also resolved as a side effect: **Q17** (previously commented out as unsupported) now runs
correctly at 2,168 ms (1.18x vs PG).

---

### Q20 — Open

**Symptom:** 3,454 ms vs PG 761 ms (4.5x slower). `work_mem` increase to 256 MB eliminates
disk spill but gives no improvement (3,618 ms) — disk spill was not the bottleneck.

**Root cause:** Q20 contains a correlated scalar subquery in a `HAVING`-style filter:

```sql
ps_availqty > (SELECT 0.5 * sum(l_quantity) FROM lineitem
               WHERE l_partkey = ps_partkey AND l_suppkey = ps_suppkey
               AND l_shipdate BETWEEN '1994-01-01' AND '1995-01-01')
```

ORCA **eagerly decorrelates** this into a full `HashAggregate(l_partkey, l_suppkey)` over all
of lineitem before joining with the outer query. PG keeps it as a correlated `SubPlan` and
evaluates it lazily only for the 8,508 qualifying `(ps_partkey, ps_suppkey)` pairs.

| | PG (lazy SubPlan) | ORCA (eager decorrelation) |
|--|---|---|
| lineitem rows processed | 8,508 × ~1.1 rows via Bitmap Index Scan | **909,455 rows** (full year scan) |
| Aggregate groups | 8,508 executions | **543,210 groups** (64x more) |
| Disk spill | none | Batches=5, 11,872 KB (at 64 MB work_mem) |
| ORCA cardinality estimate for Hash Join | — | `rows=1` (actual 5,833 — wildly wrong) |

The decorrelated plan processes 64x more lineitem data. ORCA's cardinality estimate for the
join result (`rows=1`, actual 5,833) masked the true cost and led to selecting this path.

**Fix direction:** The subquery unnesting transform should compare the expected size of the
decorrelated intermediate result vs. the cost of correlated execution (outer rows × subplan
cost). When `outer_qualifying_rows << decorrelated_result_rows`, keeping the SubPlan wins.

---

### Q4 — Open

**Symptom:** 4,330 ms vs PG 1,184 ms (3.66x slower).

**Root cause:** Q4 filters orders by date range and uses `EXISTS` to require at least one
late-delivery lineitem:

```sql
EXISTS (SELECT * FROM lineitem
        WHERE l_orderkey = o_orderkey AND l_commitdate < l_receiptdate)
```

ORCA unnests `EXISTS` into a **full deduplication + hash join**:

```
SeqScan lineitem (l_commitdate < l_receiptdate) → 3,793,296 rows
  HashAggregate(l_orderkey) → 1,375,365 distinct keys   (88 MB hash table)
    Hash Join with orders (57,218 rows)
```

PG uses a **Nested Loop Semi Join** with `lineitem_pkey(l_orderkey, l_linenumber)`:

```
SeqScan orders → 57,218 rows
  Index Scan lineitem_pkey (l_orderkey = o_orderkey, filter l_commitdate < l_receiptdate)
    57,218 lookups × ~0.92 rows (semi-join stops at first match)
```

| | PG (NL Semi Join + index) | ORCA (full dedup + hash join) |
|--|---|---|
| lineitem rows accessed | ~114K (57K × ~2 via index) | **3,793,296** (full scan) |
| Deduplication | none (semi-join semantics) | HashAggregate, 1.375M groups, 88 MB |
| Index used | `lineitem_pkey` leading col `l_orderkey` ✓ | none |
| ORCA row estimate | — | HashAggregate `rows=437,556` (actual 1,375,365, **3.1x off**) |

ORCA scans 33x more lineitem data. The cardinality underestimate (3.1x) hides the true hash
table size and causes ORCA to prefer this plan.

**Pattern shared with Q20 and Q21:** Q4, Q20, and Q21 all involve ORCA eagerly unnesting a
subquery into a full-scan + aggregate/dedup or Hash Semi/Anti Join before applying outer
filters, while PG's correlated or semi-join execution exploits available indexes and applies
filters early. The specific cause for Q4/Q21 is structural: ORCA lacks
`EopPhysicalLeftSemiIndexNLJoin` — see Q21 analysis for detail.

**Fix direction:** The EXISTS unnesting transform (`CXformExistSubq2Join` or equivalent) should
estimate the cost of keeping the correlated semi-join path. When the outer driving table is
small and an index covers the inner join key as a leading column, the nested-loop semi-join
is likely cheaper than full-scan + dedup.

---

### Q21 — Open

**Symptom:** 9,670 ms vs PG 3,829 ms (2.53x slower).

**Query structure:** Q21 finds suppliers who had waiting shipments. It joins `supplier`,
`lineitem l1` (late-delivery filter), `orders`, and uses:
- `NOT EXISTS` on `l3` (same order, different supplier, also late) — Anti Join
- `EXISTS` on `l2` (same order, different supplier) — Semi Join

**ORCA plan:**

```
Hash Join l1+supplier+nation ⋈ orders (729K-row hash, no spill)
  Hash Semi Join l1 ⋈ l2 (EXISTS, l_orderkey)
    Hash Anti Join l1 ⋈ l3 (NOT EXISTS, l_orderkey)
      Hash Join l1 ⋈ supplier ⋈ nation  →  156,739 rows
        Seq Scan l1 (l_receiptdate > l_commitdate)  →  3,793,296 rows
    Hash l3: Seq Scan lineitem (3,793,296 rows) — Batches=2, disk spill 6,957 pages
  Hash l2: Seq Scan lineitem (6,001,215 rows) — Batches=4, disk spill 15,376 pages
```

**PG plan:**

```
Nested Loop → orders (Index Scan orders_pkey, 8,357 lookups)
  Nested Loop Semi Join → l2 (Index Scan lineitem_pkey, 13,859 lookups × ~0.6 rows)
    Nested Loop Anti Join → l3 (Index Scan lineitem_pkey, 156,739 lookups × ~0.9 rows)
      Hash Join l1 ⋈ supplier ⋈ nation  →  156,739 rows
        Seq Scan l1 (l_receiptdate > l_commitdate)  →  3,793,296 rows
```

**Data volume comparison:**

| | PG (NL + index) | ORCA (Hash + full scan) |
|--|---|---|
| l3 rows read | ~142K (156,739 × 0.91 via index) | **3,793,296** (full scan + disk spill) |
| l2 rows read | ~8,300 (13,859 × 0.60 via index) | **6,001,215** (full scan + disk spill) |
| orders rows | 8,357 (index lookups) | 729,413 (full hash table) |
| Disk spill | none | l3: 6,957 pages; l2: 15,376 pages |

ORCA reads **26x** more l3 data and **721x** more l2 data than PG.

**Root cause: missing Semi/Anti-Semi Index NL Join operators**

ORCA's physical operator set includes:

| Operator | Exists? |
|----------|---------|
| `EopPhysicalInnerIndexNLJoin` | yes |
| `EopPhysicalLeftOuterIndexNLJoin` | yes |
| `EopPhysicalLeftSemiIndexNLJoin` | **no** |
| `EopPhysicalLeftAntiSemiIndexNLJoin` | **no** |

`CXformImplementIndexApply` (the transform that instantiates physical Index NL Join operators)
only produces `CPhysicalInnerIndexNLJoin` and `CPhysicalLeftOuterIndexNLJoin`. It has no
branches for semi or anti-semi variants.

Because these operators don't exist, ORCA **can never** use an index to probe the inner side
of an EXISTS / NOT EXISTS subquery. The only available physical alternatives are
`CPhysicalLeftSemiHashJoin` and `CPhysicalLeftAntiSemiHashJoin`, both of which require
materializing the entire inner relation into a hash table — hence the full scan of `lineitem`
(6M rows for l2, 3.8M for l3) and the resulting disk spill.

**Pattern shared with Q4:** Q4's EXISTS regression has the identical structural cause: the
correlated EXISTS subquery is unnested into a full scan + dedup rather than an indexed
semi-join, for the same reason — no `EopPhysicalLeftSemiIndexNLJoin` exists to exploit
`lineitem_pkey`.

**Fix direction:** Implement `CPhysicalLeftSemiIndexNLJoin` and
`CPhysicalLeftAntiSemiIndexNLJoin` physical operators (mirroring the inner/outer variants),
and extend `CXformImplementIndexApply` to instantiate them. This is a significant effort
requiring new physical operator classes, cost model entries, and DXL translator support, but
would address Q4, Q21, and any other query using EXISTS/NOT EXISTS with an indexed inner key.