# pg_orca TODO

## Status snapshot (as of 2026-05-19, commit `29fc891`)

| Workload | ORCA vs PG (par=0) | ORCA vs PG (par=2) | Coverage edge | Source |
|---|---|---|---|---|
| TPC-H sf=10 | 1.12× geomean (ORCA total 636.7 s vs PG 657.9 s) | **0.49× geomean** (ORCA 750.4 s vs PG 328.2 s) | — | [`test/bench/results/tpch_sf10_three_way_summary.md`](test/bench/results/tpch_sf10_three_way_summary.md) |
| TPC-H sf=5  | 1.07× geomean | — | — | par=0 baseline |
| TPC-DS sf=5 | sum: ORCA 1458 s vs PG 1492 s (common 90) | — | **9 / 99 queries ORCA completes that PG times out (≥120 s)** | `test/bench/results/tpcds_tpcds_sf5_compare.csv` |
| TPC-DS sf=1 | ORCA 627 s vs PG 513 s (common 94) — ORCA 1.22× slower in total | — | **5 / 99 queries ORCA completes that PG times out (≥120 s)** | `test/bench/results/run_tpcds_sf1_20260509_1112.log` |

## Strengths — keep leveraging

1. **Correlated subquery decorrelation — ORCA's signature capability.**
   - ORCA rewrites `Apply` algebraically into a regular `Join` via the `CXformApply2Join` family (`libgpopt/src/xforms/CXformInnerApply2InnerJoin.cpp`, `CXformLeftSemiApply2LeftSemiJoin.cpp`, `CXformLeftOuterApply2LeftOuterJoin.cpp`, and ~10 sibling xforms). The result is a single optimizable plan; PG often falls back to a `SubPlan` / `InitPlan` executed once per outer row.
   - PG handles `EXISTS` / `NOT EXISTS` adequately; **everything else** (`= (SELECT ...)`, `IN (SELECT ...)`, nested correlation, subquery with aggregate) is where ORCA opens large gaps.
   - Wins (sample, par=0):

      | Query | Pattern | Speedup |
      |---|---|---:|
      | TPC-H Q17 | `l_quantity < 0.2 * AVG(...) WHERE p_partkey = ...` | **20.7×** (par=0) / 20.4× (par=2) |
      | TPC-DS Q41 | correlated IN + EXISTS | **156×** at sf=5 |
      | TPC-DS Q21 | correlated min/max with HAVING | 7.3× |
      | TPC-DS Q17 | correlated aggregate | 7.3× |
   - **Drives the TPC-DS coverage edge**: 7 of 9 PG-timeout queries at sf=5 (Q1, Q6, Q11, Q14, Q30, Q74, Q81 — Q4 and Q95 mix in window/roll-up) are correlated-subquery dominated.
   - **Orthogonal to parallelism** — Q17 still wins 20× at par=2. Structural plan-shape decision, not hardware throughput.

2. **Cost-based exhaustive join-order enumeration (DPv2).**
   - `libgpopt/src/xforms/CJoinOrderDPv2.cpp` + the commutativity / associativity xforms enumerate the full join space under a cost model that uses ORCA's cardinality estimates (not PG's). Result: shape selection (left-deep vs bushy vs star) is cardinality-driven, not heuristic.
   - PG comparison: `geqo_threshold = 12` switches to genetic algorithm above 12 relations; below it uses greedy DP. Neither path uses bushy plans aggressively, and both are sensitive to PG's row estimates.
   - Wins (sample, par=0):

      | Query | Pattern | Speedup |
      |---|---|---:|
      | TPC-H Q4  | 2-way + correlated EXISTS, anti-semi shape | 2.55× |
      | TPC-H Q10 | 4-way + filter + ORDER BY LIMIT | 1.71× |
      | TPC-H Q21 | 4-way + NOT EXISTS, anti-semi tree | 1.26× |
      | TPC-DS Q25 | 6-way roll-up join | 9.0× |
      | TPC-DS Q29 | 6-way roll-up join | 7.0× |
   - Controlled by `optimizer_join_order_threshold = 10` (in `gpopt/utils/COptTasks.cpp:402` — currently not exposed as a `pg_orca.*` GUC; raising it widens the search at the cost of more planning time).
   - **Orthogonal to parallelism** — join-tree shape decisions survive intact at par=2.

3. **Statistics propagation through GROUP BY / DISTINCT in CTE / subquery is correct.**
   - `CGroupByStatsProcessor::CalcGroupByStats` preserves the full input histogram (incl. NDV) on grouping columns; CTE consumer inherits stats via colid mapping.
   - On TPC-DS Q31 this gives **3.28× vs PG** at sf=5 (PG: 37.4 s vs ORCA: 11.4 s) — same root cause as upstream Richard Guo's 2026-04 patch (`stadistinct` propagation through GROUP BY in CTEs).
   - Once that patch lands in PG, this particular gap will narrow. Hunt for the next family of stats-propagation wins before then (multi-column NDV, cross-column correlation, correlated-subquery decorrelation).

4. **TPC-DS coverage edge — ORCA solves queries PG cannot complete.**
   - At `statement_timeout=120 s`, sf=5: ORCA finishes **9 / 99** queries that PG times out on (Q1, Q4, Q6, Q11, Q14, Q30, Q74, Q81, Q95). Same setting at sf=1: **5 / 99** (Q1, Q4, Q6, Q11, Q74). These are not "ORCA marginally faster" — they are **complex multi-join / CTE / window queries where PG's planner picks a plan so bad it never returns**.
   - **Full 99-query total once PG timeouts are counted (floor: 120 s each)**:

      | Suite | ORCA total (all 99) | PG total (lower bound) | ORCA edge |
      |---|---:|---:|---:|
      | TPC-DS sf=5 | 1,836 s (1,458 s common + 378 s on the 9 PG-timeouts) | ≥ 2,572 s (1,492 s + 9 × 120 s) | **≥ 1.40×** |
      | TPC-DS sf=1 | 627 s | ≥ 1,113 s (513 s + 5 × 120 s) | **≥ 1.77×** |

      `120 s` is the timeout floor; PG's true runtime on those queries is open-ended (could be many minutes — see Richard Guo's 25.6 min on Q31 in upstream master). The real ORCA edge is strictly larger than the numbers above.
   - Within the head-to-head subset at sf=5, ORCA's distribution is bimodal: many small losses (0.4–0.9×) but a few massive wins. Top wins (sf=5): **Q41 156×, Q25 9.0×, Q21 7.3×, Q17 7.3×, Q29 7.0×, Q31 5.8×, Q39 5.8×, Q55 4.9×**. Sum-of-times still nets in ORCA's favor on the common 90 (1458 s vs 1492 s).
   - These wins cluster around: multi-CTE self-joins (Q31/Q39), correlated subqueries (Q17/Q41), and roll-up aggregations (Q21/Q29). Same family as TPC-H Q17 — **structural plan-shape wins**.
   - Worst losses (sf=5): Q61 (0.001×, ORCA 1270 ms vs PG 0.6 ms — pure planning overhead on a trivial query, see weakness #1), Q82/85/37/91/92 (0.06–0.18×). These are short queries where ORCA's planning cost dominates, or specific plan-shape regressions worth case-by-case investigation.

5. **Single-thread baseline is the right benchmark for optimizer-quality work.** par=2 numbers reflect hardware throughput, not optimizer changes.

## Weaknesses — actively hurting us

### 1. Planning time ⚠️ kills short queries

**Impact**: ORCA's CBO is consistently heavier than PG planner.

| Workload | ORCA planning | PG planning |
|---|---:|---:|
| Simple point lookup (1-row return) | 13.7 ms | 0.25 ms |
| Simple aggregate over 1 table | 3.8 ms | 0.14 ms |
| TPC-H Q4 (Inner Join + EXISTS + GroupBy) | 169 ms | 5.5 ms |
| TPC-DS Q31 (2 CTEs, 6-way self-join) | 192 ms | 3.6 ms |

For OLTP / short-running queries (< 50 ms exec), the planning overhead **dominates total latency**. A point lookup under ORCA takes ~14 ms wall-clock where PG takes < 1 ms — even if the executed plan is identical.

**Why this matters**:
- Disqualifies pg_orca from any latency-sensitive workload (web request paths, transactional code).
- TPC-H/TPC-DS benchmarks are exec-bound so the overhead is amortized — production mixed workloads are not.

**Possible mitigation directions** (no commitment yet):
- [ ] **Query plan cache / shape-keyed memoization** — cache the optimized DXL plan keyed by a normalized query shape + statistics generation number. Most OLTP traffic is small numbers of distinct query shapes; reusing planning would cut amortized cost dramatically. Requires invalidation on stats / DDL changes (hook into PG's `invalidate_*_callbacks`).
- [ ] **Heuristic fast-path bypass** — let `pg_orca` short-circuit to PG planner for queries below a complexity threshold (single table, no joins, no aggregates, equality predicates on indexed columns). Add GUC `pg_orca.fast_path_threshold` (e.g. join_count). Cheapest win.
- [ ] **Profile ORCA's search loop**. The 169 ms / 192 ms numbers suggest the cost is in Memo group exploration + transformation rules. Identify the dominant xforms with a flamegraph; some may be disable-able for simple shapes via existing `optimizer.*` GUCs.
- [ ] **`optimizer_join_order_threshold` / `optimizer_search_limit`** — already exist in ORCA but not exposed as PG-level GUCs in pg_orca. Wire them through and measure.

### 2. Existing items

- [ ] Partitioned tables — translation layer for `PartitionedTable` scans / dynamic partition pruning (DXL has `PartitionSelector` but stubs in `compat/cdb/`).

## References

- TPC-H three-way analysis: `test/bench/results/tpch_sf10_three_way_summary.md`
- par=2 gap + engineering path: `test/bench/results/tpch_sf10_par2_summary.md`
- par=0 baseline: `test/bench/results/tpch_sf10_summary.md`
- TPC-DS Q31 stats propagation win: documented inline in 2026-05-19 conversation transcript