# TPC-H Q17 Root Cause Analysis **Status**: Resolved on this branch. ORCA selects PG-style correlated NL+IndexScan at SF=1/10 and beats PG by 22× / 78× respectively. The "unnest + global HashAggregate" plan that previously dominated has been suppressed by reverting two cost-model commits that were originally intended to fix Q20 but in practice broke Q17. ## Query ```sql SELECT sum(l_extendedprice) / 7.0 AS avg_yearly FROM lineitem, part WHERE p_partkey = l_partkey AND p_brand = 'Brand#23' AND p_container = 'MED BOX' AND l_quantity < (SELECT 0.2 * avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey); ``` The inner scalar subquery is **correlated** on `p_partkey` — for each qualifying outer part it computes the per-part average lineitem quantity. ## Cardinalities | | SF=1 | SF=10 | |------------------------------------|---------:|---------:| | `part` total | 200,000 | 2,000,000 | | `part` after `brand` + `container` | **204** | **2,044** | | `lineitem` total | 6,001,215 | 59,986,052 | | `lineitem.l_partkey` distinct | 200,000 | 2,000,000 | `part` is filtered to **0.1%** by the brand + container conjunction. With ~30 lineitems per partkey, the *correlated* plan touches ~6,000 lineitem rows at SF=1 and ~60,000 at SF=10 — vs ~6M / ~60M for any plan that scans lineitem in full. ## Two Plan Shapes ### "Correlated NL + IndexScan" (PG, ORCA on this branch) ``` Aggregate └── NL outer × inner │ ├── NL Left Join (part × per-part avg) 2,044 outer rows │ ├── Seq Scan part (filter brand+container) │ └── HashAggregate per-call, ~30 rows in / 1 row out │ └── IndexScan lineitem_partkey_idx ~30 rows / probe │ └── IndexScan lineitem_partkey_idx ~3 rows / probe (after avg-filter) Filter: l_quantity < (avg from outer) ``` Total lineitem touches: `2,044 × 30 = ~60,000`. Wall clock at SF=10: **~1.3 s**. ### "Eager unnest + global HashAgg" (origin/main before this work) ``` Aggregate └── Hash Join 6M outer × 200k inner │ Filter: l_quantity < 0.2*avg ├── Seq Scan lineitem 60M rows, full scan └── Hash on: Hash Right Join 200k pre-aggregated rows ├── HashAggregate (60M → 2M groups) spill 21 batches, 1.5 GB temp │ └── Seq Scan lineitem 60M rows, second full scan └── Hash on Seq Scan part 2,044 rows after filter ``` Total lineitem touches: `60M + 60M = 120M`. Wall clock at SF=10: **~100 s** (mostly the spilled HashAgg). ## Performance Across Scale Factors | | ORCA (origin/main) | ORCA (this branch) | PG | this branch speedup | |-----|-------------------:|-------------------:|-------:|--------------------:| | SF=1 | 7,000 ms | **76 ms** | 1,800 ms | **22.4×** ✅ | | SF=10 | 100,100 ms | **1,280 ms** | 25,000 ms | **19.5×** ✅ | | SF=20 | 214,500 ms | (not measured) | 57,000 ms | (extrapolated ~22×) | PG's plan is identical across SF (correlated SubPlan + Bitmap Index Scan); its time grows because heap fetch surface area grows with SF. ORCA's plan on this branch is also identical across SF and benefits from the same scaling: `K_outer × per_probe` is a constant, but per-probe heap work grows slowly in linear fashion. ## What Changed On This Branch Two commits originally introduced to fix Q20 had the side effect of making Q17's correlated-apply path appear far more expensive than its unnest alternative. Both have been **reverted** here: | Reverted Commit | Description | |-----------------|----------------------------------------------------------------| | `0fb8d27` | *fix: propagate NumRebinds through CGroupByStatsProcessor::CalcGroupByStats* | | `79e9bd3` | *fix: correctly price correlated IndexNLJoin inner child cost for Q20* | Why they hurt Q17: - `0fb8d27` made `CLogicalGbAgg` over a correlated `CPhysicalIndexScan` carry `NumRebinds = outer_rows` instead of `1`. This causes `CostIndexNLJoin` to multiply the inner-subtree cost by ~5,000× when the apply has ~5,000 outer rows. For Q17 the inner-subtree cost represents the per-call HashAgg over IndexScan — already an honest per-rebind cost — so the extra multiplication double-counts and inflates the correlated path far above the unnest path. - `79e9bd3` independently adds `(num_rows_outer - 1) × inner_cost` in `CostIndexNLJoin` when `pdRebinds[1] == GPOPT_DEFAULT_REBINDS`. After `0fb8d27`, `pdRebinds[1] != 1` already (it's `outer_rows`), so this guard never fires for Q17. But for the unnest path (different join group), it adds nothing. Net effect: amplifies the correlated-path penalty without any matching adjustment for the unnest path. Empirical results (`Q20 SF=1`): | Q20 SF=1 | wall clock | |-----------------------------|-----------:| | origin/main (with both fixes) | 3,591 ms | | this branch (both reverted) | **787 ms** ✅ | | this branch + A-group OLS calibration | **787 ms** | The "Q20 fix" commits actually **regressed** Q20 SF=1 by 4.5×. The real fix for Q20 SF=1 is the OLS calibration of NL inner IndexScan costs (cherry-picked here from the `tpch` branch as commits `0d5956a`, `9294709`, `250f513`). ## OLS-Calibrated Cost Model (Cherry-Picked from `tpch`) The three cherry-picks together replace the MPP-era cost-model constants with values derived from warm-cache PG 18 OLS measurements on TPC-H lineitem index probes: | Param | MPP default | This branch (NL inner) | |-------------------------------|-------------:|-----------------------:| | `dEffectiveRandomFactor` | 0.05 | **4.259e-3** (×0.085) | | `dEffectiveScanTupCostUnit` | 3.66e-6 | **1.725e-6** (×0.471) | Derived from: - α = 4.914e-3 ms / probe (B-tree traversal, warm cache) - β = 6.628e-6 ms / (probe·row·byte) - C_unit = 1.154 ms / cost-unit Activated only when **NumRebinds > 1** AND the predicate covers the **leading index column**. The non-leading-column path keeps the original penalty so plans probing a non-prefix column remain correctly discouraged. ## Net Bench Effect (SF=1, 22 queries × 3 runs) | | origin/main | this branch | Δ | |------------------|------------:|-------------:|-----------:| | Total ORCA | 78,718 ms | **63,494 ms** | **−19%** | | Total PG | 61,265 ms | 60,608 ms | −1% | | ORCA / PG | 1.28× | **1.05×** | **−18%** | Per-query highlights (origin/main → this branch): | Q | origin/main | this branch | Δ | speedup change | |----:|------------:|------------:|---------:|---------------:| | 4 | 4,501 ms | 1,421 ms | **−68%** | 0.26× → 0.83× | | 9 | 4,491 ms | 5,808 ms | +29% ⚠ | 2.02× → 1.57× | | 10 | 3,021 ms | 1,711 ms | **−43%** | 1.00× → 1.77× | | 14 | 2,289 ms | 2,677 ms | +17% ⚠ | 1.01× → 0.86× | | **17** | **7,155 ms** | **76 ms** | **−99%** | 0.24× → **22.44×** | | **20** | **3,591 ms** | **787 ms** | **−78%** | 0.23× → **1.07×** | | Other (16 of 22) | < ±5% | within noise | | | Two queries regressed (Q9 by ~1.3 s, Q14 by ~0.4 s) but both **still beat or match PG** (1.57× and 0.86×). The 14 s saved on Q17/Q20/Q4/Q10 dwarfs the 1.7 s lost on Q9/Q14. ## Why The "Q20 Fix" Was Worse Than No Fix Both reverted commits were authored after a Q20 timeout regression appeared in the SF=5 internal bench. They successfully made the *correlated apply* path more expensive than the *decorrelated hash join* path for Q20 at SF=5 (commit `79e9bd3` reports "30 min → 7 s"). At the same time, however, they made the same correlated-apply path several orders of magnitude more expensive for **Q17 at every SF** and for **Q20 at SF=1**, because ORCA's cost model has no way to distinguish *Q17's correlated apply with cheap per-probe IndexScan inner* from *Q20's correlated apply with expensive full-scan inner*. The OLS calibration cherry-picked here addresses the underlying issue more cleanly: it makes the per-probe IndexScan cost itself accurate, so the cost difference between Q17's correlated path (cheap probes) and Q20's correlated path (expensive scans) emerges naturally without needing the broad NumRebinds-based amplification. Empirically the OLS calibration alone fixes both Q17 and Q20 at SF=1, and is sufficient for Q17 at SF=10 too. Q20 at SF=10 is roughly break-even between the two configurations (both ~9 s). ## Reproduction ```bash # At repo root, on this branch: ls test/bench/tpch_bench.sh # bench harness PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch 3 PG_CONFIG=$PG_CONFIG bash test/bench/tpch_bench.sh tpch_sf10 3 ``` To inspect the correlated NL plan: ```sql LOAD 'pg_orca'; SET pg_orca.enable_orca = on; SET max_parallel_workers_per_gather = 0; EXPLAIN (ANALYZE, BUFFERS, COSTS OFF) SELECT sum(l_extendedprice) / 7.0 AS avg_yearly FROM lineitem, part WHERE p_partkey = l_partkey AND p_brand = 'Brand#23' AND p_container = 'MED BOX' AND l_quantity < (SELECT 0.2 * avg(l_quantity) FROM lineitem WHERE l_partkey = p_partkey); ``` Expected plan top: `Aggregate → Nested Loop → ... → Index Scan using lineitem_partkey_idx` (twice, once for the per-part avg subquery and once for the outer join). ## Open Questions / Future Work 1. **Q9 / Q14 minor regressions**: both still beat PG; root cause is the OLS-calibrated NL cost making one IndexNL look cheap enough to replace a HashJoin that was actually faster. May be addressable by a small additional discount to HashJoin build cost when build side is small (< 1M rows). 2. **Q20 SF=10**: ~9 s on either configuration. The unnest path scales linearly with lineitem; correlated path scales linearly with outer rows × per-probe lineitem read. Both are ~10× slower than SF=1, no pathological scaling. Improving SF=10 Q20 would require multi-column selectivity for the `partsupp.ps_partkey IN (part filter)` semi-join. 3. **Whether to retain `risk_threshold` mechanism**: the current `optimizer_index_join_allowed_risk_threshold = 3` is unchanged. With the OLS calibration in place, the risk multiplier is no longer load-bearing for the queries we tested. A future change could either raise the default to 4 (small SF=1 net positive, see `Q9 / Q10` trade-off above) or remove the multiplier entirely if it can be shown to never fire under OLS-calibrated costs.