# D2: Drop Stale Rebind Caching for Correlated Index Scans Commit: `437f7fe cost: drop stale rebind caching for correlated index scans` ## Problem Inner stats for a correlated subtree (e.g. `IndexGet(l2 on l_orderkey = outer.col)`) used to set `NumRebinds = outer_rows` at stats-derive time. The outer cardinality was baked into the group-shared inner stats. After join reorder placed the inner under a different (smaller) outer, the cached rebinds became stale, and `CostIndexNLJoin` multiplied per-probe cost by the original (large) outer row count. ### Concrete instance — Q21 SF=1 `SemiJoin(supplier ⋈ nation ⋈ l1, l2)` had: - Actual outer rows (group 724): **96,010** - Cached inner `NumRebinds` (group 56, l1_filtered): **2,400,253** - LeftSemiIndexNLJoin cost: 31,125 — **6.5× more expensive than HashSemi (4,747)** The optimizer therefore picked Hash Semi Join + full table scan of l2 (6M rows) instead of NL + Index probe on `idx_lineitem_orderkey`. PG's planner picked the NL+Index path and was 6.4× faster. ## Fix Three coordinated changes: 1. **`CJoinStatsProcessor::DeriveStatsWithOuterRefs`** — no longer calls `SetRebinds(num_rows_outer)`. Inner stats keep rebinds at the default 1. `Rows()` represents per-probe output; the actual NL outer cardinality is supplied at cost time. 2. **`CostIndexNLJoin`** — extends rebind-compensation block to all IndexNL variants (Inner / LeftOuter / LeftSemi / LeftAntiSemi). Previously only `InnerIndexNLJoin` compensated; the others assumed cached inner rebinds equaled outer rows, which is no longer true. 3. **`CostIndexScan`** — switches NL-inner detection in the warm-cache OLS path and the heap-fetch term from `pci->NumRebinds() > 1` to `exprhdl.HasOuterRefs()`. Cached rebinds are now always 1, so the rebinds check would otherwise classify every correlated probe as a cold standalone scan and add ~100× heap-fetch cost. ## Benchmark Methodology - Hardware: macOS 14.5, ARM64, 16 GB RAM - PostgreSQL 18 (single-node), warm cache - `max_parallel_workers_per_gather = 0` - `optimizer_index_join_allowed_risk_threshold = 3` (default) - Median of 3 runs per query (warm-up discarded) - Fresh `ANALYZE` before each scale-factor run ## Results | SF | Baseline (no D2) | D2 (this fix) | Δ vs baseline | PG Native | ORCA/PG | |---|---|---|---|---|---| | **SF=1** | 68,933 ms | **58,510 ms** | **−15.1%** | 42,429 | 1.38× | | **SF=2** | 188,816 ms | **167,583 ms** | **−11.2%** | 246,139 | **0.68×** | | **SF=5** | 712,810 ms | **627,078 ms** | **−12.0%** | 382,880 | 1.64× | SF=2 PG total is inflated by PG's own poor plan choices on Q5 (62 s) and Q7 (46 s); ORCA beats PG on those. ## Per-Query Highlights (SF=5) ### Significant improvements (>20%) | Query | Baseline | D2 | Δ | |---|---|---|---| | Q19 | 5,414 ms | **653 ms** | **−87.9%** | | Q17 | 3,211 ms | **423 ms** | **−86.8%** | | Q3 | 41,519 ms | **15,546 ms** | **−62.6%** | | Q2 | 10,420 ms | **4,911 ms** | **−52.9%** | | Q22 | 8,102 ms | **5,328 ms** | **−34.2%** | | Q11 | 2,772 ms | **1,873 ms** | **−32.4%** | | Q21 | 173,058 ms | **130,490 ms** | **−24.6%** | | Q7 | 16,761 ms | **12,863 ms** | **−23.3%** | ### Regressions | Query | Baseline | D2 | Δ | |---|---|---|---| | Q12 | 12,641 ms | 17,778 ms | **+40.6%** | | Q10 | 17,748 ms | 21,209 ms | +19.5% | | Q9 | 156,425 ms | 162,178 ms | +3.7% | Q12 is the only consistent regression across SF=2/5 (SF=1 it improves). Worth a follow-up investigation. ## Q21 across all SF | SF | Baseline | D2 | Improvement | |---|---|---|---| | SF=1 | 10,225 ms | **5,472 ms** | −46.5% | | SF=2 | 33,866 ms | **19,494 ms** | −42.4% | | SF=5 | 173,058 ms | **130,490 ms** | −24.6% | ## Threshold Sensitivity Tested `optimizer_index_join_allowed_risk_threshold = 10` on SF=5 — total time went from 641 s to 1,096 s (+71%). Q7 alone regressed from 14 s to 242 s. The default value 3 is correct for the general case; users can opt into 10 at the session level for plans that match Q21's shape. ## Files Changed - `libgpdbcost/src/CCostModelGPDB.cpp` — `CostIndexNLJoin` and `CostIndexScan` - `libnaucrates/src/statistics/CJoinStatsProcessor.cpp` — `DeriveStatsWithOuterRefs` Total +20 lines, −10 lines.