# pgmnemo Scientific-Technical Release Process **Effective:** 2026-05-10 **Authority:** Founder directive, formalized by engineering + research working group **Applies to:** Every minor and major release (v0.X.Y where X or Y increments) --- ## 1. Mandate Every release of pgmnemo must be backed by: 1. Full benchmark reports on **all mandatory benchmarks** (§2) 2. Working Group (WG) review and sign-off (§4) 3. Statistical significance analysis on every claimed improvement (§3) 4. A written decision document: Ship or Hold (§5) **No version tag is cut until all four are complete.** --- ## 2. Benchmark Mandate ### 2.1 Required Benchmarks (every minor + major release) | Benchmark | Dataset | Metric focus | Notes | |-----------|---------|--------------|-------| | LoCoMo | `snap-research/locomo` (locomo10.json, pinned SHA) | recall@5/10/25/50, MRR, per-category | Run full 1982+ questions, not sampled | | LongMemEval-S | `xiaowu0162/longmemeval-cleaned` (longmemeval_s_cleaned.json, pinned SHA) | recall@1/5/10/20, MRR, per-qtype | n≥500 | ### 2.2 Additional Benchmarks (when applicable) | Benchmark | Trigger | |-----------|---------| | HippoRAG | If graph-based retrieval changes | | MemoryBank | If episodic memory architecture changes | | Custom domain eval | If new vertical-specific feature is added | ### 2.3 Dataset Pinning - Every run records `dataset_sha256` of the raw file - If upstream dataset changes between versions, the deviation is documented in the Methodology Changes section of the release notes --- ## 3. 
Statistical Reporting Requirements

### 3.1 All Metrics Reported

Every benchmark run reports **all** of the following metrics — no cherry-picking:

- recall@1, recall@5, recall@10, recall@20 (LongMemEval); recall@5, recall@10, recall@25, recall@50 (LoCoMo)
- MRR (Mean Reciprocal Rank)
- NDCG (Normalized Discounted Cumulative Gain) — where ground-truth supports it
- Per-category / per-qtype breakdowns

### 3.2 Confidence Intervals

- **95% Wilson confidence intervals** on all proportion metrics
- Reported as `[lo, hi]` alongside the point estimate
- Never report a point estimate without a CI

### 3.3 Pairwise Significance Tests

For every metric, compare the current version against the immediately previous version:

- **Two-proportion z-test** (for recall@k, which are proportions), where `k1, k2` are success counts and `n1, n2` are question counts:

  ```
  p_pool = (k1 + k2) / (n1 + n2)
  SE = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
  z = (p2 - p1) / SE
  p_two_tailed = 2 * (1 - Phi(|z|))
  ```

- **Paired t-test or Wilcoxon signed-rank test** for MRR (continuous values), if per-query scores are available
- Report: `delta`, `z`, `p_raw`, `p_corrected`, `significant (yes/no)`

### 3.4 Multiple Comparisons Correction

- Apply the **Holm-Bonferroni correction** across all pairwise tests in a single report
- Report both `p_raw` and `p_corrected`
- A result is "significant" only if `p_corrected < 0.05`

### 3.5 Effect Size

- **Cohen's h** for proportion differences:

  ```
  h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))
  ```

  Interpretation: |h| < 0.2 small, 0.2–0.5 medium, > 0.5 large
- Report alongside every z-test result

### 3.6 Tooling

Use `scripts/significance_test.py` for all statistical computations. Input: two `metrics.json` files. Output: full comparison table with CIs, z-scores, p-values, Holm corrections, and Cohen's h.

---

## 4. 
Working Group (WG) Review Gate ### 4.1 WG Composition | Role | Responsibility | |------|---------------| | PI (Principal Investigator) | Final ship/hold authority | | Chief Architect | Validity of implementation and methodology | | StatAnalyst | Independent re-derivation of all statistical claims | | ResSup (Research Supervisor) | Threat-to-validity assessment, benchmark integrity | ### 4.2 Review Process 1. Author produces draft benchmark report using `benchmarks/REPORT_TEMPLATE.md` 2. `scripts/significance_test.py` run against current vs. previous `metrics.json` — output appended to report 3. Draft circulated to WG at least **48 hours** before proposed tag date 4. StatAnalyst independently re-derives key statistics (different seed if simulation involved) 5. Each WG member signs off in the report's **WG Sign-off** section 6. **All four signatures required** before tag is cut ### 4.3 Quorum Exception If one WG member is unavailable, PI may grant a 3-of-4 quorum exception, documented in the report. --- ## 5. Decision Matrix: Ship vs. 
Hold | Primary metric (recall@10) | Secondary metric (MRR) | Decision | Rationale | |---------------------------|----------------------|----------|-----------| | Significant improvement (p_corr < 0.05) | Significant improvement | **SHIP** | Clear win | | Significant improvement | Non-significant / neutral | **SHIP with caveat** | Lead metric improved; note MRR stability | | Non-significant | Significant improvement | **CONDITIONAL SHIP** | Must assess whether MRR alone justifies claim; see §5.1 | | Non-significant | Non-significant | **HOLD or SHIP as no-claim** | Ship only if no performance claims made; document as "neutral" | | Any metric regressed significantly | — | **HOLD** | Regression must be resolved or explicitly accepted with rationale | ### 5.1 Conditional Ship Criteria A feature may ship with "Conditional" status when primary recall metric is non-significant but secondary metrics are significant, if ALL of the following hold: 1. No primary metric regressed significantly 2. The feature has a strong theoretical justification for MRR gain without recall gain 3. The release notes accurately reflect which metrics are/are not significant 4. WG unanimous agreement (no quorum exception) ### 5.2 Prohibited Claims - Never claim improvement on a metric where `p_corrected ≥ 0.05` - Never report only the best-performing metric subset - "~X pp improvement" claims require citing the specific metric, CI, and p-value --- ## 6. Public Release Notes Structure Each release's public-facing notes must have these exact sections: ### Significant Improvements Only metrics where `p_corrected < 0.05` after Holm-Bonferroni correction. Format: `metric: +Xpp (95% CI [lo, hi], p=Y, h=Z)` ### Marginal / Non-Significant Changes Metrics that changed within statistical noise. Format: `metric: +Xpp (95% CI, p=Y ns)` ### Regressions Any metric that worsened, whether significant or not. 
Format: `metric: -Xpp (95% CI, p=Y)` ### Methodology Changes Any deviation from the previous run's methodology (embedder, dataset version, retrieval formula, etc.). ### Benchmark Integrity Dataset SHA256s, run environment, wall-clock time, device. --- ## 7. Versioning of This Process This document is versioned alongside pgmnemo. Breaking changes to the process require: - Founder approval - A new section in this document dated and initialed - Retroactive tagging of any releases that used the prior process --- ## Appendix A: Process Checklist (per release) ``` [ ] All mandatory benchmarks run on final code [ ] metrics.json files produced with pinned dataset SHA256 [ ] significance_test.py run: current vs. previous metrics.json [ ] Draft report written using REPORT_TEMPLATE.md [ ] Draft shared with WG ≥48h before tag [ ] StatAnalyst independent re-derivation complete [ ] All 4 WG signatures collected (or documented quorum exception) [ ] Decision matrix applied, ship/hold documented [ ] Public release notes follow §6 structure [ ] No prohibited claims (§5.2) in any public-facing text [ ] Git tag cut only after all above ```
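The dataset pinning step in §2.3 amounts to hashing the raw benchmark file before every run and recording the digest in `metrics.json`. A minimal sketch of that computation (the function name `dataset_sha256` is illustrative, not an existing repo helper):

```python
import hashlib

def dataset_sha256(path, chunk_size=1 << 20):
    # Stream the raw benchmark file in 1 MiB chunks so large
    # datasets are never fully loaded into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the file on disk (rather than the parsed JSON) is what makes the pin comparable across versions: any upstream byte-level change, including whitespace, shows up as a new digest and must be noted under Methodology Changes.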
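The computations mandated in §3 are all closed-form. The canonical implementation is `scripts/significance_test.py`; the sketch below only illustrates the underlying formulas (Wilson interval, pooled two-proportion z-test, Holm-Bonferroni step-down, Cohen's h), and its function names are not part of the repo:

```python
import math

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def wilson_ci(k, n, z=1.96):
    # 95% Wilson score interval for a proportion k/n (section 3.2).
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def two_proportion_z(k1, n1, k2, n2):
    # Pooled two-proportion z-test (section 3.3).
    # Returns (delta, z, two-tailed raw p-value).
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return p2 - p1, z, 2 * (1 - phi(abs(z)))

def cohens_h(p1, p2):
    # Effect size for a difference of proportions (section 3.5).
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

def holm_bonferroni(p_raws):
    # Holm step-down correction (section 3.4): sort raw p-values
    # ascending, multiply the i-th smallest by (m - i), and enforce
    # monotonicity. Returns corrected p-values in the input order.
    m = len(p_raws)
    order = sorted(range(m), key=lambda i: p_raws[i])
    corrected = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_raws[i])
        corrected[i] = min(1.0, running_max)
    return corrected
```

For example, going from recall@10 of 50/100 to 60/100 yields delta = +10 pp but z ≈ 1.42 with a raw two-tailed p ≈ 0.16, i.e. not significant even before Holm correction — exactly the situation the §5 matrix routes to "HOLD or SHIP as no-claim".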
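The §5 decision matrix is deterministic given three booleans derived from the Holm-corrected results, so it can be encoded directly. A sketch, assuming inputs of the form "improved with `p_corrected < 0.05`" (the names here are illustrative, not part of the repo; the §5.1 conditional-ship criteria still require human WG judgment and are not captured below):

```python
from enum import Enum

class Decision(Enum):
    SHIP = "SHIP"
    SHIP_WITH_CAVEAT = "SHIP with caveat"
    CONDITIONAL_SHIP = "CONDITIONAL SHIP"
    HOLD_OR_NO_CLAIM = "HOLD or SHIP as no-claim"
    HOLD = "HOLD"

def decide(primary_improved_sig, secondary_improved_sig, any_regression_sig):
    # Encodes the section 5 matrix. Regressions dominate every other
    # outcome, matching the final row of the table.
    if any_regression_sig:
        return Decision.HOLD
    if primary_improved_sig and secondary_improved_sig:
        return Decision.SHIP
    if primary_improved_sig:
        return Decision.SHIP_WITH_CAVEAT
    if secondary_improved_sig:
        return Decision.CONDITIONAL_SHIP
    return Decision.HOLD_OR_NO_CLAIM
```

Keeping the matrix in code means the draft report can pre-fill the recommended decision, with the WG confirming or overriding it in the sign-off section.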