# pgmnemo Scientific-Technical Release Process **Effective:** 2026-05-10 **Authority:** Founder directive, formalized by engineering + research working group **Applies to:** Every minor and major release (v0.X.Y where X or Y increments) --- ## 1. Mandate Every release of pgmnemo must be backed by: 1. Full benchmark reports on **all mandatory benchmarks** (§2) 2. Working Group (WG) review and sign-off (§4) 3. Statistical significance analysis on every claimed improvement (§3) 4. A written decision document: Ship or Hold (§5) **No version tag is cut until all four are complete.** --- ## 2. Benchmark Mandate ### 2.1 Required Benchmarks (every minor + major release) | Benchmark | Dataset | Metric focus | Notes | |-----------|---------|--------------|-------| | LoCoMo | `snap-research/locomo` (locomo10.json, pinned SHA) | recall@5/10/25/50, MRR, per-category | Run full 1982+ questions, not sampled | | LongMemEval-S | `xiaowu0162/longmemeval-cleaned` (longmemeval_s_cleaned.json, pinned SHA) | recall@1/5/10/20, MRR, per-qtype | n≥500 | ### 2.2 Additional Benchmarks (when applicable) | Benchmark | Trigger | |-----------|---------| | HippoRAG | If graph-based retrieval changes | | MemoryBank | If episodic memory architecture changes | | Custom domain eval | If new vertical-specific feature is added | ### 2.3 Dataset Pinning - Every run records `dataset_sha256` of the raw file - If upstream dataset changes between versions, the deviation is documented in the Methodology Changes section of the release notes --- ## 3. 
Statistical Reporting Requirements

### 3.1 All Metrics Reported

Every benchmark run reports **all** of the following metrics — no cherry-picking:

- recall@1, recall@5, recall@10, recall@20 (LongMemEval); recall@5, recall@10, recall@25, recall@50 (LoCoMo)
- MRR (Mean Reciprocal Rank)
- NDCG (Normalized Discounted Cumulative Gain) — where ground-truth supports it
- Per-category / per-qtype breakdowns

### 3.2 Confidence Intervals

- **95% Wilson confidence intervals** on all proportion metrics
- Reported as `[lo, hi]` alongside the point estimate
- Never report a point estimate without a CI

### 3.3 Pairwise Significance Tests

For every metric, compare the current version against the immediately previous version:

- **Two-proportion z-test** (for recall@k, which are proportions), where `k1, k2` are success counts and `n1, n2` are question counts:

  ```
  p_pool = (k1 + k2) / (n1 + n2)
  SE = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
  z = (p2 - p1) / SE
  p_two_tailed = 2 * (1 - Phi(|z|))
  ```

- **Paired t-test or Wilcoxon signed-rank test** for MRR (continuous values), if per-query scores are available
- Report: `delta`, `z`, `p_raw`, `p_corrected`, `significant (yes/no)`

### 3.4 Multiple Comparisons Correction

- Apply the **Holm-Bonferroni correction** across all pairwise tests in a single report
- Report both `p_raw` and `p_corrected`
- A result is "significant" only if `p_corrected < 0.05`

### 3.5 Effect Size

- **Cohen's h** for proportion differences:

  ```
  h = 2 * arcsin(sqrt(p2)) - 2 * arcsin(sqrt(p1))
  ```

  Interpretation: |h| < 0.2 small, 0.2–0.5 medium, > 0.5 large
- Report alongside every z-test result

### 3.6 Tooling

Use `scripts/significance_test.py` for all statistical computations. Input: two `metrics.json` files. Output: full comparison table with CIs, z-scores, p-values, Holm corrections, and Cohen's h.

---

## 4. 
Working Group (WG) Review Gate ### 4.1 WG Composition | Role | Responsibility | |------|---------------| | PI (Principal Investigator) | Final ship/hold authority | | Chief Architect | Validity of implementation and methodology | | StatAnalyst | Independent re-derivation of all statistical claims | | ResSup (Research Supervisor) | Threat-to-validity assessment, benchmark integrity | ### 4.2 Review Process 1. Author produces draft benchmark report using `benchmarks/REPORT_TEMPLATE.md` 2. `scripts/significance_test.py` run against current vs. previous `metrics.json` — output appended to report 3. Draft circulated to WG at least **48 hours** before proposed tag date 4. StatAnalyst independently re-derives key statistics (different seed if simulation involved) 5. Each WG member signs off in the report's **WG Sign-off** section 6. **All four signatures required** before tag is cut ### 4.3 Quorum Exception If one WG member is unavailable, PI may grant a 3-of-4 quorum exception, documented in the report. --- ## 5. Decision Matrix: Ship vs. 
Hold | Primary metric (recall@10) | Secondary metric (MRR) | Decision | Rationale | |---------------------------|----------------------|----------|-----------| | Significant improvement (p_corr < 0.05) | Significant improvement | **SHIP** | Clear win | | Significant improvement | Non-significant / neutral | **SHIP with caveat** | Lead metric improved; note MRR stability | | Non-significant | Significant improvement | **CONDITIONAL SHIP** | Must assess whether MRR alone justifies claim; see §5.1 | | Non-significant | Non-significant | **HOLD or SHIP as no-claim** | Ship only if no performance claims made; document as "neutral" | | Any metric regressed significantly | — | **HOLD** | Regression must be resolved or explicitly accepted with rationale | ### 5.1 Conditional Ship Criteria A feature may ship with "Conditional" status when primary recall metric is non-significant but secondary metrics are significant, if ALL of the following hold: 1. No primary metric regressed significantly 2. The feature has a strong theoretical justification for MRR gain without recall gain 3. The release notes accurately reflect which metrics are/are not significant 4. WG unanimous agreement (no quorum exception) ### 5.2 Prohibited Claims - Never claim improvement on a metric where `p_corrected ≥ 0.05` - Never report only the best-performing metric subset - "~X pp improvement" claims require citing the specific metric, CI, and p-value --- ## 6. Public Release Notes Structure Each release's public-facing notes must have these exact sections: ### Significant Improvements Only metrics where `p_corrected < 0.05` after Holm-Bonferroni correction. Format: `metric: +Xpp (95% CI [lo, hi], p=Y, h=Z)` ### Marginal / Non-Significant Changes Metrics that changed within statistical noise. Format: `metric: +Xpp (95% CI, p=Y ns)` ### Regressions Any metric that worsened, whether significant or not. 
Format: `metric: -Xpp (95% CI, p=Y)` ### Methodology Changes Any deviation from the previous run's methodology (embedder, dataset version, retrieval formula, etc.). ### Benchmark Integrity Dataset SHA256s, run environment, wall-clock time, device. --- ## 7. Versioning of This Process This document is versioned alongside pgmnemo. Breaking changes to the process require: - Founder approval - A new section in this document dated and initialed - Retroactive tagging of any releases that used the prior process --- ## Appendix A: Process Checklist (per release) ``` [ ] All mandatory benchmarks run on final code [ ] metrics.json files produced with pinned dataset SHA256 [ ] significance_test.py run: current vs. previous metrics.json [ ] Draft report written using REPORT_TEMPLATE.md [ ] Draft shared with WG ≥48h before tag [ ] StatAnalyst independent re-derivation complete [ ] All 4 WG signatures collected (or documented quorum exception) [ ] Decision matrix applied, ship/hold documented [ ] Public release notes follow §6 structure [ ] No prohibited claims (§5.2) in any public-facing text [ ] Git tag cut only after all above ```
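The dataset pinning step in §2.3 amounts to hashing the raw benchmark file before every run and recording the digest in `metrics.json`. A minimal sketch of that computation (the function name `dataset_sha256` is illustrative, not an existing repo helper):

```python
import hashlib

def dataset_sha256(path, chunk_size=1 << 20):
    # Stream the raw benchmark file in 1 MiB chunks so large
    # datasets are never fully loaded into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the file on disk (rather than the parsed JSON) is what makes the pin comparable across versions: any upstream byte-level change, including whitespace, shows up as a new digest and must be noted under Methodology Changes.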
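The computations mandated in §3 are all closed-form. The canonical implementation is `scripts/significance_test.py`; the sketch below only illustrates the underlying formulas (Wilson interval, pooled two-proportion z-test, Holm-Bonferroni step-down, Cohen's h), and its function names are not part of the repo:

```python
import math

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def wilson_ci(k, n, z=1.96):
    # 95% Wilson score interval for a proportion k/n (section 3.2).
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def two_proportion_z(k1, n1, k2, n2):
    # Pooled two-proportion z-test (section 3.3).
    # Returns (delta, z, two-tailed raw p-value).
    p1, p2 = k1 / n1, k2 / n2
    p_pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return p2 - p1, z, 2 * (1 - phi(abs(z)))

def cohens_h(p1, p2):
    # Effect size for a difference of proportions (section 3.5).
    return 2 * math.asin(math.sqrt(p2)) - 2 * math.asin(math.sqrt(p1))

def holm_bonferroni(p_raws):
    # Holm step-down correction (section 3.4): sort raw p-values
    # ascending, multiply the i-th smallest by (m - i), and enforce
    # monotonicity. Returns corrected p-values in the input order.
    m = len(p_raws)
    order = sorted(range(m), key=lambda i: p_raws[i])
    corrected = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_raws[i])
        corrected[i] = min(1.0, running_max)
    return corrected
```

For example, going from recall@10 of 50/100 to 60/100 yields delta = +10 pp but z ≈ 1.42 with a raw two-tailed p ≈ 0.16, i.e. not significant even before Holm correction — exactly the situation the §5 matrix routes to "HOLD or SHIP as no-claim".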
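The §5 decision matrix is deterministic given three booleans derived from the Holm-corrected results, so it can be encoded directly. A sketch, assuming inputs of the form "improved with `p_corrected < 0.05`" (the names here are illustrative, not part of the repo; the §5.1 conditional-ship criteria still require human WG judgment and are not captured below):

```python
from enum import Enum

class Decision(Enum):
    SHIP = "SHIP"
    SHIP_WITH_CAVEAT = "SHIP with caveat"
    CONDITIONAL_SHIP = "CONDITIONAL SHIP"
    HOLD_OR_NO_CLAIM = "HOLD or SHIP as no-claim"
    HOLD = "HOLD"

def decide(primary_improved_sig, secondary_improved_sig, any_regression_sig):
    # Encodes the section 5 matrix. Regressions dominate every other
    # outcome, matching the final row of the table.
    if any_regression_sig:
        return Decision.HOLD
    if primary_improved_sig and secondary_improved_sig:
        return Decision.SHIP
    if primary_improved_sig:
        return Decision.SHIP_WITH_CAVEAT
    if secondary_improved_sig:
        return Decision.CONDITIONAL_SHIP
    return Decision.HOLD_OR_NO_CLAIM
```

Keeping the matrix in code means the draft report can pre-fill the recommended decision, with the WG confirming or overriding it in the sign-off section.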