# pgmnemo Benchmarks

**Status:** v0.2.1, first honest results, retrieval-only mode

This document summarizes our public benchmark results, our methodology, and an honest comparison against published baselines.

---

## TL;DR

| Benchmark | pgmnemo v0.2.1 | Notable comparison |
|---|---|---|
| **LoCoMo** retrieval (DRAGON, n=1982) | recall@10 = **0.795**, MRR = **0.548** (session-level) | within paper-class range (DRAGON canonical, session granularity) |
| **LongMemEval** retrieval (bge-m3, n=500, s_cleaned) | recall@10 = **0.933**, MRR = **0.855** | below the in-repo BM25 baseline (0.982) |

Reports, raw_retrievals, and reproduction commands:

- [`benchmarks/locomo/results/v0.2.1_20260509/`](../benchmarks/locomo/results/v0.2.1_20260509/)
- [`benchmarks/longmemeval/results/v0.2.1_20260509/`](../benchmarks/longmemeval/results/v0.2.1_20260509/) — BM25 baseline (`run_nollm.py`)
- [`benchmarks/longmemeval/results/v0.2.1_pgmnemo_20260509/`](../benchmarks/longmemeval/results/v0.2.1_pgmnemo_20260509/) — pgmnemo vector (`run_longmemeval_pgmnemo.py`)

---

## Methodology Conformance

### LoCoMo (Maharana et al., ACL 2024)

| Paper requirement | Our implementation | Status |
|---|---|---|
| Dataset | `snap-research/locomo10.json` (10 conversations, 1986 questions, 5 categories) | ✅ verbatim |
| Embedder | facebook/dragon-plus (context + query encoders) | ✅ paper canonical |
| Retrieval k | k ∈ {5, 10, 25, 50} | ✅ all reported |
| Metric (primary retrieval) | recall@K | ✅ |
| MRR (secondary) | yes | ✅ |
| LLM-as-judge accuracy (downstream eval) | n/a — retrieval-only mode | ⚠️ deferred |
| Storage dim | 768d (DRAGON native) | ⚠️ **DEVIATION**: pgmnemo enforces `vector(1024)`; we zero-pad 768→1024. Cosine similarity is preserved exactly (math-identical; see the sketch below this table and `ADDENDA/LOCOMO_EMBEDDER_PADDING.md`). |
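Why the zero-padding deviation is math-identical: appended zeros contribute nothing to the dot product or to either vector's norm, so exact cosine similarity is unchanged. A minimal self-check in NumPy (the random vectors below are illustrative stand-ins for real DRAGON embeddings):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Plain cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
q = rng.normal(size=768)  # stand-in for a DRAGON query embedding
d = rng.normal(size=768)  # stand-in for a DRAGON context embedding

# Zero-pad 768d -> 1024d, as done before insertion into pgmnemo's vector(1024) column.
q_pad = np.pad(q, (0, 1024 - 768))
d_pad = np.pad(d, (0, 1024 - 768))

# The appended zeros add nothing to the dot product or the norms,
# so similarity is identical up to floating-point noise.
assert np.isclose(cosine(q, d), cosine(q_pad, d_pad))
```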
### LongMemEval (Wu et al., ICLR 2025)

| Paper requirement | Our implementation | Status |
|---|---|---|
| Dataset | `xiaowu0162/longmemeval-cleaned` (`longmemeval_s_cleaned.json`, 500 questions × ~47.7 sessions/haystack) | ✅ |
| Embedder | NovaSearch/stella_en_1.5B_v5 (1024d) | ⚠️ **DEVIATION**: the bundled `modeling_qwen.py` is incompatible with transformers 5.8 (`Qwen2Config.rope_theta` AttributeError); we substituted **BAAI/bge-m3** (1024d, MTEB-strong, common in production). See `ADDENDA/LONGMEMEVAL_EMBEDDER_BGE_M3.md`. |
| Retrieval metrics (recall@K, NDCG@K, MRR) | recall@{1,5,10,20} + MRR | ⚠️ partial: NDCG@K not yet reported |
| Question types | 6 (single-session-{user,assistant,preference}, multi-session, temporal-reasoning, knowledge-update), plus the abstention variant | ✅ |
| LLM-as-judge accuracy via `evaluate_qa.py` | n/a — retrieval-only mode | ⚠️ deferred (no API key; the paper supports retrieval-only evaluation) |
| Session truncation | 500 chars per session (config bug, not a hardware limit) | ✅ **no significant impact**: the QUICK-C re-run (v0.2.1_pgmnemo_20260509) shows a recall@10 delta of 0.0008; addendum withdrawn. |

---

## Honest Findings

### 1. BM25 baseline outperforms pgmnemo vector on LongMemEval

```
recall@10: pgmnemo vector (bge-m3) = 0.933 | BM25 baseline = 0.982
recall@20: pgmnemo vector (bge-m3) = 0.977 | BM25 baseline = 0.996
```

Both runs use the same dataset (longmemeval_s_cleaned, n=500). BM25 wins.

Hypothesized causes (under WG investigation):

- LongMemEval questions have high keyword overlap with their relevant sessions — a BM25-friendly task
- pgmnemo's 5-component scoring may over-penalize short queries
- the bge-m3 substitution (vs the paper-canonical Stella V5) may explain part of the gap
- session truncation had near-zero impact (QUICK-C delta = 0.0008)

### 2. pgmnemo wins on certain question types

| Q-type | pgmnemo recall@10 | Notes |
|---|---|---|
| single-session-assistant | 0.982 | tied with BM25 |
| multi-session | 0.957 | strong vs BM25-only baselines |
| temporal-reasoning | 0.933 | competitive |
| knowledge-update | 0.923 | competitive |
| single-session-preference | 0.900 | competitive |
| single-session-user | 0.871 | weakest |

### 3. LoCoMo recall@10 = 0.366 (default turn-level run) is below paper-reported retrievers

The session-level configuration reported in the TL;DR reaches 0.795; the default turn-level run does not. Likely causes (under WG investigation):

- we index turn-level segments; the paper may use session-level retrieval
- the 5-component scoring weights need calibration on this dataset
- DRAGON 768d zero-padded to 1024d may have second-order HNSW effects (theoretically none, but worth verifying)

---

## Reproducibility

```bash
# Environment setup (3 commands); run from the pgmnemo repo root.
# The bind mount exposes the source tree at /tmp/pgmnemo inside the container.
docker run -d --name pgmnemo-bench -p 15432:5432 \
  -v "$PWD":/tmp/pgmnemo \
  -e POSTGRES_PASSWORD=bench -e POSTGRES_USER=bench -e POSTGRES_DB=bench \
  pgvector/pgvector:pg17

docker exec pgmnemo-bench bash -c "apt-get update -qq && \
  apt-get install -y -qq postgresql-server-dev-17 build-essential && \
  cd /tmp/pgmnemo && make && make install"

docker exec pgmnemo-bench psql -U bench -d bench -c "CREATE EXTENSION pgmnemo CASCADE;"

# LoCoMo (DRAGON, ~2 min on Apple Silicon MPS)
python benchmarks/scripts/run_locomo_bench.py

# LongMemEval (bge-m3, ~16 min on Apple Silicon MPS)
python benchmarks/scripts/run_longmemeval_pgmnemo.py
```

Hardware used for the published numbers:

- Apple M-series silicon (MPS GPU acceleration)
- Python 3.11.14, torch 2.11, transformers 5.8, sentence-transformers 5.4
- Wall clock: LoCoMo 111 s; LongMemEval 944 s

---

## What's Next (WG in progress)

WG goals:

1. Investigate why BM25 beats vector retrieval on LongMemEval
2. Identify scoring-formula tuning paths to close the gap
3. Reproduce the paper-canonical Stella V5 (transformers downgrade or an API-compat shim)
4. Compare against MAGMA (arXiv 2601.03236), Mem0, Zep, and HippoRAG on the same benchmarks
5. Roadmap: pgmnemo v0.2.2 (calibration) → v0.3.0 (multi-graph + dim-flex)

---

## References

- Maharana, A. et al. (2024). "Evaluating Very Long-Term Conversational Memory of LLM Agents." ACL 2024. [arXiv:2402.17753](https://arxiv.org/abs/2402.17753)
- Wu, D. et al. (2025). "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory." ICLR 2025. [arXiv:2410.10813](https://arxiv.org/abs/2410.10813)
- Lin, S.-C. et al. (2023). "How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval." [HF facebook/dragon-plus](https://huggingface.co/facebook/dragon-plus-context-encoder)
- BAAI/bge-m3: multilingual, MTEB-strong embedder (1024d). [HF BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3)
- Wilson, E. B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." Journal of the American Statistical Association, 22(158). Used for confidence intervals on reported scores; a minimal worked sketch follows.
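A minimal sketch of the Wilson score interval (illustrative only, not taken from the benchmark scripts), applied to the headline LongMemEval number from this report (recall@10 = 0.933, n = 500):

```python
import math

def wilson_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson (1927) score interval for a binomial proportion at z-score z (1.96 ~ 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Headline number from this report: LongMemEval recall@10 = 0.933 over n = 500 questions.
lo, hi = wilson_ci(0.933, 500)
print(f"recall@10 = 0.933, 95% CI ≈ [{lo:.3f}, {hi:.3f}]")  # ≈ [0.908, 0.952]
```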