MemPalace — Verbatim Memory for LLM Agents, Benchmarked Honestly

TL;DR

Why it matters. LLM assistants lose all conversational context when a session ends; existing solutions summarize or extract information at write time, trading fidelity for compression. MemPalace stores conversation history verbatim and retrieves it with semantic search — no LLM required at index time.
What it proposes. A local-first hierarchical memory palace: conversation content is organized into wings, rooms, and drawers; retrieval uses embedding similarity with optional hybrid boosting; an MCP server exposes 29 tools to LLM agents.
Headline result. 96.6% R@5 on LongMemEval (raw mode, no LLM, no API key); 98.4% R@5 on the held-out hybrid pipeline — the highest published retrieval-recall numbers on this benchmark as of mid-2026.
Limit. Verbatim storage performs poorly on “noisy” and conditional-reasoning categories (43–57% R@5) where distractors are deliberately mixed with relevant content; temporal persona management across evolving conversation banks is unresolved.
What to keep. Raw verbatim text + good embeddings is a stronger retrieval baseline than extraction-based approaches assumed. Deciding what to keep at write time is the step where recall loss occurs.

Background

LLM memory research has converged on two paradigms. Extraction-based memory (Mem0, Zep, OpenAI’s memory feature) runs an LLM at write time to identify “important” information and discard the rest. Summarization-based memory collapses sessions into progressively shorter abstractions. Both paradigms make an irreversible decision at write time: what gets dropped stays dropped.

The failure mode is predictable. A user mentions their daughter’s allergy in turn 7 of session 3. An extraction pipeline might not flag this as “important”. A summarization pipeline compresses it into “user has a child”. When the user returns four sessions later and asks for a dinner recommendation, the specific allergy is gone from the retrievable store.

LongMemEval (2024) was designed to stress-test exactly this scenario: 500 questions spanning knowledge updates, multi-session reasoning, temporal queries, and single-session facts, evaluated against conversation histories of 100–500 turns. A system that extracts selectively will fail on questions whose answers depend on details the extractor deemed unimportant.

MemPalace’s premise is that the extraction decision belongs at query time, not write time. Store everything verbatim; decide relevance only when you know the question.

Core Idea

Palace hierarchy: two subgraphs (WING: Project, WING: Person) each containing Rooms connected by halls, Closets, and Drawers (verbatim), linked across wings by a tunnel bridge — **Fig. 1.** The palace hierarchy. Each wing (Project, Person) partitions content by context; rooms sub-partition by topic connected via halls; closets summarize and drawers hold verbatim text chunks. Cross-wing tunnel bridges enable traversal between related rooms in separate wings. Source: MemPalace docs — the-palace.md (rendered from Mermaid source)

The “palace” metaphor is structural, not cosmetic. The hierarchy maps to real retrieval constraints:

Wings are top-level partitions, typically one per project or long-running context. Scoping search to a wing eliminates cross-context noise without requiring the LLM to reason about it.
Rooms are topic-level partitions within a wing. The current default splits content into five conceptual halls (travel, work, health, relationships, general).
Drawers are the atomic storage unit — one per conversation session by default, or one per user/assistant message turn in fine-grained mode (mempalace sweep).

Content is stored as verbatim text. The only transformation at write time is chunking and embedding (sentence transformers, ~300 MB model, runs locally). No LLM is invoked, no API key is needed, nothing is discarded.

Method

Raw retrieval

The base retrieval layer is straightforward: embed the query, compute cosine similarity against all stored drawer embeddings, return the top-K results. No heuristics, no reranking. This “raw mode” achieves 96.6% R@5 on LongMemEval — meaning the correct session appears in the top-5 results 96.6% of the time across 500 diverse questions.

The strength of this baseline is itself informative. Most retrieval-augmented memory literature implicitly assumed that raw vector search would underperform, motivating elaborate extraction pipelines. MemPalace’s raw result suggests the assumption was wrong, or at least that the failure case (what does raw search miss?) is smaller than the field assumed.

Hybrid pipeline (v4)

The hybrid pipeline layers five additional signals on top of the raw embedding similarity:

Keyword overlap boost: exact terminology match between query and drawer receives a mild score increase.
Temporal proximity boost: drawers close in time to the query’s implied date receive up to 40% distance reduction.
Quoted phrase extraction: exact phrases in quotes receive 60% distance reduction.
Person name boosting: capitalized proper nouns receive 40% distance reduction.
Preference pattern extraction: 16 regex patterns detect implicit preferences (“I usually prefer X”, “I don’t like Y”) and boost matching drawers.

Trained on a 50-question dev set, the hybrid pipeline achieves 98.4% R@5 on the held-out 450 questions — the honest generalisable figure. The 50-question tuning set is explicitly flagged in the benchmark documentation to prevent confusion with generalization performance.

Knowledge graph

Alongside the vector store, MemPalace maintains a temporal entity-relationship graph backed by SQLite. Entities (people, projects, places) carry validity windows: relationships can be added, queried, invalidated, and displayed as timelines. This layer is for structured facts that need point-in-time retrieval rather than semantic similarity search.

MCP server and agent integration

Twenty-nine MCP tools cover palace reads and writes, knowledge-graph operations, cross-wing navigation, drawer management, and agent diaries. Each specialist agent gets its own wing and diary, discoverable at runtime via mempalace_list_agents without requiring a bloated system prompt. For Claude Code specifically, two hooks handle auto-save: one periodic hook and one pre-compression hook that fires before context compaction.

Benchmarks

LongMemEval per-category retrieval recall table: Knowledge update 99.0%, Multi-session 98.5%, Temporal reasoning 96.2%, Single-session user 95.7%, Single-session preference 93.3%, Single-session assistant 92.9% — all in raw mode with no LLM — **Fig. 2.** LongMemEval per-category R@5 and R@10 in raw mode (no LLM, no API key). Knowledge update and multi-session questions — the hardest long-horizon categories — score highest (99%+). Single-session preference (93.3%) and assistant questions (92.9%) are weakest, as they require inferring what the AI itself said rather than retrieving user-provided facts. Source: MemPalace benchmarks — benchmarks/BENCHMARKS.md

All benchmark numbers are reproducible from the public repository; full per-question result files are committed under benchmarks/results_*.jsonl. The benchmark methodology explicitly separates retrieval recall (R@K: does the correct session appear in top-K?) from end-to-end QA accuracy (does the system produce the correct answer?) — these are different metrics that prior comparison tables had conflated.

LongMemEval (500 questions):

Mode	R@5	LLM required
Raw (verbatim + embedding, no heuristics)	96.6%	None
Hybrid v4, held-out 450q	98.4%	None
Hybrid v4 + LLM rerank (full 500)	≥99%	Any capable model

The rerank pipeline promotes the best candidate out of the top-20 retrieved sessions using an LLM reader. The gap between raw and reranked is model-agnostic — reproduced with Claude Haiku, Claude Sonnet, and minimax-m2.7 via Ollama. The 99%+ number is not headlined because the last 0.4% was reached by inspecting specific wrong answers (benchmarks/BENCHMARKS.md flags this as “teaching to the test”).

LoCoMo (1,986 multi-hop questions):

Mode	R@10
Hybrid v5 (no rerank)	88.9%
Palace v2 (3 rooms, routing at index)	84.8%

ConvoMem (250 items, 5 categories): 92.9% average recall. Strongest on user facts (98.0%) and assistant facts (100%); weakest on preferences (86.0%).

MemBench — ACL 2025 (8,500 items): 80.3% R@5. Near-perfect on aggregative (99.3%) and comparative (98.4%) categories; weakest on noisy (43.4%) and conditional-reasoning (57.3%) categories, where distractors are deliberately introduced.

The MemBench weakness is the honest limit of verbatim storage: when irrelevant content is deliberately mixed in, higher recall of the full session also recalls the distractors. Filtering distractors requires reasoning, which raw embedding cannot do.

Limitations

Four limits worth flagging. First, verbatim storage plus noisy queries: the 43.4% R@5 on the MemBench “noisy” category reflects what happens when irrelevant content is interspersed with relevant content in the same session — embedding similarity retrieves the whole session, distractors included. Post-retrieval LLM reading can mitigate this but reintroduces API dependency. Second, AAAK compression mode trades fidelity for token density (84.2% vs 96.6% R@5 on LongMemEval — a 12.4pp regression) and should not be treated as equivalent to raw mode for retrieval-sensitive applications. Third, temporal persona management is unresolved: the framework handles fixed time windows independently, but incrementally updating an existing persona bank as the user’s behavior evolves over months requires a merge strategy that is not yet implemented. Fourth, the benchmark corrections documented in docs/HISTORY.md are notable: a community audit (April 2026) identified metric-comparison errors and overstated claims in the initial release that required table rewrites across multiple surfaces.

Sapiens Q Take

Three observations worth carrying forward. First, verbatim storage beats extraction on retrieval — and the gap is larger than expected. The practical implication for any agent memory pipeline is: if you care about retrieval recall, the first question is whether you need to extract at all. Compression is irreversible; you can always discard at query time via post-retrieval filtering. Second, the HISTORY.md transparency document is worth reading as a model for handling public benchmark corrections. The community audit found real errors; the maintainers addressed each one publicly with before/after numbers, explicit methodology flags (“teaching to the test”), and removed competitor comparisons that used incompatible metrics. This level of disclosure is unusual in the retrieval and memory benchmark space and raises the credibility of the numbers that remain. Third, the 29-tool MCP server is the integration point that makes this practically interesting for agent pipelines: structured memory accessible through standard tool calls, scoped to wings or rooms to keep context loading predictable. For multi-agent systems where individual agents need persistent but isolated memory (one wing per agent), the architecture maps cleanly to common design patterns without requiring custom state management.

References

Repository: MemPalace/mempalace — v3.3.4 at time of writing
Documentation: mempalaceofficial.com
Benchmark corrections and history: docs/HISTORY.md
Benchmark methodology: benchmarks/BENCHMARKS.md
LongMemEval: Wu et al. (2024), arXiv:2407.01437
LoCoMo: Maharana et al. (2024), ACL 2024
MemBench: arXiv:2506.21605 (ACL 2025)
Mem0 (extraction-based alternative): mem0.ai
ChromaDB (default vector backend): trychroma.com