Evaluating Long-Term Memory for Long-Context Question Answering
Alessandra Terranova, Björn Ross, Alexandra Birch

TL;DR
This paper systematically evaluates various memory-augmented methods for long-context question answering in large language models, highlighting their effectiveness and how they scale with model capability.
Contribution
It provides a comprehensive comparison of memory types for long-context QA, revealing which approaches improve efficiency and accuracy across different model sizes.
Findings
Memory approaches reduce token usage by over 90%.
RAG benefits foundation models; episodic memory aids instruction-tuned models.
Episodic memory helps models recognize their knowledge limits.
Abstract
In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods on long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with foundation models…
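Illustration
The contrast between full-context prompting and retrieval-based memory can be made concrete with a small sketch. This is not the authors' pipeline: it assumes a toy bag-of-words cosine similarity in place of a real embedding retriever, counts tokens with a simple regex rather than a model tokenizer, and uses a made-up four-turn dialogue. The function names (`retrieve`, `token_reduction`) are hypothetical. It only shows how keeping the top-k relevant turns shrinks the prompt relative to passing the whole conversation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word/number tokens; a rough stand-in for a model tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(turns, question, k=2):
    """Return the k dialogue turns most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        turns, key=lambda t: cosine(Counter(tokenize(t)), q), reverse=True
    )
    return ranked[:k]


def token_reduction(turns, selected):
    """Fraction of tokens saved by prompting with `selected` instead of all turns."""
    full = sum(len(tokenize(t)) for t in turns)
    kept = sum(len(tokenize(t)) for t in selected)
    return 1.0 - kept / full


# Hypothetical dialogue history, invented for illustration only.
turns = [
    "Alice: I adopted a dog named Biscuit last spring.",
    "Bob: The weather has been terrible this week.",
    "Alice: My sister moved to Lisbon for a new job.",
    "Bob: I finally finished reading that long novel.",
]
question = "What is the name of Alice's dog?"

selected = retrieve(turns, question, k=1)
print("Retrieved context:", selected)
print("Token reduction: {:.0%}".format(token_reduction(turns, selected)))
```

On a four-turn toy history the saving is modest; the reduction grows as the history gets longer while k stays fixed, which is the regime the abstract describes.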
