Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell

TL;DR
This paper introduces BEAM, a new benchmark for long-term memory in LLMs, and LIGHT, a framework inspired by human cognition that enhances models with multiple memory systems, significantly improving their performance on long conversations.
Contribution
The paper presents a novel benchmark (BEAM) for evaluating long-term memory in LLMs and proposes LIGHT, a memory-augmented framework that improves model performance on long dialogues.
Findings
LIGHT improves LLM performance by up to 12.69%.
Models with 1M-token context windows struggle with long dialogues.
Ablation confirms each memory component's effectiveness.
Abstract
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term…
Peer Reviews
Decision: ICLR 2026 Poster
The paper focuses on a highly relevant problem. The data generation method is sufficiently original. Experimental evaluation reasonably complements the data-generation effort and the ablation studies help better understand the contributions of different LIGHT framework components.
My biggest issues with the paper are:
- Insufficient coverage of related work, and I'm unsure how this work fits into the broader research context in this domain. The paper is positioned as addressing the problem of long-term memory, but it focuses solely on conversation-like context. The way the authors brush off existing work seems a bit dismissive, especially since many works do not suffer from the limitations the authors cite as disqualifying. For example, Narrative QA a
1. Good scale and diversity: conversational contexts of up to 10M tokens across multiple domains. The samples are validated by humans.
2. New complementary metrics (instruction following, event ordering, and contradiction resolution) to evaluate LLM behaviour in long-context settings.
3. The cognitive framework, LIGHT, shows convincing improvement over a strong RAG baseline.
1. Limited dataset size: only 100 conversations for benchmarking, which could lead to high variance in evaluation results.
2. The paper doesn't deeply analyze why models fail on specific memory abilities. For instance, why do all methods struggle with contradiction resolution? There is no error analysis or qualitative examination of failure modes beyond the single case study in Appendix F.
3. The combination of retrieval + working memory + external memory is not particularly novel. The main contribution
- BEAM spans multi-domain dialogues at four lengths (128K, 500K, 1M, 10M tokens) and evaluates 10 distinct memory abilities, filling gaps in prior single-domain, recall-heavy datasets.
- Component-wise ablations substantiate that each memory module contributes meaningfully to performance.
- Results across context scales (100K to 10M) and models show consistent improvements; the paper also includes component-wise ablations and sensitivity to retrieval budget K.
- LIGHT’s episodic memory relies on FAISS with BGE-small embeddings; the paper does not systematically explore robustness to alternative embedding models or indexing setups beyond a brief note.
- While a human validation step is described for probes and a separate human evaluation of conversation quality is reported, rater pool size/protocols are not clearly specified, making it hard to assess annotation reliability and potential residual biases over very long chats.
- LIGHT maintains an episodi
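The reviews mention an episodic memory queried with a retrieval budget K. As a rough illustration of that general idea only, the sketch below stores conversation turns and returns the K most similar ones for a query. It is not the authors' implementation: the bag-of-words `embed` function and brute-force cosine ranking are stand-ins assumed here in place of BGE-small embeddings and a FAISS index, and the names `EpisodicMemory`, `add`, and `retrieve` are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real sentence embedder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EpisodicMemory:
    """Hypothetical store of past turns with top-K similarity retrieval."""

    def __init__(self):
        self.turns = []  # list of (text, vector) pairs, in insertion order

    def add(self, text):
        self.turns.append((text, embed(text)))

    def retrieve(self, query, k):
        """Return the k stored turns most similar to the query (brute force)."""
        q = embed(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, t[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = EpisodicMemory()
for turn in [
    "my dog is named biscuit",
    "i moved to lisbon last spring",
    "the project deadline is friday",
]:
    memory.add(turn)

# K is the retrieval budget: how many past turns are passed on to the model.
top = memory.retrieve("what is my dog named", k=1)
print(top)
```

In a realistic setup the brute-force ranking would be replaced by an approximate or exact vector index, and K would trade recall of relevant turns against prompt length.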
Taxonomy
Topics: Topic Modeling · Speech and dialogue systems · Multimodal Machine Learning Applications
