Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks
Mathis Pink, Vy A. Vo, Qinyuan Wu, Jianing Mu, Javier S. Turek, Uri Hasson, Kenneth A. Norman, Sebastian Michelmann, Alexander Huth, Mariya Toneva

TL;DR
This paper introduces Sequence Order Recall Tasks (SORT) to evaluate episodic memory in large language models, highlighting their ability to recall contextual sequences and identifying current limitations in models' long-term episodic memory capabilities.
Contribution
The paper develops a new benchmark, SORT, for assessing episodic memory in LLMs, adapting cognitive psychology tasks for NLP evaluation.
Findings
Humans can recall sequence order based on long-term memory.
Models perform well with in-context text but struggle with memory from training alone.
SORT provides a new framework for evaluating and developing memory-augmented models.
Abstract
Current LLM benchmarks focus on evaluating models' memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address the gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k…
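The task described in the abstract (deciding which of two text segments came first in a source text) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `SortSample`, `make_sort_sample`, and `accuracy`, the whitespace-based word segmentation, and the A/B labeling scheme are all assumptions made for illustration. It only shows why no additional annotation is needed: the ground-truth label follows directly from the segment positions.

```python
import random
from dataclasses import dataclass


@dataclass
class SortSample:
    """One hypothetical SORT item: two segments and the label of the earlier one."""
    segment_a: str  # segment presented under label "A"
    segment_b: str  # segment presented under label "B"
    answer: str     # "A" or "B": which presented segment occurs first in the source


def make_sort_sample(text: str, segment_len: int, rng: random.Random) -> SortSample:
    """Draw two non-overlapping word segments and shuffle their presentation order.

    Assumption: segments are measured in whitespace-separated words.
    """
    words = text.split()
    if len(words) < 2 * segment_len:
        raise ValueError("text is too short for two non-overlapping segments")
    # Pick the earlier segment so that a later, non-overlapping one still fits.
    first_start = rng.randint(0, len(words) - 2 * segment_len)
    second_start = rng.randint(first_start + segment_len, len(words) - segment_len)
    earlier = " ".join(words[first_start:first_start + segment_len])
    later = " ".join(words[second_start:second_start + segment_len])
    # Randomize which label the earlier segment gets, so position carries no signal.
    if rng.random() < 0.5:
        return SortSample(earlier, later, "A")
    return SortSample(later, earlier, "B")


def accuracy(predictions: list[str], samples: list[SortSample]) -> float:
    """Fraction of samples where the predicted label matches the true order."""
    correct = sum(p == s.answer for p, s in zip(predictions, samples))
    return correct / len(samples)
```

In such a setup, a model's predictions (one "A" or "B" per sample) would be compared against `answer`, with 0.5 as the chance level because the presentation order is randomized.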
Peer Reviews
Decision·Submitted to ICLR 2025
The paper addresses a highly important topic of assessing memory capabilities in LLMs. I appreciate the attention to statistical detail reporting, and the generally thorough reporting of the experiments. The paper is well-written. The inclusion of human baselines, especially with the clever involvement of Goodreads users, is a great addition to the paper.
**Significance and novelty** I believe that the work, overall, is sufficiently novel. At the very least, it provides a unique and potentially useful dataset. I am, however, very conflicted about the significance of this work. First, the authors do not discuss whether the notion of episodic memory in LLMs is valid at all. "Episodic memory" is not clearly defined in the paper, other than as a type of memory "which links memories to their contexts, such as the time and place they occurred". If we take t
- **Valuable Human Evaluation**: The paper conducts a large-scale human evaluation involving 155 participants who recently read a book. This provides valuable data on human long-term memory performance in recalling the order of text segments.
- **Highlighting Limitations of Parametric Memory**: Although simple, the paper rightfully brings attention to the limitations of current LLMs' parametric memory in handling one of the episodic memory tasks (here segment order). This is important for raisi
- **Little coverage of episodic memory** The correct order of text segments is one tiny aspect in a large spectrum of episodic memory capabilities: (i) the places/spaces in which events occurred, and (ii) the intricate sorting of events and entities that are tracked over long periods of time. For example, we are able not only to sort the major events of our life, but also the major events that happened within the lifetime of a six-month-long project, or the major events that happened in a movie, or a n
1. The paper offers a simple yet effective task, easily understandable and extensible to various settings. The general formulation of SORT enables adaptation to multiple domains and sequential data types, such as video or audio, enhancing clarity and future applicability.
2. The authors have open-sourced both the Book-SORT dataset and the code, ensuring reproducibility and enabling further research by the community.
3. The paper features multiple experiment settings, covering in-context memory,
1. Although the SORT task is positioned as an assessment of episodic memory, the simple task of sequence order recall does not fully capture the complexities of episodic memory as seen in humans. Episodic memory typically involves more than the temporal order of events; it also includes contextual elements such as the significance, location, or source of memories. As currently designed, SORT assesses temporal ordering without requiring models to associate segments with rich context or situationa
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Neural Networks and Applications
Methods: Focus
