Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory
Songwei Dong, Zihan Chen, Chengshuai Shi, Peng Wang, Jundong Li, Cong Shen

TL;DR
This paper introduces SeqMem-Eval, a diagnostic framework for evaluating LLM memory on sequential tasks. It shows that aggregate metrics such as final hold-out accuracy can hide memory failure modes like forgetting and negative transfer.
Contribution
The framework evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference, giving a more detailed picture of LLM memory performance than final accuracy alone.
Findings
Higher accuracy does not always mean better memory quality.
Different memory methods show distinct trade-offs between adaptability and stability.
Many methods suffer from forgetting and negative transfer despite strong aggregate performance.
Abstract
Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility,…
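To make the contrast between aggregate and diagnostic metrics concrete, here is a minimal sketch of two quantities commonly used in continual learning, average forgetting and backward transfer, computed from a matrix of per-task accuracies. This is not SeqMem-Eval's own metric suite (the abstract above is truncated before it lists them). The function names, the matrix layout `acc[i][j]`, and the numbers are all assumptions made for illustration.

```python
from typing import List


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from the best earlier accuracy to the final accuracy.

    Assumed layout: acc[i][j] is the accuracy on task j, measured after the
    memory has processed tasks 0..i. The last task is excluded because it
    has had no chance to be forgotten yet.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):
        # Best accuracy on task j at any point from when it was seen up to,
        # but not including, the final memory state.
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[num_tasks - 1][j])
    return sum(drops) / len(drops)


def backward_transfer(acc: List[List[float]]) -> float:
    """Mean change on earlier tasks between just-seen and final states.

    Negative values indicate that later memory updates hurt earlier tasks.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    deltas = [acc[num_tasks - 1][j] - acc[j][j] for j in range(num_tasks - 1)]
    return sum(deltas) / len(deltas)


def final_average_accuracy(acc: List[List[float]]) -> float:
    """The single aggregate score: mean accuracy over tasks at the end."""
    last_row = acc[-1]
    return sum(last_row) / len(last_row)


# Made-up accuracies for a hypothetical three-task sequence.
example = [
    [0.80, 0.10, 0.10],
    [0.60, 0.90, 0.20],
    [0.50, 0.70, 0.85],
]
print(final_average_accuracy(example))
print(average_forgetting(example))
print(backward_transfer(example))
```

In this made-up example the final average accuracy looks reasonable, yet accuracy on the first two tasks has dropped noticeably since they were first seen, which is the kind of behavior a single final score would not reveal.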
