Is One Score Enough? Rethinking the Evaluation of Sequentially Evolving LLM Memory

Songwei Dong; Zihan Chen; Chengshuai Shi; Peng Wang; Jundong Li; Cong Shen

arXiv:2605.15384 · cs.LG · May 18, 2026

TL;DR

This paper introduces SeqMem-Eval, a diagnostic framework for evaluating LLM memory in sequential tasks, revealing that traditional metrics overlook critical memory failure modes like forgetting.

Contribution

It proposes a new evaluation method that assesses memory evolution, generalization, and retention, providing deeper insights into LLM memory performance beyond final accuracy.

Findings

1. Higher accuracy does not always mean better memory quality.

2. Different memory methods show distinct trade-offs between adaptability and stability.

3. Many methods suffer from forgetting and negative transfer despite strong performance.

Abstract

Memory plays a central role in enabling large language models (LLMs) to operate over sequential tasks by accumulating and reusing experience over time. However, existing evaluations of LLM memory mostly rely on aggregate metrics such as final hold-out accuracy or cumulative online performance, which can obscure critical failure modes such as forgetting and negative transfer. In this paper, we introduce SeqMem-Eval, a diagnostic evaluation framework for sequentially evolving LLM memory. Drawing inspiration from continual learning, it targets a test-time setting in which memory is external, prompt-mediated, and updated without modifying model parameters. Rather than focusing only on final performance, SeqMem-Eval evaluates how memory states evolve, generalize, consolidate experience, and retain useful information during sequential inference. Specifically, it measures online utility,…
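
To illustrate why a single aggregate score can hide forgetting, here is a minimal sketch. It is not SeqMem-Eval's metric suite, whose definitions are cut off in the abstract above. Instead it assumes the average-forgetting measure commonly used in continual learning: the best accuracy a task ever reached before the final step, minus its accuracy at the final step. The function names and both accuracy matrices are hypothetical and chosen only for illustration. Entry `acc[i][j]` is taken to mean accuracy on task `j` after the memory has processed tasks `0..i`.

```python
from typing import List

Matrix = List[List[float]]


def final_average(acc: Matrix) -> float:
    """Mean accuracy over all tasks after the last memory update."""
    last_row = acc[-1]
    return sum(last_row) / len(last_row)


def average_forgetting(acc: Matrix) -> float:
    """Mean drop from each earlier task's best past accuracy to its final accuracy.

    Assumption: this follows the usual continual-learning definition,
    not a definition taken from the paper.
    """
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for task in range(num_tasks - 1):
        # Best accuracy on this task at any step before the final one.
        best_before_final = max(acc[step][task] for step in range(task, num_tasks - 1))
        drops.append(best_before_final - acc[-1][task])
    return sum(drops) / len(drops)


# Hypothetical data: entries above the diagonal (tasks not yet seen) are unused.
stable_method = [
    [0.6, 0.0, 0.0],
    [0.6, 0.6, 0.0],
    [0.6, 0.6, 0.6],
]
forgetful_method = [
    [0.9, 0.0, 0.0],
    [0.7, 0.9, 0.0],
    [0.4, 0.5, 0.9],
]

for name, acc in [("stable", stable_method), ("forgetful", forgetful_method)]:
    print(name, round(final_average(acc), 3), round(average_forgetting(acc), 3))
```

Under these made-up numbers both methods end with the same final average, yet only the second loses accuracy on earlier tasks, which is the kind of gap a single final score cannot show.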
