LifelongAgentBench: Evaluating LLM Agents as Lifelong Learners
Junhao Zheng, Xidi Cai, Qiuke Li, Duzhen Zhang, ZhongZhi Li, Yingying Zhang, Le Song, Qianli Ma

TL;DR
LifelongAgentBench is a comprehensive benchmark designed to evaluate and improve the lifelong learning capabilities of LLM-based agents across multiple interactive environments.
Contribution
It introduces a unified, skill-grounded benchmark with automatic verification, and proposes a group self-consistency mechanism to enhance lifelong learning in LLM agents.
Findings
Experience replay has limited effectiveness for LLM agents.
Group self-consistency significantly improves lifelong learning performance.
The benchmark facilitates systematic assessment of LLM agents' memory and adaptability.
Abstract
Lifelong learning is essential for intelligent agents operating in dynamic environments. Current large language model (LLM)-based agents, however, remain stateless and unable to accumulate or transfer knowledge over time. Existing benchmarks treat agents as static systems and fail to evaluate lifelong learning capabilities. We present LifelongAgentBench, the first unified benchmark designed to systematically assess the lifelong learning ability of LLM agents. It provides skill-grounded, interdependent tasks across three interactive environments, Database, Operating System, and Knowledge Graph, with automatic label verification, reproducibility, and modular extensibility. Extensive experiments reveal that conventional experience replay has limited effectiveness for LLM agents due to irrelevant information and context length constraints. We further introduce a group self-consistency…
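The abstract is cut off before it describes the group self-consistency mechanism, so the following is only a minimal sketch under stated assumptions: past experiences are partitioned into smaller groups, the agent is queried once per group so that each prompt stays short, and the final answer is chosen by majority vote. The names `split_into_groups`, `group_self_consistency`, and `toy_agent` are hypothetical and do not come from the paper or its code; the round-robin partition and plain-string prompts are also illustrative choices.

```python
from collections import Counter
from typing import Callable, List, Sequence


def split_into_groups(experiences: Sequence[str], num_groups: int) -> List[List[str]]:
    """Round-robin partition of past experiences into at most num_groups non-empty groups."""
    if num_groups < 1:
        raise ValueError("num_groups must be at least 1")
    groups = [list(experiences[i::num_groups]) for i in range(num_groups)]
    return [group for group in groups if group]


def group_self_consistency(
    task: str,
    experiences: Sequence[str],
    agent: Callable[[str], str],
    num_groups: int = 3,
) -> str:
    """Query the agent once per experience group and return the majority answer.

    Assumption: `agent` maps a prompt string to an answer string. A real LLM
    agent would be wrapped behind this callable.
    """
    # With no experiences, fall back to a single query that contains only the task.
    groups = split_into_groups(experiences, num_groups) or [[]]
    answers = []
    for group in groups:
        # Each prompt replays only one group's experiences, keeping the context short.
        prompt = "\n\n".join([*group, task])
        answers.append(agent(prompt))
    # Majority vote over the per-group answers.
    return Counter(answers).most_common(1)[0][0]


def toy_agent(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: answers correctly only if a useful hint was replayed."""
    return "42" if "answer 42" in prompt else "unknown"


past_experiences = [
    "exp 0: similar task, answer 42",
    "exp 1: similar task, answer 42",
    "exp 2: unrelated task",
    "exp 3: unrelated task",
    "exp 4: unrelated task",
    "exp 5: unrelated task",
]
final_answer = group_self_consistency("new task: what is the value?", past_experiences, toy_agent)
```

In this toy run, two of the three groups contain a relevant experience, so the vote recovers the useful answer even though one group sees only irrelevant context. Whether the paper votes on whole answers or on individual actions is not stated in this summary.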
Peer Reviews
Decision · Submitted to ICLR 2026
1. The benchmark is highly reliable and flexible, easy to use, and readily extensible. 2. The grouped self-consistency mechanism effectively mitigates memory and inference overhead in large-scale experience replay.
1. The paper shows limited novelty. Among the four claimed innovations, Task Dependency is a common method for constructing tasks and does not clearly differ from prior work. Label Verifiability and Reproducibility are basic requirements for a benchmark, while Modularity relates to usability. Only Task Dependency contains some technical content, and the others cannot be considered true innovations. 2. The evaluation of lifelong learning is incomplete because it only considers rapid adaptation to
Important Problem: The paper's core motivation is strong. Evaluating the ability of agents to learn continuously is a critical, timely, and under-studied problem in the field of LLM-based agents. Benchmark Artifact: The creation of a dedicated, open-source benchmark with containerized environments and automatic verification is a non-trivial engineering effort. This infrastructure could, in principle, be a useful tool for the community.
Unclear Definition of "Lifelong Learning": The paper fails to provide a precise and operational definition of lifelong learning. In Section 3, the problem formulation is presented as a generic sequential POMDP, which does not capture any distinctive characteristics of lifelong tasks. No explicit statement is given to clarify what “lifelong” means in this context, and how it affects the benchmark design. Overstated Novelty and Weak Analysis: The claimed methodological contribution appears minor
- This work is the first to propose a benchmark specifically targeting the lifelong learning capability of LLM-based agents, with a novel problem definition that fills a gap in existing evaluation frameworks. - The proposed grouped self-consistency mechanism represents an improvement over traditional experience replay methods, demonstrating methodological innovation. - The work offers an extensive suite of well-defined and verifiable agent tasks, enabling performance evaluation and experiments.
There are flaws in the experiments: 1. On line 054, Table 1 includes only a few agent-related benchmarks for comparison. Benchmarks such as OSWorld and BrowseComp were not taken into consideration. 2. On line 328, Table 2 is intended to show the effectiveness of replay, but it uses only one model. 3. On line 435, Table 3 measures only DB and KG. Additionally, the number of models used for DB and KG was different. If a model fails in KG, then DB should not be included either, as it has no signific
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Experience Replay
