SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Sihang Jiang; Lipeng Ma; Zhonghua Hong; Keyi Wang; Zhiyu Lu; Shisong Chen; Jinghao Zhang; Tianjun Pan; Weijia Zhou; Jiaqing Liang; Yanghua Xiao

arXiv:2604.08988·cs.AI·April 15, 2026

TL;DR

This paper introduces SEA-Eval, a new benchmark for evaluating self-evolving agents that can learn across tasks, formalizes their architecture, and highlights the importance of sequential convergence over success rate.

Contribution

It provides the first formal definition of Self-Evolving Agents, formalizes the Evolutionary Flywheel architecture, and develops SEA-Eval for comprehensive evaluation of evolutionary capabilities.

Findings

01

Token consumption varies by up to 31.2× across frameworks that achieve identical success rates.

02

Sequential analysis reveals divergent evolutionary trajectories.

03

Success rate alone can create a capability illusion.
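
To make the third finding concrete, here is a minimal Python sketch of how two frameworks can report the same success rate over a sequential task stream while differing in token cost and in how that cost changes over the sequence. Everything in it is an illustrative assumption, not SEA-Eval's actual implementation: the `TaskResult` record, the `token_trend` helper (a crude second-half versus first-half ratio), and the numbers for the two hypothetical frameworks.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """One task in a sequential stream (hypothetical record layout)."""
    success: bool
    tokens: int


def success_rate(results):
    """Fraction of tasks solved, ignoring order and cost."""
    return sum(r.success for r in results) / len(results)


def total_tokens(results):
    """Total tokens spent over the whole stream."""
    return sum(r.tokens for r in results)


def token_trend(results):
    """Mean tokens in the second half divided by mean tokens in the first half.

    A value below 1 means later tasks were cheaper than earlier ones.
    This is an assumed proxy, not a metric defined by the paper.
    """
    half = len(results) // 2
    first = mean(r.tokens for r in results[:half])
    second = mean(r.tokens for r in results[half:])
    return second / first


# Made-up data: both frameworks solve 3 of 4 tasks.
framework_a = [TaskResult(s, t) for s, t in
               [(True, 1000), (True, 800), (False, 600), (True, 400)]]
framework_b = [TaskResult(s, t) for s, t in
               [(True, 1000), (False, 1100), (True, 1200), (True, 1300)]]

for name, run in [("A", framework_a), ("B", framework_b)]:
    print(name, success_rate(run), total_tokens(run), round(token_trend(run), 2))
```

In this toy data both frameworks have a success rate of 0.75, yet framework A gets cheaper as the stream progresses while framework B gets more expensive, a difference that only shows up when the tasks are analyzed in order.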

Abstract

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper presents the first formal definition of the Self-Evolving Agent (SEA), formalizes the Evolutionary Flywheel as its minimal sufficient architecture, and introduces SEA-Eval -- the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA-Eval establishes $SR$ and $T$ as primary metrics and, through its sequential task stream design, enables the independent quantification of evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that under identical success rates, token consumption differs by up to $31.2\times$ across frameworks, with divergent evolutionary trajectories under sequential analysis --…
