SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment
Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Zhou, Jiaqing Liang, Yanghua Xiao

TL;DR
This paper introduces SEA-Eval, a benchmark for evaluating self-evolving agents that accumulate experience across tasks. It formalizes their architecture and argues that sequential convergence matters more than success rate alone.
Contribution
It provides the first formal definition of Self-Evolving Agents, formalizes the Evolutionary Flywheel architecture, and develops SEA-Eval for comprehensive evaluation of evolutionary capabilities.
Findings
Token consumption varies by up to 31.2× across frameworks with identical success rates.
Sequential analysis reveals divergent evolutionary trajectories.
Success rate alone can create a capability illusion.
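To make the findings above concrete, here is a minimal Python sketch of how two frameworks can report the same success rate while differing in token cost and in how that cost changes over a sequential task stream. The function `summarize`, the field names, and the toy numbers are all hypothetical illustrations; they are not SEA-Eval's metrics or results.

```python
from statistics import mean

def summarize(runs):
    """Summarize a task stream.

    runs: list of (succeeded, tokens) tuples in the order tasks were attempted.
    All field names below are illustrative, not taken from the paper.
    """
    tokens = [t for _, t in runs]
    half = len(tokens) // 2
    return {
        # Fraction of tasks solved, ignoring order and cost.
        "success_rate": mean(1.0 if ok else 0.0 for ok, _ in runs),
        # Average token cost per task.
        "mean_tokens": mean(tokens),
        # Below 1.0 means later tasks cost fewer tokens than earlier ones.
        "late_over_early": mean(tokens[half:]) / mean(tokens[:half]),
    }

# Toy data (made up): both frameworks solve every task.
framework_a = [(True, 4000), (True, 3000), (True, 2000), (True, 1000)]
framework_b = [(True, 4000), (True, 4000), (True, 4000), (True, 4000)]

a = summarize(framework_a)
b = summarize(framework_b)
print(a)
print(b)
```

In this toy setup the success rates are identical, yet only the first stream gets cheaper as it proceeds, which is the kind of difference an order-aware analysis can surface and a single success rate cannot.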
Abstract
Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper presents the first formal definition of the Self-Evolving Agent (SEA), formalizes the Evolutionary Flywheel as its minimal sufficient architecture, and introduces SEA-Eval -- the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA-Eval establishes and as primary metrics and, through its sequential task stream design, enables the independent quantification of evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that under identical success rates, token consumption differs by up to 31.2× across frameworks, with divergent evolutionary trajectories under sequential analysis --…
