SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan; Tailin Jin; Weize Chen; Zeyuan Liu

arXiv:2602.04811·cs.CL·May 12, 2026

TL;DR

SE-Bench is a diagnostic environment for evaluating whether agents can internalize novel knowledge. It controls for two confounds, entanglement with prior knowledge and entanglement with reasoning complexity, and yields three findings about training and internalization methods.

Contribution

The paper introduces SE-Bench, a benchmark for measuring self-evolution and knowledge internalization in agents, and reports insights on training paradigms and internalization techniques.

Findings

1. Training with reference documentation inhibits knowledge retention.
2. Standard RL methods struggle to fully internalize new knowledge.
3. Self-Play combined with SFT enables models to learn from noisy, self-generated tasks.

Abstract

True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the…
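
The obfuscation idea described in the abstract can be illustrated with a small sketch. This is not the SE-Bench implementation: it uses Python's standard-library `math` module as a stand-in for NumPy, and the helper names (`random_identifier`, `obfuscate_module`, `obfuscate_doc`) are hypothetical. It only shows how existing functions and their documentation might be re-exposed under randomized identifiers so that the names carry no hint from pre-training data.

```python
import keyword
import math
import random
import re
import string
import types


def random_identifier(rng, length=8):
    """Return a random lowercase name that carries no semantic hint."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate_module(module, names, seed=0):
    """Expose selected functions of `module` under randomized aliases.

    Returns the obfuscated namespace and the original -> alias mapping.
    (Hypothetical helper; `math` stands in for NumPy in this sketch.)
    """
    rng = random.Random(seed)
    mapping, used = {}, set()
    for name in names:
        alias = random_identifier(rng)
        # Redraw on collisions or on names that are Python keywords.
        while alias in used or keyword.iskeyword(alias):
            alias = random_identifier(rng)
        used.add(alias)
        mapping[name] = alias
    namespace = types.SimpleNamespace(
        **{alias: getattr(module, name) for name, alias in mapping.items()}
    )
    return namespace, mapping


def obfuscate_doc(doc, mapping):
    """Rewrite documentation so it mentions only the obfuscated names."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, mapping)) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], doc)


# Usage: build a pseudo-novel package and its matching documentation.
pkg, mapping = obfuscate_module(math, ["sqrt", "floor"], seed=0)
doc = obfuscate_doc("sqrt(x) returns the square root of x.", mapping)

# The obfuscated function behaves like the original one.
assert getattr(pkg, mapping["sqrt"])(9.0) == 3.0
# The original name is no longer exposed or documented.
assert not hasattr(pkg, "sqrt")
assert "sqrt" not in doc
```

In this sketch, a task such as "compute the square root of 9" is trivial when the rewritten documentation is available and cannot be solved by guessing the function name, which mirrors the setting the abstract describes.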

Code & Models

Repositories

thunlp/SE-Bench (GitHub)
