TL;DR
SE-Bench is a diagnostic environment for evaluating whether agents can internalize novel knowledge. It controls for two confounds, entanglement with prior knowledge and entanglement with reasoning complexity, and yields insights about training and internalization methods.
Contribution
The paper introduces SE-Bench, a diagnostic benchmark for measuring self-evolution and knowledge internalization in agents, along with insights on training paradigms and internalization techniques.
Findings
Training with reference documentation inhibits knowledge retention.
Standard RL methods struggle to fully internalize new knowledge.
Self-Play combined with SFT enables models to learn from noisy, self-generated tasks.
Abstract
True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the…
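The obfuscation idea described in the abstract can be illustrated with a minimal sketch. This is not the SE-Bench implementation: it assumes the standard-library `math` module as a stand-in for NumPy so that it runs without dependencies, and the helper names `random_identifier` and `obfuscate_module` and the package name `pseudo_pkg` are hypothetical.

```python
import math
import random
import string
import types


def random_identifier(rng, length=8):
    """Return a random lowercase identifier of the given length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate_module(module, seed=0, package_name="pseudo_pkg"):
    """Expose a module's public callables under randomized names.

    Returns (obfuscated_module, mapping), where mapping maps each
    original name to its randomized replacement.
    """
    rng = random.Random(seed)  # seeded so the renaming is reproducible
    obfuscated = types.ModuleType(package_name)
    mapping = {}
    used = set()
    for name in sorted(dir(module)):
        attr = getattr(module, name)
        # Skip private names and non-callables such as constants.
        if name.startswith("_") or not callable(attr):
            continue
        new_name = random_identifier(rng)
        while new_name in used:  # avoid collisions between new names
            new_name = random_identifier(rng)
        used.add(new_name)
        mapping[name] = new_name
        setattr(obfuscated, new_name, attr)
    return obfuscated, mapping


# `math` stands in for NumPy here (an assumption made for illustration).
pseudo, mapping = obfuscate_module(math, seed=42)
sqrt_alias = mapping["sqrt"]
result = getattr(pseudo, sqrt_alias)(16.0)
```

In this sketch the original name `sqrt` no longer exists on the obfuscated module, so a caller without the mapping (the analogue of the API doc) cannot find the function by its familiar name, while a caller who knows the new name can use it as usual.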
