Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
Yucheng Shi, Zhenwen Liang, Kishan Panaganti, Dian Yu, Wenhao Yu, Haitao Mi

TL;DR
This paper introduces a method for self-improving language models that construct and validate environments to facilitate ongoing learning, emphasizing environment stability and difficulty calibration.
Contribution
It presents EvoEnv, a novel environment synthesis approach that enables models to generate and validate environments for self-improvement in reasoning tasks.
Findings
EvoEnv improves reasoning accuracy from 72.4% to 74.8%.
Environment synthesis with validation enhances model performance.
Stable environment difficulty is key to sustained self-improvement.
Abstract
We pursue a vision for self-improving language models in which the model does not merely generate problems or traces to imitate, but constructs the environments that train it. In zero-data reasoning RL, this reframes self-improvement from a data-generation loop into an environment-construction loop, where each artifact is a reusable executable object that samples instances, computes references, and scores responses. Whether this vision sustains improvement hinges on a single property: the environments must exhibit stable solve-verify asymmetry. That is, the model must be able to write, once, an oracle that it cannot reliably execute in natural language on fresh instances. This asymmetry takes two complementary forms. Some tasks are algorithmically hard to reason through but trivial as code: a dynamic program or graph traversal, compiled once, yields unboundedly many calibrated instances. Others…
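To make the idea of an environment as a "reusable executable object" concrete, here is a minimal sketch of the three roles the abstract names: sampling instances, computing references, and scoring responses. It is an illustration only, not the paper's implementation. The class name `LISEnvironment`, the `Instance` container, the method names, and the choice of task (longest strictly increasing subsequence, solved by a textbook dynamic program) are all assumptions introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    prompt: str      # natural-language question shown to the model
    reference: int   # answer computed by the code oracle


class LISEnvironment:
    """Hypothetical environment: longest strictly increasing subsequence."""

    def __init__(self, length: int = 12, max_value: int = 50):
        # Illustrative difficulty knobs; not taken from the paper.
        self.length = length
        self.max_value = max_value

    def oracle(self, seq: list[int]) -> int:
        # O(n^2) dynamic program: best[i] is the length of the longest
        # strictly increasing subsequence that ends at position i.
        best = [1] * len(seq)
        for i in range(len(seq)):
            for j in range(i):
                if seq[j] < seq[i]:
                    best[i] = max(best[i], best[j] + 1)
        return max(best, default=0)

    def sample(self, rng: random.Random) -> Instance:
        # Draw a fresh sequence and attach the oracle's answer to it.
        seq = [rng.randint(1, self.max_value) for _ in range(self.length)]
        prompt = (
            "What is the length of the longest strictly increasing "
            f"subsequence of {seq}?"
        )
        return Instance(prompt=prompt, reference=self.oracle(seq))

    def score(self, instance: Instance, response: str) -> float:
        # Binary reward: 1.0 for an exact integer match, otherwise 0.0.
        try:
            return 1.0 if int(response.strip()) == instance.reference else 0.0
        except ValueError:
            return 0.0
```

The oracle is written once and is cheap to run, while answering a fresh prompt in natural language requires stepping through the same dynamic program by hand. That gap is the kind of asymmetry the abstract describes for tasks that are hard to reason through but trivial as code.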
