ReVeal: Self-Evolving Code Agents via Reliable Self-Verification
Yiyang Jin, Kunzhao Xu, Hang Li, Xueting Han, Yanmin Zhou, Cheng Li, Jing Bai

TL;DR
ReVeal is a reinforcement learning framework that enhances code generation by explicitly optimizing self-verification, enabling models to evolve and improve their code through iterative self-assessment and tool-based evaluation, leading to more robust AI agents.
Contribution
ReVeal introduces a novel multi-turn reinforcement learning approach that explicitly optimizes self-verification, co-evolving code and test generation for improved scalability and robustness.
Findings
Enables code evolution over 20+ turns using self-verification.
Significantly improves Pass@k scores, indicating better exploration.
Demonstrates scalability with training on only three datasets.
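For readers unfamiliar with the metric in the second finding, Pass@k is commonly estimated per problem as 1 - C(n-c, k)/C(n, k), where n samples are drawn and c of them pass. The sketch below implements that standard estimator. Whether the paper computes Pass@k in exactly this way is an assumption, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples drawn, c: number of samples that passed,
    k: sampling budget being evaluated (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples, so every size-k subset contains a pass.
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score is then the mean of this value over all problems.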
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference, this strengthened…
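The iterative generation-verification turns described in the abstract can be pictured as a simple loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate_code`, `generate_tests`, and `run_tests` are hypothetical callables standing in for the policy model and the tool-based evaluator, and the stopping rule (stop once all self-generated tests pass) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    code: str          # candidate solution produced in this turn
    tests: List[str]   # self-generated test cases for this turn
    passed: bool       # whether the candidate passed all of them
    feedback: str      # execution report fed back into the next turn


def generate_verify_loop(
    problem: str,
    generate_code: Callable[[str, List[Turn]], str],             # hypothetical
    generate_tests: Callable[[str, str, List[Turn]], List[str]],  # hypothetical
    run_tests: Callable[[str, List[str]], Tuple[bool, str]],      # hypothetical
    max_turns: int = 20,  # illustrative turn budget
) -> List[Turn]:
    """Alternate generation and verification until tests pass or turns run out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        code = generate_code(problem, history)
        tests = generate_tests(problem, code, history)
        passed, feedback = run_tests(code, tests)
        history.append(Turn(code, tests, passed, feedback))
        if passed:
            break
    return history
```

The returned history holds one entry per turn, so the final candidate is `history[-1].code`, and earlier entries show how the code and tests changed along the way.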
Peer Reviews
Decision·ICLR 2026 Poster
1. The ReVeal framework innovatively introduces explicit verification turns during the training process and uses the gold solution to pre-verify the generated test cases. This method enables the model to explicitly learn self-verification capabilities during training, thereby enhancing the reliability of self-verification.
2. ReVeal introduces a turn-level return mechanism: once the generation is correct, the test generator of the previous round can also receive rewards. This design effectively
1. The paper motivates ReVeal as a way to make self-verification reliable “in realistic environments where public tests are unavailable”. During training, however, the model-generated tests are filtered against a golden solution to guarantee high-quality feedback. At evaluation time (e.g., LiveCodeBench) the final correctness is judged by the benchmark’s own test suites, i.e., still under a setting with reliable canonical tests. Thus, the experiments mainly demonstrate that optimizing verificati
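The training-time filtering that this review refers to, where model-generated tests are checked against a golden solution, might look roughly like the sketch below. It assumes tests are (input, expected output) pairs and that the golden solution is a plain Python callable. Both are simplifications chosen for illustration, and the function name is hypothetical.

```python
from typing import Any, Callable, List, Tuple

TestCase = Tuple[Any, Any]  # (input, expected output); an assumed representation


def filter_tests_with_gold(
    tests: List[TestCase],
    gold_solution: Callable[[Any], Any],
) -> List[TestCase]:
    """Keep only generated tests whose expected output matches the gold solution."""
    kept: List[TestCase] = []
    for test_input, expected in tests:
        try:
            actual = gold_solution(test_input)
        except Exception:
            # The gold solution rejects this input, so the test is treated as invalid.
            continue
        if actual == expected:
            kept.append((test_input, expected))
    return kept
```

Tests that disagree with the golden solution are dropped rather than repaired, so only verified feedback reaches the policy in this sketch.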
• Utilizing reinforcement learning to enhance a model's ability to solve complex problems, self-reflect, and correct code is a crucial research direction.
• The paper is well-structured with a clear and logical flow.
• The experimental dataset is somewhat limited: The main experiments are based on programming competition-style problems. While this effectively tests the model's algorithmic capabilities, it raises questions about the method's generalizability to broader, more realistic real-world development scenarios.
• Some experimental setups are not sufficiently clear or systematic: The experimental comparison section has issues that could affect the reliability of the conclusions, such as inconsistent ev
The paper presents a genuinely novel perspective on multi-turn code generation by explicitly optimizing verification as a co-equal objective with generation. While prior work has explored critic models or execution feedback, ReVeal's approach of jointly training generation and verification within a single model through structured turn-level rewards is innovative. The TAPO algorithm, though building on PPO, introduces a sensible credit assignment mechanism tailored to the generation-verification
The ablation analysis is concerningly limited for a paper making multiple methodological contributions. Table 1 only compares "outcome reward" versus "TAPO with joint rewards" as a monolithic change, without isolating the contribution of individual components. Critical missing ablations include: (1) What happens with only generation rewards or only verification rewards? (2) How does the specific turn-level return formulation in the equation compare to simpler alternatives? (3) What is the effect
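The reviews above describe a turn-level return in which a correct final generation also rewards earlier turns, but the excerpts here do not give the paper's equation. As a point of reference only, the sketch below computes a generic discounted return-to-go over per-turn rewards. It is not TAPO's formulation; the discount factor `gamma` and the function name are assumptions.

```python
from typing import List


def turn_level_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """Discounted return-to-go per turn: G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates the discounted rewards after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With rewards `[0.0, 0.0, 1.0]` and `gamma=0.5`, the returns are `[0.25, 0.5, 1.0]`: the turns before the successful one still receive credit, scaled down by their distance from it.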
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
Methods: Balanced Selection
