Aletheia: What Makes RLVR For Code Verifiers Tick?
Vatsal Venkatkrishna, Indraneil Paul, Iryna Gurevych

TL;DR
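This paper ablates three factors that drive the performance and cost of Reinforcement Learning with Verifiable Rewards (RLVR) for code verifiers: intermediate thinking traces, learning from negative samples, and on-policy training. It introduces Aletheia, a controlled, execution-grounded testbed for comparing training recipes across model sizes.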
Contribution
The paper shows that the optimal RLVR training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, while thinking traces matter most for larger ones. It also introduces Aletheia for controlled, contamination-free analysis.
Findings
On-policy learning is crucial for small verifiers.
Thinking traces are vital for larger verifiers.
Negative samples stabilize training at large sizes.
Abstract
Multi-domain thinking verifiers trained via Reinforcement Learning with Verifiable Rewards (RLVR) are a cornerstone of modern post-training. However, their adoption in code generation has lagged behind execution feedback due to the prohibitive costs of the full RLVR pipeline. In this work, we ablate three primary drivers of RLVR performance and cost: intermediate thinking traces, learning from negative samples, and on-policy training. We introduce Aletheia, a controlled, execution-grounded testbed to facilitate a contamination-free analysis of code verifiers across disparate model sizes and covariate shifts. Our analysis reveals that the optimal training recipe is scale-dependent: on-policy learning is the primary performance driver for small verifiers, whereas thinking traces become the most vital factor for larger sizes. Furthermore, we show that negative samples stabilize training at…
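As an illustration of the ablation design described in the abstract, the three factors can be treated as independent on/off switches. The sketch below is a minimal, hypothetical way to enumerate such configurations; the names `AblationConfig`, `full_grid`, and `leave_one_out` are assumptions made for this example and are not taken from the paper or from Aletheia itself, and the paper may organize its ablations differently.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """Hypothetical switchboard for the three factors named in the abstract."""
    thinking_traces: bool = True   # intermediate thinking traces
    negative_samples: bool = True  # learning from negative samples
    on_policy: bool = True         # on-policy training

    def label(self) -> str:
        # Build a short readable tag such as "think+no-neg+onpol".
        flags = [
            ("think", self.thinking_traces),
            ("neg", self.negative_samples),
            ("onpol", self.on_policy),
        ]
        return "+".join(name if on else f"no-{name}" for name, on in flags)


def full_grid() -> list[AblationConfig]:
    """All 2^3 = 8 on/off combinations of the three factors."""
    return [AblationConfig(*flags) for flags in product([True, False], repeat=3)]


def leave_one_out() -> list[AblationConfig]:
    """The full recipe, followed by variants that disable exactly one factor."""
    full = AblationConfig()
    return [
        full,
        replace(full, thinking_traces=False),
        replace(full, negative_samples=False),
        replace(full, on_policy=False),
    ]


if __name__ == "__main__":
    for cfg in leave_one_out():
        print(cfg.label())
```

A leave-one-out list isolates the effect of each factor against the full recipe at a cost of four runs per model size, whereas the full grid also exposes interactions between factors at a cost of eight runs.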
Taxonomy
Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
