JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao

TL;DR
JURY-RL is a label-free reinforcement learning framework that improves reasoning in language models by decoupling answer proposal from reward verification, which stabilizes training and improves performance.
Contribution
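The paper proposes a label-free RLVR method in which votes from model rollouts propose a candidate answer and a formal verifier (Lean) decides whether that answer earns positive reward. A fallback reward, ResZero, handles inconclusive verification to stabilize training and improve reasoning accuracy.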
Findings
Outperforms other label-free baselines on mathematical reasoning benchmarks.
Achieves pass@1 performance comparable to supervised training.
Demonstrates higher pass@k and response diversity, indicating better generalization.
Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal…
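The propose-then-dispose rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `jury_rewards`, the binary reward values 1.0 and 0.0, and the callable `verify` (a stand-in for a Lean check) are all assumptions made for illustration. Because the abstract's description of ResZero is truncated, the fallback is left as a caller-supplied function `fallback` and is not modeled here.

```python
from collections import Counter
from typing import Callable, List, Sequence


def jury_rewards(
    answers: Sequence[str],
    verify: Callable[[str], bool],
    fallback: Callable[[Sequence[str], str], List[float]],
) -> List[float]:
    """Assign one reward per rollout answer (illustrative sketch only).

    answers:  final answers extracted from the rollouts of one prompt.
    verify:   hypothetical stand-in for the formal verifier; returns True
              only when the candidate answer is successfully verified.
    fallback: hypothetical stand-in for the fallback reward used when
              verification does not succeed; its behaviour is not
              specified here.
    """
    if not answers:
        return []

    # Propose: the plurality-voted answer becomes the single candidate.
    candidate, _count = Counter(answers).most_common(1)[0]

    # Dispose: only a verified candidate can yield positive reward.
    if verify(candidate):
        # Assumed binary reward: 1.0 for rollouts matching the candidate.
        return [1.0 if a == candidate else 0.0 for a in answers]

    # Verification did not succeed: defer to the fallback reward.
    return fallback(answers, candidate)
```

The point of the design is that the vote never grants reward on its own: a popular but wrong answer that fails verification is routed to the fallback rather than being reinforced.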
