JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Xinjie Chen; Biao Fu; Jing Wu; Guoxin Chen; Xinggao Liu; Dayiheng Liu; Minpeng Liao

arXiv:2604.25419 · cs.AI · April 29, 2026

TL;DR

JURY-RL is a label-free reinforcement learning framework that separates answer proposal (a plurality vote over model rollouts) from reward assignment (formal verification in Lean), with the aim of stabilizing training and improving the reasoning performance of language models.

Contribution

The paper proposes a label-free RLVR method in which rollout votes nominate a candidate answer and a formal verifier decides whether that candidate earns positive reward. A fallback reward, ResZero, handles cases where verification is inconclusive.

Findings

01. Outperforms other label-free baselines on mathematical reasoning benchmarks.

02. Achieves pass@1 performance comparable to supervised training.

03. Demonstrates higher pass@k and response diversity, indicating better generalization.

Abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal…
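
The proposal/verification split described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jury_rewards`, the `verify` callable (a stand-in for the Lean check), the 1.0/0.0 reward values, and the tie-breaking rule are all hypothetical. The ResZero fallback is not reproduced here, since the abstract is truncated before describing it; the sketch returns `None` to signal that the fallback would apply.

```python
from collections import Counter
from typing import Callable, List, Optional


def jury_rewards(
    answers: List[str],
    verify: Callable[[str], bool],
) -> Optional[List[float]]:
    """Assign per-rollout rewards for one prompt.

    answers: final answers extracted from each rollout (assumed already normalized).
    verify:  hypothetical stand-in for the formal verifier; returns True only when
             the candidate answer is successfully verified.
    Returns a list of rewards, or None when verification is inconclusive and a
    fallback reward (ResZero in the paper, not sketched here) should be used.
    """
    if not answers:
        return []

    # Votes propose: the plurality answer becomes the single candidate.
    # Assumption: ties go to the answer seen first (Counter's insertion order).
    candidate, _ = Counter(answers).most_common(1)[0]

    # Proofs dispose: only a verified candidate can yield positive reward.
    if verify(candidate):
        # Assumption: binary rewards, 1.0 for rollouts matching the candidate.
        return [1.0 if a == candidate else 0.0 for a in answers]

    # Inconclusive verification: the unverified proposal earns nothing here.
    return None


if __name__ == "__main__":
    rollouts = ["4", "4", "5", "4"]
    # Toy verifier that accepts only "4"; a real one would call a proof checker.
    print(jury_rewards(rollouts, lambda a: a == "4"))  # [1.0, 1.0, 0.0, 1.0]
    print(jury_rewards(rollouts, lambda a: False))     # None
```

The point of the split is that a popular but unverified answer never receives positive reward, which is how the abstract frames the false positives of plain majority voting.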
