JudgeRLVR: Judge First, Generate Second for Efficient Reasoning
Jiangshan Duo, Hanyu Li, Hailin Zhang, Yudong Wang, Sujian Li, Liang Zhao

TL;DR
JudgeRLVR introduces a two-stage judge-then-generate approach that improves reasoning efficiency and accuracy in large language models by learning to discriminate valid solutions before generation.
Contribution
The paper proposes a judge-then-generate paradigm that adds a discriminative judgment stage to RLVR, improving both reasoning efficiency and accuracy.
Findings
Gains +3.7 accuracy points on in-domain math tasks.
Reduces average generation length by 42%.
Improves out-of-domain benchmark performance by +4.5 points.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers.…
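The judging stage described above can be illustrated with a minimal sketch of a verifiable reward for a judge. This is not the authors' implementation: the function names, the exact-match verifier, and the 0/1 reward values are all assumptions made for illustration. The idea shown is only that a judge's verdict on a candidate solution can be scored automatically, because the candidate's final answer can be checked against a known reference answer.

```python
def answer_is_correct(candidate_answer: str, reference_answer: str) -> bool:
    """Hypothetical verifier: exact string match after trimming whitespace."""
    return candidate_answer.strip() == reference_answer.strip()


def judge_reward(verdict: bool, candidate_answer: str, reference_answer: str) -> float:
    """Assumed 0/1 reward: 1.0 when the judge's verdict agrees with the verifier.

    verdict is True if the judge model labeled the candidate solution correct.
    """
    label = answer_is_correct(candidate_answer, reference_answer)
    return 1.0 if verdict == label else 0.0


if __name__ == "__main__":
    # The judge accepts a correct solution: rewarded.
    print(judge_reward(True, " 42 ", "42"))
    # The judge rejects an incorrect solution: also rewarded.
    print(judge_reward(False, "41", "42"))
    # The judge accepts an incorrect solution: no reward.
    print(judge_reward(True, "41", "42"))
```

Under these assumptions the judge is rewarded for rejecting wrong solutions as much as for accepting right ones, which is what makes the signal discriminative rather than generative. The sketch covers only the judging stage.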
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
