Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation
Mengyu Zhang, Siyu Ding, Weichong Yin, Yu Sun, Hua Wu

TL;DR
This paper introduces VMR-RLVR, a novel training strategy that reformulates open-ended tasks into verifiable multiple-choice formats, enabling reinforcement learning with verifiable rewards to improve reasoning in large language models beyond domains with clear solutions.
Contribution
The paper proposes VMR-RLVR, a new method that extends RLVR to open-ended tasks by converting data into verifiable multiple-choice formats, enhancing reasoning capabilities without explicit ground truth.
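The reformulation step can be illustrated with a minimal Python sketch. It assumes the training data comes as preference pairs (a preferred and a dispreferred response to the same question), and that the two responses are placed in random order, as the reviews below describe. The names `MCItem` and `reformulate` and the prompt wording are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class MCItem:
    prompt: str  # multiple-choice prompt shown to the policy model
    answer: str  # "A" or "B": the slot that holds the preferred response


def reformulate(question: str, chosen: str, rejected: str,
                rng: random.Random) -> MCItem:
    """Turn one preference pair into a two-option multiple-choice item."""
    # Randomize the slot of the preferred response so that position
    # alone carries no information about the correct option.
    chosen_first = rng.random() < 0.5
    a, b = (chosen, rejected) if chosen_first else (rejected, chosen)
    # Hypothetical prompt template, for illustration only.
    prompt = (
        f"Question:\n{question}\n\n"
        f"Response A:\n{a}\n\n"
        f"Response B:\n{b}\n\n"
        "Which response is better? Reason step by step, then finish with "
        "'Answer: A' or 'Answer: B'."
    )
    return MCItem(prompt=prompt, answer="A" if chosen_first else "B")
```

Because the correct option is known by construction, each item has a ground-truth label even though the underlying question is open-ended.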
Findings
Improves LLM performance on open-ended tasks by 3.29 points on average.
Effective across seven diverse open-ended benchmarks.
Enables reinforcement learning without explicit ground truth.
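A rule-based reward of this kind might look like the following Python sketch. It assumes the model is asked to end its output with `Answer: A` or `Answer: B`, and that the label of the preferred response is known for each item. The answer format, the regular expression, and the function name `vmr_reward` are illustrative assumptions, not the paper's implementation.

```python
import re

# Assumed output convention: the final choice appears as "Answer: A" or "Answer: B".
ANSWER_RE = re.compile(r"Answer:\s*([AB])\b")


def vmr_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the last stated choice equals the gold label, else 0.0."""
    matches = ANSWER_RE.findall(completion)
    if not matches:
        # No parseable choice: treat as incorrect.
        return 0.0
    # Use the last match so that intermediate reasoning does not count.
    return 1.0 if matches[-1] == gold else 0.0
```

The check is a string comparison against a known label, so no reward model is called at training time. As one review below notes, the labels themselves may still come from a reward model or other preference source.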
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs). However, its success has thus far been largely confined to the mathematical and programming domains, which have clear and automatically checkable outcomes. Reinforcement learning on open-ended tasks (e.g., creative writing and subjective Q&A) continues to rely on reward models due to the absence of verifiable solutions. This raises a key question: how can we extend RLVR to strengthen reasoning in open-ended tasks despite the absence of an unambiguous ground truth? To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation for Reinforcement Learning from Verifiable Rewards (VMR-RLVR), a novel training strategy that restructures open-ended data into verifiable multiple-choice formats, enabling effective…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Novel extension of RLVR to open-ended domains where standard answers don't exist
- Sound mathematical formulation connecting VMR to standard RLVR framework
- Clear problem motivation explaining RLVR's limitation in open-ended domains
- Addresses important gap: extending RLVR beyond STEM domains

- The connection between multiple-choice discrimination and open-ended generation is assumed but not justified
- Only one base model tested (DeepSeek-R1-Distill-14B); crucial to test on other models
- Dependency on high-quality preference data limits applicability
- Heavy reliance on LLM-as-judge metrics which have known biases
* This paper tackles the difficulty of RL training on open-ended questions, an important problem that the community is trying to solve.
* The empirical results of the proposed method seem good, with a noticeable gain compared with the baselines.
* The paper is clear, and readers can easily understand most of the concepts introduced.

* The reward verifies only whether the model selected the pre-labeled preferred response, not that the response is objectively better. For the RM-based subset (line 257), the labels themselves are produced by an automated reward model (URM-LLaMA-3.1-8B). Therefore, the pipeline still inherits RM bias/noise even though the training reward is rule-based. This undercuts the claim that they avoid RM issues (line 063, in figure 2).
* Most reported wins depend on LLM-as-judge (e.g., MTBench, AlpacaEva
It is important and anticipated to extend RLVR to open-ended tasks.
- **The paper lacks methodological novelty, and the method is largely a straightforward combination of existing techniques** (e.g., RLVR, GRPO, and preference-based data formatting) without introducing significant conceptual or architectural innovation. While the VMR is a useful engineering trick, it does not constitute a fundamental advance in reinforcement learning or reasoning modeling. The method section is also overly verbose, repeating well-known formulations without sufficient focus on wh
1. Unlike RLVR’s reliance on explicit ground truths, VMR transforms free-form data into verifiable multiple-choice pairs. It enables rule-based rewards without ambiguous evaluations, solving RLVR’s inapplicability to open-ended scenarios.
2. Across 8 benchmarks, it achieves a 5.99-point average gain over the base model, with standout gains in creative writing. It even outperforms larger 32B-scale models, proving its efficiency in enhancing LLM capabilities.
3. Random response ordering avoids p

1. The method proposed in this paper solves the verification problem in open domains to a certain extent, but it faces significant issues in practical application. It seems that promoting this method to mathematical reasoning would require extremely high costs. The entire method relies on two candidate answers, and the verifier matches answers A and B with the ground truth (GT) option. How can this method be applied to mathematical reasoning where there is a unique GT? Is it necessary to forcibl
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
