NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He

TL;DR
NOVER is a verifier-free reinforcement learning framework that improves the reasoning of language models using only standard supervised fine-tuning data, with no external verifier, and outperforms models distilled from large reasoning models.
Contribution
NOVER presents a novel incentive training method that eliminates the need for external verifiers, broadening applicability and improving performance of language models.
Findings
Outperforms models distilled from large reasoning models by 7.7%.
Enables incentive training across diverse text-to-text tasks.
Supports inverse incentive training for further optimization.
Abstract
Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same…
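To make the idea of an answer-only reward concrete, here is a minimal sketch of the verifier-based setting the abstract describes as the starting point. It is not NOVER's own reward, which this excerpt does not specify. The `<think>`/`<answer>` tag format, the function names, and the use of exact string match as a stand-in for an external verifier are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed output format (hypothetical): reasoning inside <think>...</think>,
# final answer inside <answer>...</answer>.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the last answer span, or None if no answer tag is present."""
    matches = ANSWER_RE.findall(output)
    return matches[-1].strip() if matches else None


def outcome_reward(output: str, reference: str) -> float:
    """Score only the final answer; the reasoning text is never inspected.

    Exact match plays the role of the external verifier here (an assumption).
    """
    answer = extract_answer(output)
    if answer is None:
        return 0.0  # malformed output earns no reward
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    sample = "<think>2 + 2 equals 4.</think><answer>4</answer>"
    print(outcome_reward(sample, "4"))
```

Because the reward never looks at the reasoning span, the policy is free to produce whatever intermediate steps lead to a correct answer. The exact-match check is the part that only exists for tasks with checkable answers, which is the limitation the abstract points to.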
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
