Native Reasoning Models: Training Language Models to Reason on Unverifiable Data
Yuanfu Wang, Zhixuan Liu, Xiangtian Li, Chaochao Lu, Chao Yang

TL;DR
This paper introduces NRT, a training framework enabling language models to reason on unverifiable data by generating their own reasoning traces, eliminating the need for external verifiers and reducing data costs.
Contribution
NRT reframes reasoning as a latent variable optimization problem, improving robustness and performance of language models without relying on human-annotated reasoning data or external verifiers.
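A generic way to write "reasoning as a latent variable" is sketched below. It is a textbook formulation rather than the paper's stated objective, and the symbols are assumptions introduced here: x is the question, y the reference answer, z the model-generated reasoning trace, and θ the model parameters. Marginalizing over traces and applying Jensen's inequality gives a lower bound that can be estimated by sampling traces from the model itself:

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\; \mathbb{E}_{z \sim p_\theta(\cdot \mid x)}
      \big[ \log p_\theta(y \mid x, z) \big]
```

Under this reading, a trace is useful to the extent that it raises the model's likelihood of the reference answer, which needs only question-answer pairs and no external verifier.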
Findings
Achieves state-of-the-art verifier-free reasoning performance
Outperforms standard supervised fine-tuning and prior verifier-free methods
Demonstrates robustness in complex reasoning tasks
Abstract
The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning…
Peer Reviews
Decision: ICLR 2026 Poster
The paper is original. It proposes a more general scheme that unifies three previous verifier-free methods (JLB, Verifree, and RLPR). The paper is well written, and the ablation study is of good quality.
No obvious weak points
- This paper is commendable for its exceptional clarity in presentation, which significantly facilitates the review process. The precise and well-structured descriptions allow readers to quickly grasp the core contributions of the work, as well as to readily identify its strengths and potential limitations.
- The cover letter provides a remarkably clear and concise summary of the manuscript's key findings. This thoughtful presentation is highly efficient, as it enables reviewers to understand t
- The authors did not explain why the method is effective. The "native" responses are derived from the "think-off" mode. Since the quality of the reference answers cannot be guaranteed, I cannot confirm whether this approach is reasonable.
- The experiments in the paper were conducted solely on the Llama-3.1-8B and Llama-3.2-3B models. The models are too small to adequately validate the authors' claims, and the persuasiveness of experimental results from smaller-scale models is relativel
Clear stance vs RLVR: positions NRT for unverifiable tasks, removing dependency on external checkers. Unified reward view with explicit analysis of aggregation pitfalls and a weighted-sum that targets low-probability “hard” answer tokens. Stability measures: GRPO-style relative advantages over an empty-trace baseline; practical rollout details are transparent. Standardized evaluation harness across a broad benchmark suite, improving comparability.
Self-referential reward: Even with improved aggregation, the intrinsic signal is driven by the model’s own likelihood over the ground-truth answer. This can inflate confidence without proving better reasoning. The paper lacks external evidence that higher intrinsic reward correlates with logically valid intermediate steps. Scope vs RLVR: While NRT addresses unverifiable domains, the paper does not show how it competes where verifiers do exist (math/code). The claims are limited to “verifier-fre
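The reward ideas mentioned in the reviews above (a weighted sum that emphasizes low-probability answer tokens, and relative advantages measured against an empty-trace baseline) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the softmax weighting with temperature `tau`, and the choice to scale by the group standard deviation without mean-centering are hypothetical and are not taken from the paper.

```python
import math

def weighted_answer_reward(token_logprobs, tau=1.0):
    """Aggregate per-token log-probs of the reference answer into one scalar.

    Hypothetical weighting: softmax over negative log-probs, so tokens the
    model finds hard (low probability) receive more weight.
    """
    scores = [-lp / tau for lp in token_logprobs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # Weighted sum of token probabilities; hard tokens dominate the result.
    return sum(w * math.exp(lp) for w, lp in zip(weights, token_logprobs))

def relative_advantages(trace_rewards, empty_trace_reward, eps=1e-8):
    """Score each sampled trace against the reward obtained with no trace.

    Assumption: gains are scaled by the group's standard deviation but not
    mean-centered, so a trace that does not beat the empty-trace baseline
    gets a non-positive advantage.
    """
    gains = [r - empty_trace_reward for r in trace_rewards]
    mean = sum(gains) / len(gains)
    std = math.sqrt(sum((g - mean) ** 2 for g in gains) / len(gains))
    return [g / (std + eps) for g in gains]

# Two easy answer tokens and one hard one.
logprobs = [math.log(0.9), math.log(0.9), math.log(0.1)]
reward = weighted_answer_reward(logprobs)
plain_mean = sum(math.exp(lp) for lp in logprobs) / len(logprobs)
advantages = relative_advantages([0.5, 0.3, 0.1], empty_trace_reward=0.3)
```

With `tau=1.0` the weights are proportional to 1/p, so this particular aggregation reduces to the harmonic mean of the token probabilities, which sits well below the plain mean whenever one token is hard. Other weightings would serve the same illustrative purpose.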
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
