Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

Yuanfu Wang; Zhixuan Liu; Xiangtian Li; Chaochao Lu; Chao Yang

arXiv:2602.11549·cs.LG·March 24, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces NRT, a training framework enabling language models to reason on unverifiable data by generating their own reasoning traces, eliminating the need for external verifiers and reducing data costs.

Contribution

NRT reframes reasoning as a latent variable optimization problem, improving robustness and performance of language models without relying on human-annotated reasoning data or external verifiers.
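As a rough illustration of what a latent-variable view of reasoning can look like, the block below gives one standard formulation. The notation is an assumption made for this sketch, not taken from the paper: $x$ is the question, $y$ the reference answer, $z$ a reasoning trace sampled from the model $\pi_\theta$. The paper's exact objective may differ.

```latex
% Marginal likelihood of the answer, with the reasoning trace z as a latent variable
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \bigl[\, p_\theta(y \mid x, z) \,\bigr]
% Jensen's inequality gives a tractable lower bound
  \;\ge\; \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
      \bigl[\, \log p_\theta(y \mid x, z) \,\bigr]
```

Under this reading, a trace is useful to the extent that it raises the model's likelihood of the reference answer, so question-answer pairs alone supply a training signal and no external verifier is needed.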

Findings

01 Achieves state-of-the-art verifier-free reasoning performance

02 Outperforms standard supervised fine-tuning and prior verifier-free methods

03 Demonstrates robustness in complex reasoning tasks

Abstract

The prevailing paradigm for training large reasoning models--combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR)--is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The paper is original: it proposes a more general scheme that unifies three previous verifier-free methods (JLB, Verifree, and RLPR). The paper is well written, and the ablation study is of good quality.

Weaknesses

No obvious weak points

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- This paper is commendable for its exceptional clarity in presentation, which significantly facilitates the review process. The precise and well-structured descriptions allow readers to quickly grasp the core contributions of the work, as well as to readily identify its strengths and potential limitations.
- The cover letter provides a remarkably clear and concise summary of the manuscript's key findings. This thoughtful presentation is highly efficient, as it enables reviewers to understand t

Weaknesses

- The author did not explain why the method is effective. The author's "native" responses are derived from the "think-off" mode. Since the quality of the reference answers cannot be guaranteed, I cannot confirm whether this approach is reasonable.
- The experiments in the paper were conducted solely on the Llama-3.1-8B and Llama-3.2-3B models. The models are too small to adequately validate the author's claims, and the persuasiveness of experimental results from smaller-scale models is relativel

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- Clear stance vs RLVR: positions NRT for unverifiable tasks, removing dependency on external checkers.
- Unified reward view with explicit analysis of aggregation pitfalls and a weighted-sum that targets low-probability “hard” answer tokens.
- Stability measures: GRPO-style relative advantages over an empty-trace baseline; practical rollout details are transparent.
- Standardized evaluation harness across a broad benchmark suite, improving comparability.
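The two mechanisms named in this review, a weighted-sum reward that emphasises hard answer tokens and relative advantages measured against an empty-trace baseline, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `(1 - p)` weighting, the normalisation by the group's standard deviation, and the function names `weighted_answer_reward` and `relative_advantages` are all hypothetical choices made for this sketch.

```python
from statistics import pstdev


def weighted_answer_reward(token_probs):
    """Aggregate per-token probabilities of the reference answer into one reward.

    Assumption: each token is weighted by (1 - p), so low-probability
    ("hard") tokens dominate the sum. The paper's weighting may differ.
    """
    weights = [1.0 - p for p in token_probs]
    total = sum(weights)
    if total == 0.0:
        # Every answer token is already predicted with probability 1.
        return 1.0
    return sum(w * p for w, p in zip(weights, token_probs)) / total


def relative_advantages(trace_rewards, empty_trace_reward, eps=1e-8):
    """Score each sampled trace relative to answering with no reasoning trace.

    Assumption: subtract the empty-trace reward, then divide by the
    standard deviation of the group, in the spirit of GRPO normalisation.
    """
    gains = [r - empty_trace_reward for r in trace_rewards]
    scale = pstdev(gains) + eps
    return [g / scale for g in gains]


# Two easy answer tokens and one hard one: the reward tracks the hard token.
reward = weighted_answer_reward([0.99, 0.98, 0.2])

# Three sampled traces compared against the empty-trace reward of 0.3.
advantages = relative_advantages([0.5, 0.3, 0.1], empty_trace_reward=0.3)
```

In this example a plain average of the token probabilities would be about 0.72, while the weighted reward stays near 0.23, so a trace is credited only if it helps on the token the model finds hard. A trace that does no better than the empty trace receives an advantage of zero.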

Weaknesses

- Self-referential reward: Even with improved aggregation, the intrinsic signal is driven by the model’s own likelihood over the ground-truth answer. This can inflate confidence without proving better reasoning. The paper lacks external evidence that higher intrinsic reward correlates with logically valid intermediate steps.
- Scope vs RLVR: While NRT addresses unverifiable domains, the paper does not show how it competes where verifiers do exist (math/code). The claims are limited to “verifier-fre

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)