NFT: Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Huayu Chen; Kaiwen Zheng; Qinsheng Zhang; Ganqu Cui; Lifan Yuan; Yin Cui; Haotian Ye; Tsung-Yi Lin; Ming-Yu Liu; Jun Zhu; Haoxiang Wang

arXiv:2505.18116·cs.LG·March 3, 2026

TL;DR

This paper introduces Negative-aware Fine-Tuning (NFT), a supervised learning method that enables large language models to self-improve in math reasoning tasks by leveraging negative feedback, bridging the gap between supervised and reinforcement learning approaches.

Contribution

NFT is a novel supervised approach that models negative feedback implicitly, allowing LLMs to reflect on failures and improve without external teachers, matching RL performance.

Findings

01. NFT improves math reasoning performance over standard supervised methods.

02. NFT matches or surpasses RL algorithms like GRPO and DAPO.

03. NFT and RL methods are theoretically equivalent in strict-on-policy training.

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs'…
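The abstract is cut off before it gives the objective, so the following is only a minimal sketch of how an "implicit negative policy" could be written down, not the authors' exact loss. It assumes that the policy that generated the data is a mixture of a positive and a negative policy weighted by the per-question correct rate `r`, which lets the negative likelihood be expressed through the same model being trained. The function names, the sequence-level (rather than token-level) likelihoods, and the clipping constant `eps` are all illustrative assumptions.

```python
import math


def implicit_negative_likelihood(p_old: float, p_pos: float, r: float) -> float:
    """Solve p_old = r * p_pos + (1 - r) * p_neg for p_neg.

    Assumption (not stated in the truncated abstract): the data-collection
    policy is a mixture of a positive and a negative policy, weighted by
    the per-question correct rate r, with 0 < r < 1.
    """
    return (p_old - r * p_pos) / (1.0 - r)


def nft_style_loss(logp_new: float, logp_old: float, is_correct: bool,
                   r: float, eps: float = 1e-6) -> float:
    """Per-answer loss for one sampled answer (hypothetical sketch).

    logp_new:   log-likelihood of the answer under the model being trained
    logp_old:   log-likelihood under the policy that generated the answer
    is_correct: binary verifier signal
    r:          fraction of correct answers sampled for this question
    """
    ratio = math.exp(logp_new - logp_old)
    if is_correct:
        # Positive answers: ordinary maximum likelihood, written as a ratio.
        return -math.log(ratio)
    # Negative answers: maximum likelihood under the implicit negative
    # policy, again relative to the old policy. The ratio is clipped from
    # below (an assumed safeguard) so the logarithm stays defined.
    neg_ratio = max((1.0 - r * ratio) / (1.0 - r), eps)
    return -math.log(neg_ratio)
```

Under these assumptions, both branches give zero loss when the trained model equals the generating policy. For a wrong answer with `r = 0.5`, raising its log-likelihood from -3.0 to -2.5 makes the loss positive and lowering it to -3.5 makes the loss negative, so minimizing the loss pushes probability away from failed answers using only a likelihood objective.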

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

- The authors pose a clear research question: whether the current approach to fine-tuning can be competitive for self-improvement with current approaches to reinforcement learning.
- Motivated by this research question, the authors identify a clear weakness in current approaches to fine-tuning: it is not possible to use the full dataset when fine-tuning only on positive samples.

Weaknesses

- While many aspects of the paper are clear, some of the benefits of the negative-aware fine-tuning method are not obvious. For example, the authors choose a specific implicit policy approach but do not compare against an explicit method that conditions on the reward. Also, in their ablations, the practical modifications seem to have a much larger effect on performance than solely using the negative data.
- The motivating research question does not feature prominently in the experiments, besid

Reviewer 02·Rating 2·Confidence 4

Strengths

- Relevant and timely topic.
- Comprehensive evaluation across a broad range of reasoning datasets.

Weaknesses

- The selling point of NFT is that it is an online SFT method. However, being "online SFT" does not immediately provide conceptual advantages. If one considers RL as a paradigm of learning under interaction with an environment, then online SFT is simply a special case of RL. The advantages therefore need to be shown practically, for example through computational or empirical improvements, rather than assumed conceptually. The paper states:
> Memory Efficiency. NFT is memory-efficient. In p

Reviewer 03·Rating 8·Confidence 3

Strengths

1. The paper is well written and clearly structured, making it easy to follow.
2. Theoretical contribution: the equivalence between NFT and GRPO is well derived and provides valuable conceptual clarity between reinforcement learning and supervised fine-tuning.
3. The method is simple, effective, and memory-efficient, avoiding the complexity of traditional RL pipelines.
4. Unlike DPO, NFT does not require explicit positive–negative pairs, which improves data efficiency and implementation flex

Weaknesses

1. The evaluation is limited to mathematical reasoning, with no experiments on logical or commonsense reasoning tasks, limiting evidence of generalization.
2. While Table 2 shows strong performance compared to other RL-based methods, the paper lacks quantitative analysis of efficiency (e.g., training FLOPs, wall-clock time, or GPU hours). Since NFT claims to be simpler and more efficient than RL-based training, this measurement is important to support the claim.
3. Additional discussion on fai
