TL;DR
This paper introduces Negative-aware Fine-Tuning (NFT), a supervised learning method that enables large language models to self-improve in math reasoning tasks by leveraging negative feedback, bridging the gap between supervised and reinforcement learning approaches.
Contribution
NFT is a novel supervised approach that models negative feedback implicitly, allowing LLMs to reflect on failures and improve without external teachers, matching RL performance.
Findings
NFT improves math reasoning performance over standard supervised methods.
NFT matches or surpasses RL algorithms like GRPO and DAPO.
NFT and RL methods are theoretically equivalent under strictly on-policy training.
Abstract
Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs'…
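The implicit negative policy described in the abstract can be illustrated with a small sketch. It assumes, as a reading of the abstract rather than the paper's exact objective, that the sampling policy for a question is a mixture of a positive and a negative policy weighted by the question's correct rate, so the negative policy can be written in terms of the positive model being trained. The function names, the sequence-level formulation, and the clamp `eps` are hypothetical choices for illustration.

```python
import math


def implicit_negative_prob(p_old, p_pos, correct_rate):
    """Probability of an answer under the implicit negative policy.

    Assumption (not stated in the source): the sampling policy decomposes as
    p_old = correct_rate * p_pos + (1 - correct_rate) * p_neg,
    which is solved here for p_neg.
    """
    if not 0.0 < correct_rate < 1.0:
        raise ValueError("correct_rate must lie strictly between 0 and 1")
    return (p_old - correct_rate * p_pos) / (1.0 - correct_rate)


def nft_style_loss(logp_pos, logp_old, is_correct, correct_rate, eps=1e-6):
    """Sequence-level, negative-aware likelihood loss (illustrative only).

    logp_pos: log-probability of the answer under the model being trained.
    logp_old: log-probability under the policy that sampled the answer.
    """
    if not 0.0 < correct_rate < 1.0:
        raise ValueError("correct_rate must lie strictly between 0 and 1")
    ratio = math.exp(logp_pos - logp_old)
    if is_correct:
        # Verified-correct answers: plain maximum likelihood on the positive model.
        return -math.log(ratio)
    # Verified-wrong answers: maximum likelihood on the implicit negative policy,
    # expressed relative to the sampling policy. The clamp is a hypothetical
    # guard that keeps the logarithm defined when the ratio grows too large.
    neg_ratio = (1.0 - correct_rate * ratio) / (1.0 - correct_rate)
    return -math.log(max(neg_ratio, eps))
```

In this sketch, lowering the trained model's likelihood of a wrong answer reduces the loss on that answer, so negative samples contribute a training signal through the same model instead of being discarded.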
Peer Reviews
Decision: ICLR 2026 Poster
- The authors posit a clear research question to explore in their paper: the current approach to fine-tuning may be competitive for self-improvement compared to current approaches to reinforcement learning.
- Motivated by this research question, the authors identify a clear weakness in current approaches to fine-tuning: it is not possible to use the full dataset when fine-tuning only on positive samples.
- While many aspects of the paper are clear, some of the benefits of their negative-aware fine-tuning method are not obvious. For example, they choose to use a specific implicit policy approach, but don't compare against an explicit method that conditions on the reward. Also, in their ablations, their practical modifications seem to have a much larger effect on performance than solely using the negative data.
- The motivating research question does not feature prominently in the experiments, besid
- Relevant and timely topic.
- Comprehensive evaluation across a broad range of reasoning datasets.
- The selling point of NFT is that it is an online SFT method. However, being "online SFT" does not immediately provide conceptual advantages. If one considers RL as a paradigm of learning under interaction with an environment, then online SFT is simply a special case of RL. The advantages therefore need to be shown practically, for example, through computational or empirical improvements, rather than assumed conceptually. The paper states: > Memory Efficiency. NFT is memory-efficient. In p
1. The paper is well written and clearly structured, making it easy to follow.
2. Theoretical contribution: the equivalence between NFT and GRPO is well derived and provides valuable conceptual clarity between reinforcement learning and supervised fine-tuning.
3. The method is simple, effective, and memory-efficient, avoiding the complexity of traditional RL pipelines.
4. Unlike DPO, NFT does not require explicit positive–negative pairs, which improves data efficiency and implementation flex
1. The evaluation is limited to mathematical reasoning, with no experiments on logical or commonsense reasoning tasks, limiting evidence of generalization.
2. While Table 2 shows strong performance compared to other RL-based methods, the paper lacks quantitative analysis of efficiency (e.g., training FLOPs, wall-clock time, or GPU hours). Since NFT claims to be simpler and more efficient than RL-based training, this measurement is important to support the claim.
3. Additional discussion on fai
