AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy
Zihan Liu, Zhuolin Yang, Yang Chen, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

TL;DR
This paper explores how combining supervised fine-tuning (SFT) and reinforcement learning (RL) strengthens reasoning models, showing that scaling the SFT data and choosing the sampling temperature carefully lead to state-of-the-art performance on math and code benchmarks.
Contribution
It systematically studies the synergy between SFT and RL and offers strategies for data scaling and temperature tuning that improve reasoning capabilities.
Findings
Scaling the number of SFT prompts improves reasoning performance more than scaling the number of responses per prompt
Optimal sampling temperature balances exploration and exploitation
The new model outperforms previous state-of-the-art models
Abstract
In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted,…
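The exploration/exploitation question in (ii) can be illustrated with a toy calculation: dividing logits by a higher sampling temperature flattens the token distribution and raises its entropy (more exploration), while a lower temperature sharpens it (more exploitation). The sketch below is only an illustration under stated assumptions; the logits, the candidate temperatures, and the helper `pick_temperature` are hypothetical and are not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def pick_temperature(logits, target_entropy, candidates):
    """Hypothetical helper: choose the candidate temperature whose
    resulting entropy is closest to a chosen target entropy."""
    return min(
        candidates,
        key=lambda t: abs(entropy(softmax_with_temperature(logits, t)) - target_entropy),
    )


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # assumed values, for illustration only
    for t in (0.6, 0.85, 1.0):
        h = entropy(softmax_with_temperature(toy_logits, t))
        print(f"temperature={t:.2f} entropy={h:.3f} nats")
```

In an actual RL run the entropy would be measured on the policy's token distributions over sampled rollouts rather than on a single toy logit vector, and the target value would be chosen empirically for the given SFT initialization.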
Peer Reviews
Decision: ICLR 2026 Poster
- This is a well-executed empirical paper that systematically probes how supervised fine-tuning and reinforcement learning interact for long chain-of-thought reasoning, with clear takeaways (e.g., entropy/temperature rule-of-thumb, stage-wise curriculum, and overlong-filtering policy). The paper offers generalizable training guidance.
- The empirical program is rigorous and targeted: clearly defined questions, extensive ablations, and transparent evaluation settings.
- The model is evaluated o
- Please report GPU-days per stage to contextualize the recipe’s practicality.
- Adding one or more non-Qwen-7B models would validate generality.
- The authors did not release code, which may limit reproducibility.
- In Figure 4, why is the “final accuracy achieved at the end of each training stage” for different models not shown at the same step? Does this mean that different models require a different number of training steps in a given stage?
- Regarding the results in Figure 4, have you conducte
1. Compared with most technical reports that provide only brief descriptions of SFT and RL details, this paper conducts a systematic study on the interplay and integration between SFT and RL using extensive resources. This work is highly significant for understanding the training dynamics of RL initialized from SFT in the research community.
2. The authors analyze SFT data in detail, including the number of prompts, the number of responses per prompt, and their effects on SFT performance, acr
The experiments are mainly conducted on Qwen2.5-based 7B models. It is unclear whether the authors tested larger models, such as 32B, to verify the generalizability of their conclusions. However, given the large-scale data and the computational resources required, the absence of such experiments is understandable.
1. The paper presents a comprehensive SFT and RL recipe for a 7B reasoning model.
2. Ablation studies covering different aspects are conducted, making this a solid technical work that provides valuable empirical insights.
3. The resulting model achieves competitive performance on both math and code benchmarks.
1. The question "Is the Math-only Stage-1 (8K) truly necessary" is not answered in the submission.
Code & Models
- 🤗 nvidia/AceReason-Nemotron-14B · model · 22k dl · ♡ 96
- 🤗 nvidia/AceReason-Nemotron-7B · model · 5.2k dl · ♡ 20
- 🤗 nvidia/AceReason-Nemotron-1.1-7B · model · 5.1k dl · ♡ 57
- 🤗 gabriellarson/AceReason-Nemotron-1.1-7B-GGUF · model · 173 dl · ♡ 2
- 🤗 lmstudio-community/AceReason-Nemotron-1.1-7B-GGUF · model · 49 dl · ♡ 1
- 🤗 QuantFactory/AceReason-Nemotron-1.1-7B-GGUF · model · 46 dl · ♡ 2
- 🤗 Prince-1/AceReason-Nemotron-1.1-7B-Onnx · model
- 🤗 Prince-1/AceReason-Nemotron-14B-Onnx · model · ♡ 1
- 🤗 RivianG/AceReason-Nemotron-1.1-7B_quant · model
- 🤗 RivianG/AceReason-Nemotron-1.1-7B-bnb-4bit · model
Taxonomy
Topics: Teaching and Learning Programming
Methods: Shrink and Fine-Tune
