AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

Zihan Liu; Zhuolin Yang; Yang Chen; Chankyu Lee; Mohammad Shoeybi; Bryan Catanzaro; Wei Ping

arXiv:2506.13284·cs.CL·June 17, 2025

TL;DR

This paper studies how combining supervised fine-tuning (SFT) and reinforcement learning (RL) improves reasoning models, showing that scaling the SFT data and choosing an appropriate RL sampling temperature lead to state-of-the-art performance on math and code benchmarks.

Contribution

It examines the synergy between SFT and RL and offers strategies for SFT data scaling and RL temperature tuning that significantly improve reasoning capabilities.

Findings

01. Scaling prompts improves reasoning performance

02. Optimal sampling temperature balances exploration and exploitation

03. The new model outperforms previous state-of-the-art models

Abstract

In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt. Both approaches yield notable improvements in reasoning performance, with scaling the number of prompts resulting in more substantial gains. We then explore the following questions regarding the synergy between SFT and RL: (i) Does a stronger SFT model consistently lead to better final performance after large-scale RL training? (ii) How can we determine an appropriate sampling temperature during RL training to effectively balance exploration and exploitation for a given SFT initialization? Our findings suggest that (i) holds true, provided effective RL training is conducted,…
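Question (ii) turns on a general property of softmax sampling: dividing the logits by a higher temperature flattens the token distribution and raises its entropy (more exploration), while a lower temperature sharpens it (more exploitation). The sketch below only illustrates that relationship. It is not the authors' training code, and the function names and logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits divided by a temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means more diverse samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token logits, chosen only for illustration.
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.6, 1.0, 1.4):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T={temperature:.1f}  entropy={entropy(probs):.3f}  top-prob={max(probs):.3f}")
```

Running the loop shows entropy rising and the top-token probability falling as the temperature increases, which is the trade-off a temperature choice for a given SFT initialization has to manage.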

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 4

Strengths

- This is a well-executed empirical paper that systematically probes how supervised fine-tuning and reinforcement learning interact for long chain-of-thought reasoning, with clear takeaways (e.g., entropy/temperature rule-of-thumb, stage-wise curriculum, and overlong-filtering policy). The paper offers generalizable training guidance.
- The empirical program is rigorous and targeted: clearly defined questions, extensive ablations, and transparent evaluation settings.
- The model is evaluated o

Weaknesses

- Please report GPU-days per stage to contextualize the recipe’s practicality.
- Adding one or more non-Qwen-7B models would validate generality.
- The authors did not release code, which may limit reproducibility.
- In Figure 4, why is the “final accuracy achieved at the end of each training stage” for different models not shown at the same step? Does this mean that different models require a different number of training steps in a given stage?
- Regarding the results in Figure 4, have you conducte

Reviewer 02·Rating 8·Confidence 4

Strengths

1. Compared with most technical reports, which provide only brief descriptions of SFT and RL details, this paper conducts a systematic study of the interplay and integration between SFT and RL using extensive resources. This work is highly significant for the research community's understanding of the training dynamics of RL initialized from SFT.
2. The authors analyze SFT data in detail (including the number of prompts, the number of responses per prompt, and their effects on SFT performance) acr

Weaknesses

The experiments are mainly conducted on Qwen2.5-based 7B models. It is unclear whether the authors tested larger models, such as 32B, to verify the generalizability of their conclusions. However, given the large-scale data and the computational resources required, the absence of such experiments is understandable.

Reviewer 03·Rating 6·Confidence 4

Strengths

1. The paper presents a comprehensive SFT and RL recipe for a 7B reasoning model.
2. Ablation studies are conducted from several angles, making this a solid technical work that provides valuable empirical insights.
3. The resulting model achieves competitive performance on both math and code benchmarks.

Weaknesses

1. The question "Is the Math-only Stage-1 (8K) truly necessary" is not answered in the submission.

Taxonomy

Topics·Teaching and Learning Programming

Methods·Shrink and Fine-Tune