Why Distillation can Outperform Zero-RL: The Role of Flexible Reasoning
Xiao Hu, Xingyu Lu, Liyuan Mao, YiFan Zhang, Tianke Zhang, Bin Wen, Fan Yang, Tingting Gao, Guorui Zhou

TL;DR
This paper demonstrates that a simple distillation approach with minimal data can outperform zero-RL (RL applied directly to a base model) in enhancing large language models' reasoning, by promoting more flexible and advanced cognitive behaviors.
Contribution
It shows that distillation with only 920 examples can surpass zero-RL in reasoning ability, emphasizing the role of flexible reasoning and cognitive behaviors.
Findings
Distillation outperforms zero-RL with minimal data.
Distilled models exhibit more flexible reasoning behaviors.
Enhanced cognitive behaviors like Multi-Perspective Thinking are observed.
Abstract
Reinforcement learning (RL) has played an important role in improving the reasoning ability of large language models (LLMs). Some studies apply RL directly to smaller base models (known as zero-RL) and also achieve notable progress. However, in this paper, we show that using only 920 examples, a simple distillation method based on the base model can clearly outperform zero-RL, which typically requires much more data and computational cost. By analyzing the token frequency in model outputs, we find that the distilled model shows more flexible reasoning. It uses anthropomorphic tokens and logical connectors much more often than the zero-RL model. Further analysis reveals that distillation enhances the presence of two advanced cognitive behaviors: Multi-Perspective Thinking or Attempting and Metacognitive Awareness. Frequent occurrences of these two advanced cognitive behaviors…
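The token frequency comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the two word lists, the regex tokenizer, and the function names `tokenize` and `category_rates` are all assumptions made for illustration, and the paper's actual token categories may differ.

```python
import re
from collections import Counter

# Hypothetical word lists for illustration only; not taken from the paper.
ANTHROPOMORPHIC = {"wait", "hmm", "okay", "maybe", "perhaps"}
CONNECTORS = {"but", "however", "therefore", "since", "because", "alternatively"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_rates(outputs):
    """Return occurrences per 1,000 word tokens for each category."""
    counts = Counter()
    total = 0
    for text in outputs:
        tokens = tokenize(text)
        total += len(tokens)
        for tok in tokens:
            if tok in ANTHROPOMORPHIC:
                counts["anthropomorphic"] += 1
            elif tok in CONNECTORS:
                counts["connector"] += 1
    if total == 0:
        # No tokens at all: report zero rates instead of dividing by zero.
        return {"anthropomorphic": 0.0, "connector": 0.0}
    return {
        name: 1000 * counts[name] / total
        for name in ("anthropomorphic", "connector")
    }
```

Running `category_rates` on the outputs of a distilled model and of a zero-RL model for the same prompts would give two rate dictionaries that can be compared directly. Normalizing per 1,000 tokens keeps the comparison fair when one model writes much longer responses than the other.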
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper addresses a practically relevant question about distillation versus zero-RL for smaller models, showing that distillation can outperform zero-RL in some scenarios. The findings have direct implications for practitioners.
- Token frequency analysis quantifies stylistic differences between approaches. The conceptualization of "Multi-Perspective Thinking" and "Metacognitive Awareness" offers a framework for understanding machine reasoning, and the token-restriction experiment demonstrat
- The success of distillation depends on access to a superior teacher model (DeepSeek R1) with existing reasoning capability. Table 13 shows distilling from GPT-4o yields poor results. This prerequisite limits the method's scope to scenarios where such expert teachers exist, yet receives insufficient emphasis in the main text.
- Several confounding factors complicate interpretation. The 920 AIME problems represent extremely difficult competition mathematics from a single domain: is the gain from
Empirical Contribution: The core finding that 920 distilled examples can match or exceed zero-RL models trained on 10-50× more data is practically valuable and challenges current assumptions about the necessity of expensive RL training for smaller models.
Teacher Model Ablation (Table 13): This is a strong experiment showing that distilling from GPT-4o (which lacks flexible reasoning patterns) provides minimal benefit while distilling from QwQ-32B and DeepSeek R1 works well. This supports the c
The core comparison is unfair and the paper's framing is a bit misleading. The title and abstract claim distillation "outperforms" zero-RL, but: Distillation uses DeepSeek R1, which itself required massive computational resources and RL training to develop. This is equivalent to comparing "learning from an expert's pre-computed solutions" versus "solving problems from scratch". The paper should compare total computational budgets including teacher training costs, not just student training. The a
- The framing of the paper is clear: the authors state a hypothesis (can distillation with limited samples outperform zero-RL?) and then design experiments to test it.
- The authors study the reasons behind the result and identify linguistic patterns and behaviours that support the hypothesis.
- The experiments are controlled, and the same model is used to test the hypothesis by training it with distillation vs zero-RL.
- The paper is readable and the conclusion is c
1. I think the paper's central idea is already known to the community, so there is nothing new to get from the paper. It is well known that zero-RL is either quite hard to start (it faces a cold-start problem for small to mid-sized models, where the right answer does not appear in the rollouts) or quite expensive if started without SFT (the DeepSeek R1 paper mentions this briefly, noting that zero-RL works but that it would be better to warm it up with some samples, otherwise quite expensiv
This work studies an interesting topic: comparing distillation and zero-RL training. The authors conduct an in-depth analysis to explore why distillation is better than zero-RL training. The writing is good.
1. Insufficient evaluations. This work only evaluates on 32B models. More evaluations should be conducted on smaller models and other model series, like Qwen-2.5/3-7B/8B/14B or LLaMA models. All experiments conducted in this work are based on Qwen2.5-32B, which cannot reflect the generalizability of the analysis.
2. I am a bit confused about the motivation of the work. It is good to reveal that the distilled model is better than zero-RL models for Qwen2.5-32B series models. But, does it commonly
Taxonomy
Topics: Logic, Reasoning, and Knowledge
