Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang

TL;DR
ReLIFT is a novel training approach that combines reinforcement learning and online fine-tuning to improve large language models' reasoning abilities, especially on questions beyond their initial knowledge, achieving significant performance gains.
Contribution
The paper introduces ReLIFT, a new method that interleaves RL with online fine-tuning, enabling models to learn new reasoning skills beyond their original capabilities.
Findings
ReLIFT improves performance by over 5.2 points on multiple benchmarks.
ReLIFT outperforms pure RL and SFT methods with less demonstration data.
ReLIFT effectively enhances reasoning on out-of-distribution questions.
Abstract
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling…
Peer Reviews
Decision: ICLR 2026 Poster
* The paper first analyzes the training dynamics of RL versus SFT across difficulty levels, showing that RL mainly preserves existing skills while SFT can unlock previously unsolved questions. It then proposes ReLIFT, which runs RL as the primary loop while streaming in SFT steps only on hard questions. The method is grounded in these observations, which makes sense.
* In the experiments, the paper includes careful controlled comparisons of pure RL, pure SFT, and multiple hy
* ReLIFT is built on the assumption that high-quality CoT already exists for smaller LLMs to benefit from (it is actually a form of distillation from large LLMs like DeepSeek-R1). Although the paper also mentions in line 202 that such high-quality CoT may come from human annotators, a detailed discussion of "human annotation" is missing. Therefore, the scope of this paper seems limited to how to efficiently distill from larger LLMs to improve the performance of smaller LLMs, r
1. The paper clearly defines the GRPO objective and the alternating SFT loss with entropy regularization (α), and the training curves show how reward, length, and entropy evolve over steps, supporting a mechanism of continued exploration and steady gains.
2. The method is easy to follow thanks to the flow diagram (Figure 2), the difficulty stratification, and the explicit buffer trigger condition (Buffer_ft >= M), which together make reproduction straightforward.
1. The OOD evaluation relies only on MMLU-Pro and the main experiments focus on math reasoning; please add code, science QA, and multi-step commonsense tasks to test adaptability under different verifiable rewards.
2. Beyond **acc(q)=0**, please evaluate thresholds based on uncertainty, length anomalies, or self-contradictions in the CoT, and formalize the adaptive SFT trigger as a gating function of reward or entropy; report learning curves under different gating hyperparameters.
3. Although
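The interleaving that the reviews describe (RL as the primary loop, hard questions with acc(q)=0 collected in a buffer, an SFT step once Buffer_ft >= M) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the class `InterleavedTrainer`, the callables `rl_step`, `sft_step`, and `group_accuracy`, and the choice to run the RL update on the whole batch are all assumptions made for the example, and the real GRPO and SFT updates are replaced by placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class InterleavedTrainer:
    """Toy scheduler: RL runs every batch, SFT fires when the hard buffer fills."""

    buffer_threshold: int                  # M in the trigger Buffer_ft >= M (value is hypothetical)
    rl_step: Callable[[List[str]], None]   # placeholder for a GRPO-style policy update
    sft_step: Callable[[List[str]], None]  # placeholder for an SFT update on demonstrations
    buffer_ft: List[str] = field(default_factory=list)

    def train_on_batch(
        self, questions: List[str], group_accuracy: Callable[[str], float]
    ) -> bool:
        """Run one RL step; return True if an SFT step was also triggered."""
        # Questions where no rollout in the group was correct count as "hard".
        hard = [q for q in questions if group_accuracy(q) == 0.0]
        self.buffer_ft.extend(hard)

        # Assumption: the RL update sees the whole batch.
        self.rl_step(questions)

        if len(self.buffer_ft) >= self.buffer_threshold:
            self.sft_step(list(self.buffer_ft))
            self.buffer_ft.clear()
            return True
        return False


# Usage with made-up accuracies and logging stand-ins for the two updates.
log = []
trainer = InterleavedTrainer(
    buffer_threshold=2,
    rl_step=lambda qs: log.append(("rl", len(qs))),
    sft_step=lambda qs: log.append(("sft", len(qs))),
)
accuracy = {"q1": 0.5, "q2": 0.0, "q3": 0.0, "q4": 1.0}
fired_first = trainer.train_on_batch(["q1", "q2"], accuracy.get)
fired_second = trainer.train_on_batch(["q3", "q4"], accuracy.get)
```

In this toy run the first batch leaves one hard question in the buffer, and the second batch brings it to two, which meets the threshold and triggers a single SFT step. A gating function of reward or entropy, as the reviewer suggests, would replace the `group_accuracy(q) == 0.0` test.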
The paper presents a method that is grounded in a detailed and revealing analysis about the differing impact of RL and SFT. The core results show that the proposed method yields benefits in-domain (math benchmarks) and on a single out-of-domain benchmark (MMLU-Pro). The analysis of training dynamics and the ablation studies effectively reveal how the proposed method mitigates the limitations of conducting SFT or RL alone, or as distinct training stages.
ReLIFT is considerably less effective on LLaMA-3.1-8B than on Qwen models. It would be useful to include the full set of baselines for a model outside of the Qwen model family, so as to ensure the generality of the method. Currently, only SFT or RL alone, in addition to the instruct variant, are used as baselines for the Llama model. Using a fixed group size of 8 for all experiments means that it is unclear whether the problem difficulty classes assigned during online RL hold for a larger grou
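The group-size concern can be made concrete with a short calculation. If a question has per-rollout success probability p and rollouts are treated as independent (an assumption made here for illustration, not a claim from the paper), the chance that all G rollouts fail, so that the question is labeled acc(q)=0, is (1 - p)^G. The function name `prob_all_fail` and the value p = 0.1 are hypothetical.

```python
def prob_all_fail(p_correct: float, group_size: int) -> float:
    """Probability that every one of `group_size` independent rollouts is wrong."""
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return (1.0 - p_correct) ** group_size


# A question the model solves 10% of the time, at three group sizes.
for g in (8, 16, 32):
    print(g, round(prob_all_fail(0.1, g), 3))
```

Under these assumptions, a question solvable 10% of the time is still labeled unsolved about 43% of the time at a group size of 8, and far less often at larger group sizes, which is why the difficulty classes may shift with G.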
Code & Models
- 🤗 RoadQAQ/ReLIFT-Qwen2.5-Math-1.5B-Zero (model) · 109 dl
- 🤗 RoadQAQ/ReLIFT-Qwen2.5-7B-Zero (model) · 2 dl · ♡ 2
- 🤗 RoadQAQ/ReLIFT-Qwen2.5-Math-7B-Zero (model) · 7 dl
- 🤗 RoadQAQ/Qwen2.5-Math-7B-16k-think (model) · 317 dl
- 🤗 RoadQAQ/Qwen2.5-Math-1.5B-16k-think (model) · 11k dl
- 🤗 RoadQAQ/Qwen2.5-7B-think (model)
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
Methods: Shrink and Fine-Tune · Balanced Selection
