SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning
Yihao Liu, Shuocheng Li, Lang Cao, Yuhang Xie, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

TL;DR
SuperRL is a novel training framework that combines reinforcement learning and supervised fine-tuning to improve reasoning in language models, especially in environments with sparse rewards.
Contribution
It introduces an adaptive method that switches between RL and SFT, effectively utilizing offline data to enhance learning efficiency and performance.
Findings
SuperRL outperforms vanilla RL in sample efficiency.
SuperRL achieves better generalization on reasoning benchmarks.
SuperRL demonstrates increased robustness under sparse rewards.
Abstract
Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency,…
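The fallback rule described in the abstract can be sketched as a per-instance decision. This is a minimal illustration, not the authors' implementation: the names `choose_update` and `plan_batch` are hypothetical, and rewards are assumed to be plain numeric scores with one list of rollout rewards per prompt.

```python
from typing import Dict, List, Sequence


def choose_update(rewards: Sequence[float]) -> str:
    """Pick the update type for one training instance.

    Returns "sft" when every rollout received zero reward (no learning
    signal for RL), otherwise "rl". Hypothetical helper for illustration.
    """
    if len(rewards) == 0:
        raise ValueError("at least one rollout reward is required")
    return "sft" if all(r == 0 for r in rewards) else "rl"


def plan_batch(batch_rewards: Dict[str, List[float]]) -> Dict[str, str]:
    """Map each prompt id to the update it would receive under the rule."""
    return {pid: choose_update(rs) for pid, rs in batch_rewards.items()}


if __name__ == "__main__":
    # Assumed toy data: rollout rewards per prompt.
    batch = {
        "prompt_a": [0.0, 0.0, 0.0, 0.0],  # all rollouts failed
        "prompt_b": [0.0, 1.0, 0.0, 0.0],  # one rollout succeeded
    }
    print(plan_batch(batch))
```

In this sketch, a prompt routed to "sft" would be trained on its curated offline solution, while a prompt routed to "rl" would receive the usual policy-gradient update. How the two losses are batched together is not specified in this excerpt.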
Peer Reviews
Decision·Submitted to ICLR 2026
1. **Simplicity and Effectiveness:** SuperRL's core concept, a conditional switch between RL and SFT based on the observed reward signal, is elegantly simple. It requires minimal hyperparameter tuning and avoids the complexity of multi-stage pipelines or manually interpolated loss functions, which significantly enhances reproducibility and scalability.
2. **Targeted Solution to the Sparsity Problem:** The framework directly addresses the primary challenge of applying RL in reasoning tasks: sparse
1. **Missing Crucial Baseline Comparison:** The paper lacks a comparison to a conceptually simple, yet potentially competitive, baseline method. Specifically, an ablation where, after SFT and during the RL phase, the model simply drops or masks any trajectories/prompts that yield zero reward/advantage (i.e., treating them as noise and performing the RL update only on rewarded trajectories) should be included. This comparison is necessary to demonstrate that the benefit of SuperRL comes specifica
The strengths of the paper are outlined below:
1. The paper proposes a simple approach to unify SFT and RL by using zero reward or zero advantage as a switching signal.
2. The paper proposes two fallback mechanisms for triggering SFT during RL training, based on advantage and reward respectively: SuperRL-A and SuperRL-R. The authors validate both mechanisms and provide empirical guidelines for choosing one over the other.
3. SuperRL experimental results demonstrate strong performance both for in-do
The weaknesses of the paper can be summarized as follows:
1. It is unclear how the trajectories for SFT are selected when the advantage or reward approaches zero. Are the samples with zero advantage or reward directly used for SFT?
2. SFT typically requires high-quality samples. If trajectories with zero reward are used for SFT, how can they be considered high-quality?
3. In terms of stability, SuperRL’s improvement is modest. Although there are some reductions in variance, the change in rang
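The difference between a reward-based and an advantage-based switching signal, as mentioned in this review, can be illustrated with a small sketch. It assumes advantages are computed by subtracting the group mean reward from each rollout's reward, as in group-relative policy-gradient methods. That assumption, the tolerance `tol`, and all function names are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """Mean-centered advantages within one prompt's rollout group (assumed form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def reward_trigger(rewards: Sequence[float]) -> bool:
    """Reward-based signal: fire only when every rollout reward is zero."""
    return all(r == 0 for r in rewards)


def advantage_trigger(rewards: Sequence[float], tol: float = 1e-8) -> bool:
    """Advantage-based signal: fire when every advantage is (near) zero.

    Under mean-centering this happens whenever all rewards in the group
    are equal, including the case where every rollout succeeds.
    """
    return all(abs(a) <= tol for a in group_advantages(rewards))


if __name__ == "__main__":
    for rewards in ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 1.0, 0.0]):
        print(rewards, reward_trigger(rewards), advantage_trigger(rewards))
```

Under this assumption the two signals agree when all rollouts fail, but only the advantage-based signal also fires when all rollouts receive the same non-zero reward. Whether SuperRL-A behaves this way in the paper is not stated in this excerpt.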
This is a well-written paper that is accessible to experts in adjacent fields. The hypothesis of the paper is clear and well motivated. The performance of the proposed approach is evaluated and compared against a selection of alternative methods. The results indicate some improvements over alternative methods on different benchmarks. The evaluation shows that the results are comparable to state-of-the-art models.
The paper claims that a “unified training framework” is proposed for switching between RL and SFT. However, the methodology in Section 3 is limited to the extremes of zero reward vs. non-zero reward. This results in a heuristic switching mechanism between the two paradigms. While this switching mechanism is reasonably well justified, it is unclear to what extent it enables a “unified framework that adaptively combines” RL and SFT. While the paper is very easy to read and the results are intuiti
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
