SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning

Yihao Liu; Shuocheng Li; Lang Cao; Yuhang Xie; Mengyu Zhou; Haoyu Dong; Xiaojun Ma; Shi Han; Dongmei Zhang

arXiv:2506.01096 · cs.AI · August 11, 2025

Open Access 3 Reviews

TL;DR

SuperRL is a novel training framework that combines reinforcement learning and supervised fine-tuning to improve reasoning in language models, especially in environments with sparse rewards.

Contribution

It introduces an adaptive method that switches between RL and SFT, effectively utilizing offline data to enhance learning efficiency and performance.

Findings

01. SuperRL outperforms vanilla RL in sample efficiency.

02. SuperRL achieves better generalization on reasoning benchmarks.

03. SuperRL demonstrates increased robustness under sparse rewards.

Abstract

Large language models are increasingly used for complex reasoning tasks where high-quality offline data such as expert-annotated solutions and distilled reasoning traces are often available. However, in environments with sparse rewards, reinforcement learning struggles to sample successful trajectories, leading to inefficient learning. At the same time, these offline trajectories that represent correct reasoning paths are not utilized by standard on-policy reinforcement learning methods. We introduce SuperRL, a unified training framework that adaptively alternates between RL and SFT. Whenever every rollout for a given instance receives zero reward, indicating the absence of a learning signal, SuperRL falls back to SFT on the curated offline data. Extensive experiments across diverse reasoning benchmarks show that SuperRL surpasses vanilla RL by delivering higher sample efficiency,…
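The switching rule described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each instance has a list of scalar rollout rewards, and the names `choose_update` and `plan_batch` are hypothetical. The RL and SFT loss computations themselves are omitted.

```python
from typing import Dict, Sequence


def choose_update(rewards: Sequence[float]) -> str:
    """Pick the update type for one instance from its rollout rewards.

    Assumption (from the abstract): if every rollout received zero reward,
    there is no learning signal for RL, so fall back to SFT on offline data.
    """
    if not rewards:
        raise ValueError("need at least one rollout reward")
    # All-zero rewards -> no RL signal -> supervised fallback.
    return "sft" if all(r == 0 for r in rewards) else "rl"


def plan_batch(batch_rewards: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """Map each instance id (hypothetical key) to 'rl' or 'sft'."""
    return {iid: choose_update(r) for iid, r in batch_rewards.items()}


if __name__ == "__main__":
    plan = plan_batch({"q1": [0, 0, 0, 0], "q2": [0, 1, 0, 0]})
    print(plan)
```

In this sketch the decision is made per instance, so a single batch can mix RL updates (for instances with at least one rewarded rollout) and SFT updates (for instances where all rollouts failed).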

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. **Simplicity and Effectiveness:** SuperRL's core concept, a conditional switch between RL and SFT based on the observed reward signal, is elegantly simple. It requires minimal hyperparameter tuning and avoids the complexity of multi-stage pipelines or manually interpolated loss functions, which significantly enhances reproducibility and scalability.
2. **Targeted Solution to the Sparsity Problem:** The framework directly addresses the primary challenge of applying RL in reasoning tasks: sparse

Weaknesses

1. **Missing Crucial Baseline Comparison:** The paper lacks a comparison to a conceptually simple, yet potentially competitive, baseline method. Specifically, an ablation where, after SFT and during the RL phase, the model simply drops or masks any trajectories/prompts that yield zero reward/advantage (i.e., treating them as noise and performing the RL update only on rewarded trajectories) should be included. This comparison is necessary to demonstrate that the benefit of SuperRL comes specifica

Reviewer 02 · Rating 6 · Confidence 4

Strengths

The strengths of the paper are outlined below:
1. The paper proposes a simple approach to unify SFT and RL by using zero reward or zero advantage as a switching signal.
2. The paper proposes two fallback mechanisms for triggering SFT during RL training, based on advantage and reward respectively: SuperRL-A and SuperRL-R. The authors validate both mechanisms and provide empirical guidelines for choosing one over the other.
3. SuperRL experimental results demonstrate strong performance both for in-do

Weaknesses

The weaknesses of the paper can be summarized as follows:
1. It is unclear how the trajectories for SFT are selected when the advantage or reward approaches zero. Are the samples with zero advantage or reward directly used for SFT?
2. SFT typically requires high-quality samples. If trajectories with zero reward are used for SFT, how can they be considered high-quality?
3. In terms of stability, SuperRL’s improvement is modest. Although there are some reductions in variance, the change in rang

Reviewer 03 · Rating 4 · Confidence 1

Strengths

This is a well-written paper that is accessible to experts in adjacent fields. The hypothesis of the paper is clear and well motivated. The performance of the proposed approach is evaluated and compared against a selection of alternative methods. The results indicate some improvements over alternative methods on different benchmarks. The evaluation shows that the results are comparable to state-of-the-art models.

Weaknesses

The paper claims that a “unified training framework” is proposed for switching between RL and SFT. However, the methodology in Section 3 is limited to the extrema of zero reward vs. non-zero reward. This results in a heuristic switching mechanism between the two paradigms. While this switching mechanism is reasonably well justified, it is unclear to what extent it enables a “unified framework that adaptively combines” RL and SFT. While the paper is very easy to read and the results are intuiti

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques