Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Linghao Zhu; Yiran Guan; Dingkang Liang; Jianzhong Ju; Zhenbo Luo; Bin Qin; Jian Luan; Yuliang Liu; Xiang Bai

arXiv:2508.05612·cs.LG·March 4, 2026

TL;DR

Shuffle-R1 is a novel RL framework that enhances training efficiency for multimodal large language models by dynamically restructuring data sampling and batch composition, addressing key issues like advantage collapsing and rollout silencing.

Contribution

It introduces data-centric techniques including pairwise trajectory sampling and advantage-based shuffling to improve gradient signals and training efficiency in RL fine-tuning of MLLMs.

Findings

01. Outperforms strong RL baselines on reasoning benchmarks

02. Achieves higher gradient quality with minimal overhead

03. Demonstrates improved long-term learning efficiency

Abstract

Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address them, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2)…
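
The two issues and the pairing idea named in the abstract can be illustrated with a minimal sketch. It assumes GRPO-style advantages (rewards standardized within each prompt's rollout group). All function names, the `tol` threshold, and the "i-th highest with i-th lowest" pairing rule are hypothetical choices made for illustration; they are one plausible reading of "high-contrast trajectories", not the authors' confirmed procedure.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def near_zero_fraction(advantages, tol=0.1):
    """Share of advantages with magnitude below tol.

    A hypothetical proxy for Advantage Collapsing; tol is an assumed value.
    """
    return sum(abs(a) < tol for a in advantages) / len(advantages)


def silent_fraction(reward_groups):
    """Share of rollouts that sit in groups with identical rewards.

    Such groups have zero advantage everywhere, so they contribute no
    gradient: a hypothetical proxy for Rollout Silencing.
    """
    total = sum(len(g) for g in reward_groups)
    silent = sum(len(g) for g in reward_groups if max(g) == min(g))
    return silent / total


def pairwise_select(advantages, keep_pairs):
    """Pair the i-th highest with the i-th lowest advantage, keep the top pairs.

    Pairs come out ordered by decreasing contrast, so slicing keeps the
    highest-contrast ones. Returns (high_index, low_index) tuples.
    """
    order = sorted(range(len(advantages)),
                   key=lambda i: advantages[i], reverse=True)
    pairs = [(order[i], order[-1 - i]) for i in range(len(order) // 2)]
    return pairs[:keep_pairs]


if __name__ == "__main__":
    rewards = [1.0, 0.8, 0.5, 0.5, 0.5, 0.5, 0.2, 0.0]  # made-up rewards
    adv = group_advantages(rewards)
    print("near-zero share:", near_zero_fraction(adv))
    print("kept pairs:", pairwise_select(adv, keep_pairs=2))
    print("silent share:",
          silent_fraction([[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0]]))
```

In this toy group, half of the rollouts have an advantage of exactly zero, and keeping only the two highest-contrast pairs discards them, which is the intuition behind selecting trajectories with large advantages.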

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

The strengths of the paper are:
- Simple, practical method that is easy to implement on top of GRPO and addresses a real pain point in RL fine-tuning: many trajectories are statistically uninformative.
- Strong results, shown to be effective across a good range of in-domain and out-of-domain benchmarks.
- Detailed experiments with ablation studies.

Weaknesses

The weaknesses of the paper are:
- Theoretical analysis of bias/variance under selective sampling is limited; unbiasedness is not proven.
- Some clarifications and ablation studies are missing.

Reviewer 02·Rating 6·Confidence 4

Strengths

- Diagnoses Advantage Collapsing and Rollout Silencing in MLLM-RL; proposes contrastive pairing plus an advantage-weighted reshuffle instead of larger rollouts or reward redesign.
- Extensive ablations (α, S, PTS variants), 8 datasets, 2 model scales; statistically significant gains; extends to LLMs; code and pseudo-code provided.

Weaknesses

- While the empirical results are strong, the paper lacks formal analysis or theoretical justification for why PTS and ABS improve training dynamics. For example, it would be helpful to show (even intuitively) how contrastive sampling improves gradient variance or convergence rates.
- While Shuffle-R1 outperforms GRPO, DAPO, and GSPO, it does not compare with other data-centric RL methods such as curriculum-based sampling, which are relevant to the idea of reusing or reweighting data. A short di

Reviewer 03·Rating 6·Confidence 3

Strengths

1. The paper pinpoints two concrete and observable issues, Advantage Collapsing and Rollout Silencing, which intuitively explain why current RL pipelines waste computation and fail to leverage informative signals. This diagnostic perspective is well-motivated.
2. Instead of modifying the reward model or policy objective, Shuffle-R1 improves RL efficiency purely from the data side through Pairwise Trajectory Sampling (PTS) and Advantage-based Batch Shuffle (ABS). Both modules are lightweight,

Weaknesses

1. **Overfitting to high-advantage samples** Since ABS repeatedly exposes high-value trajectories, the framework might bias the model toward a narrower distribution of "reward-dense" samples, reducing exploration and long-term diversity.
2. **Scope of benchmarks** Most experiments are on math or visual reasoning tasks; while results are strong, these domains already have dense reward signals. It remains unclear whether Shuffle-R1 would bring similar benefits on tasks with sparse or noisy rewar
