Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle
Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai

TL;DR
Shuffle-R1 is a novel RL framework that enhances training efficiency for multimodal large language models by dynamically restructuring data sampling and batch composition, addressing key issues like advantage collapsing and rollout silencing.
Contribution
It introduces data-centric techniques including pairwise trajectory sampling and advantage-based shuffling to improve gradient signals and training efficiency in RL fine-tuning of MLLMs.
Findings
Outperforms strong RL baselines on reasoning benchmarks
Achieves higher gradient quality with minimal overhead
Demonstrates improved long-term learning efficiency
Abstract
Reinforcement learning (RL) has emerged as an effective post-training paradigm for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, current RL pipelines often suffer from training inefficiencies caused by two underexplored issues: Advantage Collapsing, where most advantages in a batch concentrate near zero, and Rollout Silencing, where the proportion of rollouts contributing non-zero gradients diminishes over time. These issues lead to suboptimal gradient updates and hinder long-term learning efficiency. To address them, we propose Shuffle-R1, a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces (1) Pairwise Trajectory Sampling, which selects high-contrast trajectories with large advantages to improve gradient signal quality, and (2)…
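The two diagnosed issues and the pairing idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes GRPO-style group-normalized advantages, and the function names (`group_advantages`, `silent_fraction`, `pair_by_contrast`) and the max-min pairing rule are hypothetical choices made for illustration only.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def silent_fraction(advantages, tol=1e-8):
    """Share of rollouts whose advantage is numerically zero (no gradient contribution)."""
    return sum(abs(a) <= tol for a in advantages) / len(advantages)


def pair_by_contrast(advantages, keep_pairs):
    """Assumed pairing rule: match the k-th largest advantage with the k-th smallest,
    then keep the `keep_pairs` pairs with the widest gap. Returns index pairs."""
    order = sorted(range(len(advantages)), key=lambda i: advantages[i], reverse=True)
    # Pairs come out ordered by decreasing gap, so truncation keeps the widest ones.
    pairs = [(order[k], order[-1 - k]) for k in range(len(order) // 2)]
    return pairs[:keep_pairs]


if __name__ == "__main__":
    # A group where every rollout gets the same reward is fully "silent".
    uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
    print("silent fraction (uniform rewards):", silent_fraction(uniform))

    # A group with graded rewards: mid-reward rollouts sit near zero advantage.
    graded = group_advantages([0.9, 0.1, 0.5, 0.5, 1.0, 0.0])
    print("silent fraction (graded rewards):", silent_fraction(graded))
    print("high-contrast pairs:", pair_by_contrast(graded, keep_pairs=2))
```

In this sketch the mid-reward rollouts are the ones dropped, which mirrors the abstract's point that near-zero advantages carry little gradient signal; how the paper actually selects and weights pairs may differ.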
Peer Reviews
Decision·ICLR 2026 Poster
The strengths of the paper are:
- A simple, practical method that is easy to implement on top of GRPO and addresses a real pain point in RL fine-tuning: many trajectories are statistically uninformative.
- Strong results that hold across a broad set of in-domain and out-of-domain benchmarks.
- Detailed experiments with ablation studies.
The weaknesses of the paper are:
- Theoretical analysis of bias/variance under selective sampling is limited; unbiasedness is not proven.
- Some clarifications and ablation studies are missing.
- Diagnoses Advantage Collapsing and Rollout Silencing in MLLM-RL; proposes contrastive pairing plus an advantage-weighted reshuffle instead of larger rollouts or reward redesign.
- Extensive ablations (α, S, PTS variants), 8 datasets, 2 model scales; gains are statistically significant; extends to LLMs; code and pseudo-code provided.
- While the empirical results are strong, the paper lacks formal analysis or theoretical justification for why PTS and ABS improve training dynamics. For example, it would be helpful to show (even intuitively) how contrastive sampling improves gradient variance or convergence rates.
- While Shuffle-R1 outperforms GRPO, DAPO, and GSPO, it does not compare with other data-centric RL methods such as curriculum-based sampling, which are relevant to the idea of reusing or reweighting data. A short di
1. The paper pinpoints two concrete and observable issues, Advantage Collapsing and Rollout Silencing, which intuitively explain why current RL pipelines waste computation and fail to leverage informative signals. This diagnostic perspective is well-motivated.
2. Instead of modifying the reward model or policy objective, Shuffle-R1 improves RL efficiency purely from the data side through Pairwise Trajectory Sampling (PTS) and Advantage-based Batch Shuffle (ABS). Both modules are lightweight,
1. **Overfitting to high-advantage samples** Since ABS repeatedly exposes high-value trajectories, the framework might bias the model toward a narrower distribution of "reward-dense" samples, reducing exploration and long-term diversity.
2. **Scope of benchmarks** Most experiments are on math or visual reasoning tasks; while results are strong, these domains already have dense reward signals. It remains unclear whether Shuffle-R1 would bring similar benefits on tasks with sparse or noisy rewar
