VADE: Variance-Aware Dynamic Sampling via Online Sample-Level Difficulty Estimation for Multimodal RL

Zengjie Hu; Jiantao Qiu; Tianyi Bai; Haojin Yang; Binhang Yuan; Qi Jing; Conghui He; Wentao Zhang

arXiv:2511.18902 · cs.LG · November 25, 2025

TL;DR

VADE introduces an online, variance-aware dynamic sampling method that improves training efficiency and effectiveness in multimodal reinforcement learning by selecting the most informative samples in real time.

Contribution

VADE combines online sample-level difficulty estimation, Thompson sampling, and prior decay to select informative training samples without extra rollout costs.
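
The page names these three ingredients but not their update rules. The following is a hypothetical sketch of how they could fit together, not the authors' implementation. It assumes binary rewards and keeps a Beta posterior over each training sample's success rate. Thompson sampling draws a plausible success rate `p` for each sample and ranks samples by the Bernoulli reward variance `p * (1 - p)`. A decay factor (the illustrative `decay` parameter) shrinks stale evidence toward the prior, so the estimates can follow a policy that keeps changing. The class `BetaDifficultyTracker` and its methods are illustrative names.

```python
import random


class BetaDifficultyTracker:
    """Hypothetical per-sample difficulty estimate: Beta(alpha, beta) over success rate."""

    def __init__(self, num_samples, decay=0.9):
        self.decay = decay  # hypothetical prior-decay factor in (0, 1]
        # Every sample starts at the uniform prior Beta(1, 1).
        self.alpha = [1.0] * num_samples
        self.beta = [1.0] * num_samples

    def select(self, k, rng):
        """Thompson sampling: draw a success rate per sample, rank by p * (1 - p)."""
        scored = []
        for i, (a, b) in enumerate(zip(self.alpha, self.beta)):
            p = rng.betavariate(a, b)
            # Bernoulli reward variance; it is zero at p = 0 or p = 1, where groups collapse.
            scored.append((p * (1.0 - p), i))
        scored.sort(reverse=True)
        return [i for _, i in scored[:k]]

    def update(self, idx, successes, group_size):
        """Fold in counts from rollouts already generated for training (no extra rollouts)."""
        # Shrink old evidence toward the Beta(1, 1) prior, then add the new counts.
        self.alpha[idx] = 1.0 + self.decay * (self.alpha[idx] - 1.0) + successes
        self.beta[idx] = (
            1.0 + self.decay * (self.beta[idx] - 1.0) + (group_size - successes)
        )


rng = random.Random(0)
tracker = BetaDifficultyTracker(num_samples=3)
# Sample 0 is always solved, sample 1 never, sample 2 about half the time.
for _ in range(20):
    tracker.update(0, successes=8, group_size=8)
    tracker.update(1, successes=0, group_size=8)
    tracker.update(2, successes=4, group_size=8)
print(tracker.select(k=1, rng=rng))
```

With these counts, the posteriors for samples 0 and 1 concentrate near success rates of 1 and 0, so their draws score close to zero. Thompson sampling should therefore almost always pick sample 2, whose success rate sits near 0.5. Because the updates only reuse reward counts from training rollouts, this kind of estimator adds no rollout cost.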

Findings

01. Outperforms baselines on multimodal reasoning benchmarks
02. Achieves higher sample efficiency and training performance
03. Reduces computational overhead significantly

Abstract

Group-based policy optimization methods like GRPO and GSPO have become standard for training multimodal models, leveraging group-wise rollouts and relative advantage estimation. However, they suffer from a critical gradient vanishing problem when all responses within a group receive identical rewards, causing advantage estimates to collapse and training signals to diminish. Existing attempts to mitigate this issue fall into two paradigms: filtering-based and sampling-based methods. Filtering-based methods first generate rollouts broadly and then retroactively filter out uninformative groups, leading to substantial computational overhead. Sampling-based methods proactively select effective samples before rollout but rely on static criteria or prior dataset knowledge, lacking real-time adaptability. To address these issues, we propose VADE, a…
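
The following is a minimal sketch of the gradient vanishing problem. It uses a GRPO-style group-normalized advantage and assumes binary 0/1 rewards. It also assumes a population standard deviation with a small `eps` in the denominator; implementations differ on these details. The helper `degenerate_group_prob` is an illustrative addition: if the policy solves a prompt with probability `p`, then a group of `G` independent rollouts gets identical rewards with probability p^G + (1 - p)^G.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (a convention choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


def degenerate_group_prob(p, group_size):
    """Chance that all binary rewards in a group agree (all 1s or all 0s)."""
    return p ** group_size + (1.0 - p) ** group_size


print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed group: nonzero signal
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # all correct: every advantage is 0
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # all wrong: every advantage is 0
print(degenerate_group_prob(0.9, 8))  # an easy prompt collapses in roughly 43% of groups
```

In a collapsed group, every response has zero advantage, so that prompt contributes no policy-gradient signal. Prompts whose success rates are near 0 or 1 waste rollouts this way most often.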

Code & Models

Datasets

FloSophorae/VADE-Dataset · dataset · 37 downloads

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)