DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
Haowen Gao, Zhenyu Zhang, Liang Pang, Fangda Guo, Hongjian Dou, Guannan Lv, Shaoguo Liu, Tingting Gao, Huawei Shen, Xueqi Cheng

TL;DR
DIVA-GRPO introduces a difficulty-adaptive approach to improve reinforcement learning for multimodal reasoning models by dynamically adjusting variant difficulty, addressing reward sparsity and advantage vanishing issues, and enhancing training stability and performance.
Contribution
The paper proposes DIVA-GRPO, a method that adaptively adjusts variant difficulty levels in group relative policy optimization (GRPO) to improve the training of reasoning models.
Findings
Outperforms existing methods on six reasoning benchmarks.
Improves training efficiency and reasoning accuracy.
Reduces reward sparsity and advantage vanishing issues.
Abstract
Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates…
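The advantage-vanishing problem described above can be illustrated with a minimal sketch. It assumes the commonly used GRPO formulation in which each response's advantage is its reward standardized within the sampled group; the function name, the `eps` constant, and the binary rewards are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses.

    Assumed formulation: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes: correct responses get positive advantages, wrong ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Overly hard problem (all wrong) or overly easy problem (all right):
# every reward equals the group mean, so every advantage is zero.
too_hard = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(too_hard)
print(too_easy)
```

Under this formulation, a group whose rewards are all identical contributes no learning signal, which is why keeping within-group reward variance matters for optimization.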
Peer Reviews
Decision: ICLR 2026 Poster
1. The method is thoroughly developed, with detailed algorithms, theoretical analysis (e.g., variance reduction and convergence), and extensive empirical validation across multiple benchmarks.
2. The idea of dynamically adjusting problem difficulty and generating semantically consistent variants is well-motivated.
1. The method introduces new hyperparameters (e.g., difficulty scaling factor k, learning rate η) that lack robust ablation or automatic tuning. This may hinder out-of-the-box usability and necessitate per-task calibration.
2. The training flow illustrated in Figure 2 is somewhat confusing. A clearer depiction of how variants are sampled, advantages computed, and the policy updated would improve understanding.
3. The difficulty of each query is updated after every epoch, but the total number of
Method is clearly presented and well-motivated: The paper crisply identifies advantage vanishing and reward sparsity in GRPO, then designs a pipeline that directly targets these failure modes.
Thorough experiments and diagnostics: Results span six benchmarks with ablations and a neat speedup study vs GRPO; figures/tables are informative.
Generalizable component (RRB): The RRB trick improves vanilla GRPO too, suggesting portability beyond this specific framework.
Unclear significance under matched baselines: Gains over strong GRPO variants are modest or inconsistent, and baseline setups do not appear strictly aligned.
Scope and transfer are not well positioned: It is unclear whether the difficulty-adaptive variant + local/global advantage scheme is truly first for MLLMs, how it relates to similar ideas, and whether the recipe generalizes to text-only GRPO without images.
Figure interpretation ambiguity: The GRPO training curves in Fig. 3(b) vs. Fig. 3(
1. The paper is generally well-written and easy to follow, with a clear description of the method.
2. The paper provides intuitive visual demonstrations that aid understanding.
1. **Clarify local vs. global advantage magnitudes.** In `Line 263-264`, the paper argues that because global advantages are computed over `m×k` samples, their magnitudes differ from local advantages. However, in `Line 250-252` the global advantage is also normalized, which should mitigate raw magnitude discrepancies. Moreover, the subsequent difficulty-weighted scaling implies the effective magnitude of the global term should also depend on the direction and strength of the difficulty adjustmen
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Topic Modeling
