Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization

Jinghan Li; Junfeng Fang; Jinda Lu; Yuan Wang; Xiaoyan Guo; Tianyu Zhang; Xiang Wang; Xiangnan He

arXiv:2602.21743·cs.CV·February 27, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Durian, a difficulty-aware group normalization method for GRPO-style training of multimodal large language models. It makes advantage normalization less sensitive to extreme samples and improves accuracy on multimodal reasoning benchmarks.

Contribution

The paper proposes a normalization technique that re-groups samples by difficulty level, where difficulty combines perceptual complexity and reasoning uncertainty, to improve training stability and reasoning ability in multimodal large language models.

Findings

01

Significant performance improvements on multimodal reasoning benchmarks.

02

Durian reduces sensitivity to extreme samples and stabilizes training.

03

Durian preserves intra-group distinctions while improving reasoning.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Group Relative Policy Optimization (GRPO) have significantly advanced the reasoning capabilities of large language models. Extending these methods to multimodal settings, however, faces a critical challenge: the instability of std-based normalization, which is easily distorted by extreme samples with nearly positive or negative rewards. Unlike pure-text LLMs, multimodal models are particularly sensitive to such distortions, as both perceptual and reasoning errors influence their responses. To address this, we characterize each sample by its difficulty, defined through perceptual complexity (measured via visual entropy) and reasoning uncertainty (captured by model confidence). Building on this characterization, we propose difficulty-aware group normalization (Durian), which re-groups samples by difficulty levels and shares the std…
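The instability described in the abstract can be illustrated with a small sketch. It contrasts the usual GRPO-style per-group normalization, (reward - group mean) / group std, with a variant in which groups of similar difficulty share one std. The abstract is truncated before it specifies how Durian shares the std, so the binning rule, the pooled-std formula, and the names `shared_std_advantages`, `difficulty`, and `n_bins` are assumptions made for illustration, not the authors' implementation.

```python
from math import sqrt
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Per-group normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shared_std_advantages(groups, difficulty, n_bins=2, eps=1e-6):
    """Hypothetical variant: center each group on its own mean, then divide
    by a std pooled over all groups that fall in the same difficulty bin.

    groups     -- list of reward lists, one list per prompt
    difficulty -- one score in [0, 1] per prompt (assumed to be given)
    """
    # Equal-width difficulty bins (an assumption, not taken from the paper).
    bins = [min(int(d * n_bins), n_bins - 1) for d in difficulty]
    centered = [[r - mean(g) for r in g] for g in groups]
    shared = {}
    for b in set(bins):
        # Pool the centered rewards of every group assigned to bin b.
        devs = [c for cg, gb in zip(centered, bins) if gb == b for c in cg]
        shared[b] = sqrt(sum(c * c for c in devs) / len(devs))
    return [[c / (shared[b] + eps) for c in cg]
            for cg, b in zip(centered, bins)]


# A nearly all-positive group next to a balanced group of similar difficulty.
extreme = [1, 1, 1, 1, 1, 1, 1, 0]
balanced = [1, 0, 1, 0, 1, 0, 1, 0]

per_group = grpo_advantages(extreme)
shared = shared_std_advantages([extreme, balanced], difficulty=[0.2, 0.3])
print(per_group[-1], shared[0][-1])
```

In the nearly all-positive group the per-group std is small, so the single failed rollout receives a large negative advantage. Pooling the std with the balanced group shrinks that magnitude while the ordering of rollouts within each group is unchanged.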

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper presents a well-motivated approach to improve GRPO stability by dividing samples into difficulty-based groups and computing group-specific normalization factors.
2. The use of image entropy to quantify perceptual difficulty is an insightful way to model visual reasoning complexity.

Weaknesses

I appreciate the work. Here are my comments for further improvement:
1. The paper's core premise, that std-based group normalization is highly sensitive to extreme samples, lacks concrete empirical evidence. The authors should analyze when and how such cases occur, as infrequent occurrences might naturally average out during training, reducing the claimed impact of this issue.
2. The use of image entropy as a proxy for perceptual reasoning difficulty is not clearly justified. While entropy of eigenva

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper is overall clear and easy to follow.
- To the reviewer's knowledge, the proposed difficulty-based regrouping method is novel and conceptually straightforward.

Weaknesses

- The definition of perceptual difficulty is not well validated. Why are images with higher entropy of eigenvalues of their image features considered perceptually harder? Moreover, why is perceptual difficulty assumed to be LLM-agnostic? Similarly, for reasoning difficulty, why is entropy used instead of a more direct measure such as average accuracy?
- Lacks sufficient ablation analysis.
- How sensitive are the results to the weighting coefficients of different normalized advantages?
-
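For readers unfamiliar with the quantity the reviewers question, the following is a minimal sketch of an "entropy of eigenvalues" score. The page does not give the paper's feature extractor or exact formula, so this assumes plain Shannon entropy over eigenvalues normalized to sum to 1. The function name `shannon_entropy` and the two example spectra are hypothetical.

```python
from math import log


def shannon_entropy(weights):
    """Entropy (in nats) of non-negative weights normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    # Zero weights contribute nothing to the entropy and are skipped.
    probs = [w / total for w in weights if w > 0]
    return sum(-p * log(p) for p in probs)


# Hypothetical eigenvalue spectra of an image-feature covariance matrix.
concentrated = [4.0, 0.0, 0.0, 0.0]  # one dominant direction
spread = [1.0, 1.0, 1.0, 1.0]        # variance spread evenly

print(shannon_entropy(concentrated))
print(shannon_entropy(spread))
```

Under this assumption, a spectrum dominated by one eigenvalue scores near zero and an evenly spread spectrum scores the maximum, log(n). Whether a higher score actually corresponds to a perceptually harder image is the point the reviewers ask the authors to validate.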

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- This paper identifies the abnormal impact of variance on advantage estimation in MLLM reinforcement learning when the number of rollouts is insufficient, and proposes several mitigation strategies.
- It achieves accuracy improvements on MathVista, MathVision, and WeMath benchmarks.
- It attains comparable or superior performance to prior work while using only 2K training samples.

Weaknesses

- The core idea of the paper is to address the bias in reward normalization under a limited number of rollouts by combining batch-level processing with difficulty grouping. However, it does not compare with methods that directly modify the normalization strategy, such as batch-level normalization (e.g., Reinforce++) or no normalization (e.g., Dr.GRPO).
- It remains unclear how performance differs when the number of rollouts increases or decreases (e.g., from 8 to 32 or 2), where the variance eff

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems