Modality-Balancing Preference Optimization of Large Multimodal Models by Adversarial Negative Mining
Chenxi Liu, Tianyi Xiong, Yanshuo Chen, Ruibo Chen, Yihan Wu, Junfeng Guo, Tianyi Zhou, Heng Huang

TL;DR
This paper introduces MBPO, a novel framework that balances modality in large multimodal models by generating adversarial negatives and using hybrid training, leading to improved reasoning and reduced hallucinations.
Contribution
It proposes a new preference optimization method that balances modalities in LMMs by combining adversarial negative mining with online reward verification, enhancing performance and robustness.
Findings
Improved performance on vision-language tasks.
Reduced hallucinations in LMMs.
Effective modality balancing demonstrated.
Abstract
The task adaptation and alignment of Large Multimodal Models (LMMs) have been significantly advanced by instruction tuning and further strengthened by recent preference optimization. Yet, most LMMs still suffer from severe modality imbalance during reasoning, i.e., language prior biases outweigh visual inputs, which bottlenecks their generalization to downstream tasks and causes hallucinations. However, existing preference optimization approaches for LMMs do not focus on restraining the internal biases of their Large Language Model (LLM) backbones when curating the training data. Moreover, they heavily rely on offline data and lack the capacity to explore diverse responses adaptive to dynamic distributional shifts during training. Meanwhile, Group Relative Policy Optimization (GRPO), a recent method using online-generated data and verified rewards to improve reasoning…
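The abstract refers to GRPO's use of online-generated responses and verified rewards. As a rough illustration of the group-relative part only, the sketch below normalizes each sampled response's reward against the other responses drawn for the same prompt. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for this example; this is not the paper's training code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    rewards: verified rewards for several responses to ONE prompt.
    eps: small constant (an assumption here) that avoids division by
         zero when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one closed-ended question,
# where reward 1.0 means the answer was verified as correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([1.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, while a group whose rewards are all equal yields zeros.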
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear Motivation and Problem Definition: The paper convincingly identifies modality imbalance, where language priors overshadow visual information, as a critical limitation in current LMMs. The introduction of the Image Information Gain (IIG) metric to quantify visual contribution is conceptually elegant and practically useful. 2. Innovative Adversarial Negative Mining: Instead of random perturbations, MBPO employs adversarial attacks to construct meaningful hard negatives. This strategy effecti
1. Ablation and Parameter Analysis Limitations: While Table 2 and Table 3 do present several ablations, some aspects remain under-explored. For example, it’s unclear whether the observed gains predominantly stem from data curation (selecting better/harder negatives), the adversarial perturbation technique, or the hybrid GRPO objective. More fine-grained ablations, including IIG score cutoff sensitivity and adversarial attack hyperparameters, would help clarify these contributions and potential c
1. **Key Idea is not hard to understand.** The *adversarial-negative mining* pipeline is used to construct preference pairs that penalize language-only shortcuts, while GRPO supplies online, distribution-adaptive, verifiable signals that are robust to reward hacking on free-form answers. 2. **Positioning vs. preference learning.** The paper rightly notes that **DPO** can under-use images in multimodal settings, and that mDPO explicitly tackles “unconditional preference” effects.
1. **Unclear Motivation & Presentation.** The motivation is weird to me. The claim that current LMMs can neglect visual information is acceptable, but what is the point of introducing GRPO? Why is an online-offline mixed mode needed for preference optimization given your claim? Also, the writing and presentation of implementation details throughout the paper are not clear. 2. **Attribution of gains across components is under-specified.** The paper should isolate and quantify the marginal
* The manuscript is well written and easy to follow, with clear technical implementation details, and the data and models used are openly available, facilitating reproducibility. * The proposed method seems to produce consistent and non-trivial improvements over relatively strong (i.e., already relatively well-aligned) baselines (~+1% on Qwen2-VL-7B and ~+1.2% on Qwen2.5-VL-7B), despite minimal data requirements.
* Results in Tables 1 and 2 report “averages” which appear to actually be sums. Additionally, MME and MMHal* need to be normalized to 0-100 (instead of 0-2000 and 0-6) to allow for proper averaging. * Most prior alignment work, including the works cited in this paper, reports results for aligning LLaVA 1.5, which allows more direct comparison between methods. In this work, the authors only report results on the more recent Qwen2 and Qwen2.5 base models. For direct comparison one must therefore re
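The normalization point raised in this review can be made concrete with a small sketch. It rescales each benchmark to 0-100 using the ranges the reviewer names (0-2000 for MME, 0-6 for MMHal) and then takes a mean rather than a sum. The raw scores and the `OtherBench` entry are hypothetical values chosen for illustration, not results from the paper.

```python
def normalize(score, low, high):
    """Linearly rescale a score from [low, high] to [0, 100]."""
    return 100.0 * (score - low) / (high - low)


# (raw score, range minimum, range maximum); all raw scores are made up.
raw = {
    "MME": (1500.0, 0.0, 2000.0),
    "MMHal": (3.0, 0.0, 6.0),
    "OtherBench": (70.0, 0.0, 100.0),  # hypothetical benchmark already on 0-100
}

normalized = {name: normalize(*values) for name, values in raw.items()}

# A sum of raw scores is dominated by whichever benchmark has the largest range.
raw_sum = sum(score for score, _, _ in raw.values())

# A mean of normalized scores weights every benchmark equally.
average = sum(normalized.values()) / len(normalized)
```

With these made-up numbers the raw sum is dominated almost entirely by MME, whereas the normalized mean gives each benchmark the same weight.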
1) Modality imbalance is a serious and often-overlooked issue in multimodal alignment. The paper targets it directly with a concrete and well-designed optimization framework. 2) Using adversarial image perturbations to expose the model’s overreliance on text priors is an elegant way to produce meaningful “hard negatives.” The Image Information Gain (IIG) metric is a nice touch for quantifying visual grounding. 3) The experiments are extensive, covering both general vision-language understandin
1) The online dataset only uses around 2k closed-ended samples, which feels modest. It would be interesting to see whether scaling this part further would continue to improve results or plateau. 2) The IIG metric is intuitive but could benefit from more analysis. For instance, how correlated is IIG with human judgments of visual grounding?
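The reviews describe IIG only as a measure of how much the image contributes to a response, and the formula itself does not appear on this page. One plausible reading, assumed here purely for illustration and possibly different from the paper's definition, is the gain in response log-likelihood when the image is provided. The function names and the token log-probabilities below are hypothetical.

```python
def sequence_log_prob(token_log_probs):
    """Sum per-token log-probabilities into a sequence log-likelihood."""
    return sum(token_log_probs)


def image_information_gain(logp_with_image, logp_without_image):
    """ASSUMED definition: log p(y | text, image) - log p(y | text).

    Both arguments are per-token log-probabilities of the SAME response,
    scored once with the image and once without it (or with a degraded one).
    """
    return sequence_log_prob(logp_with_image) - sequence_log_prob(logp_without_image)


# Hypothetical per-token log-probs for a two-token response.
with_image = [-0.1, -0.2]
without_image = [-1.0, -1.5]
gain = image_information_gain(with_image, without_image)
```

Under this assumed definition, a gain near zero would suggest the response is driven mostly by the language prior, which is the kind of sample the reviewers describe as weakly grounded.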
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
MethodsFocus
