Information Router for Mitigating Modality Dominance in Vision-Language Models
Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

TL;DR
This paper introduces MoIR, a multi-modal information routing method that explicitly balances modality contributions by enriching token representations, improving robustness and performance in vision-language models under modality imbalance.
Contribution
MoIR is a novel information-level fusion technique that reduces modality disparity before fusion, enabling better handling of degraded or unbalanced modalities in vision-language models.
Findings
MoIR achieves more balanced modality contributions across benchmarks.
It improves robustness and downstream performance, especially when one modality is degraded.
Experimental results confirm the effectiveness of explicit information modification in multi-modal reasoning.
Abstract
Vision-language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses; it cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR (Multi-modal Information Router), an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR…
