When Language Overrules: Revealing Text Dominance in Multimodal Large Language Models
Huyu Wu, Meng Tang, Xinhan Zheng, Haiyun Jiang

TL;DR
This paper systematically investigates text dominance in multimodal large language models, identifies its causes, and proposes a token compression method to rebalance attention across modalities.
Contribution
It introduces the first comprehensive analysis of text dominance across multiple modalities and proposes a simple token compression technique to mitigate this imbalance.
Findings
Text dominance is significant across all tested modalities.
Token redundancy and architecture influence modality imbalance.
Token compression effectively reduces text dominance in models.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities across a diverse range of multimodal tasks. However, these models suffer from a core problem known as text dominance: they depend heavily on text for their inference, while underutilizing other modalities. Prior work has acknowledged this phenomenon in vision-language tasks, often attributing it to data biases or model architectures. In this paper, we conduct the first systematic investigation of text dominance across diverse data modalities, including images, videos, audio, time-series, and graphs. To measure this imbalance, we propose two evaluation metrics: the Modality Dominance Index (MDI) and the Attention Efficiency Index (AEI). Our comprehensive analysis reveals that text dominance is both significant and pervasive across all tested modalities. Our in-depth analysis identifies three…
