VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization
Mingxiao Li, Na Su, Fang Qu, Zhizhou Zhong, Ziyang Chen, Yuan Li, Zhaopeng Tu, Xiaolong Li

TL;DR
This paper identifies limitations in current multimodal large language models' modality alignment and introduces VISTA, a method that explicitly maximizes cross-modal mutual information to improve vision-text alignment without extra training data.
Contribution
The paper provides an information-theoretic analysis of existing loss functions and proposes VISTA, a novel approach that enhances vision-text alignment by maximizing mutual information.
Findings
VISTA significantly improves performance on multiple benchmarks.
Theoretical analysis reveals limitations of cross-entropy loss in modality alignment.
VISTA enhances visual understanding without additional training modules.
Abstract
Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
