VISTA: Enhancing Vision-Text Alignment in MLLMs via Cross-Modal Mutual Information Maximization

Mingxiao Li; Na Su; Fang Qu; Zhizhou Zhong; Ziyang Chen; Yuan Li; Zhaopeng Tu; Xiaolong Li

arXiv:2505.10917 · cs.CV · May 20, 2025

TL;DR

This paper identifies limitations in current multimodal large language models' modality alignment and introduces VISTA, a method that explicitly maximizes cross-modal mutual information to improve vision-text alignment without extra training data.

Contribution

The paper provides an information-theoretic analysis of existing loss functions and proposes VISTA, a novel approach that enhances vision-text alignment by maximizing mutual information.

Findings

01. VISTA significantly improves performance on multiple benchmarks.

02. Theoretical analysis reveals limitations of cross-entropy loss in modality alignment.

03. VISTA enhances visual understanding without additional training modules.

Abstract

Current multimodal large language models (MLLMs) face a critical challenge in modality alignment, often exhibiting a bias towards textual information at the expense of other modalities like vision. This paper conducts a systematic information-theoretic analysis of the widely used cross-entropy loss in MLLMs, uncovering its implicit alignment objective. Our theoretical investigation reveals that this implicit objective has inherent limitations, leading to a degradation of cross-modal alignment as text sequence length increases, thereby hindering effective multimodal information fusion. To overcome these drawbacks, we propose Vision-Text Alignment (VISTA), a novel approach guided by our theoretical insights. VISTA introduces an explicit alignment objective designed to maximize cross-modal mutual information, preventing the degradation of visual alignment. Notably, VISTA enhances the…
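The abstract is cut off before it states VISTA's actual objective, so the following is only a generic illustration of the quantities it mentions, not the paper's method. It uses the standard identity I(V;T) = H(T) - H(T|V) and the fact that a model's conditional cross-entropy is an upper bound on H(T|V). The toy joint distribution and all function names are assumptions made for this sketch.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(V;T) in nats for a joint table with rows = visual state v, columns = text t."""
    p_v = [sum(row) for row in joint]
    p_t = [sum(col) for col in zip(*joint)]
    flat = [p for row in joint for p in row]
    return entropy(p_v) + entropy(p_t) - entropy(flat)

def conditional_cross_entropy(joint, q_t_given_v):
    """E_{p(v,t)}[-log q(t|v)] in nats for a model q; an upper bound on H(T|V)."""
    return -sum(
        p * math.log(q_t_given_v[i][j])
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )

# Hypothetical toy joint distribution p(v, t): text correlates with the image.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
p_t = [sum(col) for col in zip(*joint)]
mi = mutual_information(joint)

# Model A: the exact conditional p(t|v), i.e. it uses the image fully.
true_cond = [[p / sum(row) for p in row] for row in joint]
# Model B: ignores the image and predicts the text marginal p(t) for every v.
text_only = [list(p_t) for _ in joint]

# H(T) minus cross-entropy is a lower bound on I(V;T).
bound_exact = entropy(p_t) - conditional_cross_entropy(joint, true_cond)
bound_text_only = entropy(p_t) - conditional_cross_entropy(joint, text_only)

print(f"I(V;T)                    = {mi:.4f} nats")
print(f"bound, image-aware model  = {bound_exact:.4f} nats")
print(f"bound, text-only model    = {bound_text_only:.4f} nats")
```

In this toy setting the image-aware model attains the full mutual information, while the model that ignores the image yields a bound of zero. That is one simple way to picture what a text-biased predictor loses in cross-modal alignment.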

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques