Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models

Hanqi Yan; Xiangxiang Cui; Lu Yin; Jindong Gu; Paul Pu Liang; Yulan He; Yifei Wang

arXiv:2502.14888 · cs.CV · April 28, 2026

TL;DR

This paper introduces the Modality Dominance Score to measure modality gaps in vision-language models, proposes interpretability metrics, and demonstrates training-free model editing to improve downstream tasks.

Contribution

The paper presents a new metric, the Modality Dominance Score, for attributing multimodal features to specific modalities, and shows that training-free model editing can improve several downstream multimodal tasks.

Findings

01

The Modality Dominance Score (MDS) categorizes features as vision-dominant, language-dominant, or cross-modal.

02

Interpretability metrics enable scalable evaluation of modality-specific features.

03

Training-free model editing improves bias mitigation, adversarial robustness, and modality control.

Abstract

The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception, as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that training-free model editing enhances multiple downstream tasks, including…
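
The abstract does not give the formula behind the MDS, so the following is only a minimal sketch of how a three-way categorization of this kind might work. It assumes that each feature has recorded activations on image inputs and on text inputs, and that dominance is measured as the image share of the mean absolute activation. The function names, the score definition, and the `margin` threshold are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def modality_dominance(image_acts, text_acts):
    """Hypothetical score in [0, 1] (an assumption, not the paper's MDS).

    Returns the share of a feature's mean absolute activation that comes
    from image inputs: 1.0 means it fires only on images, 0.0 only on text.
    """
    img = mean(abs(a) for a in image_acts)
    txt = mean(abs(a) for a in text_acts)
    if img + txt == 0:
        return 0.5  # feature never fires: treat as having no preference
    return img / (img + txt)


def categorize(score, margin=0.15):
    """Map a score to one of the three classes named in the abstract.

    The margin around 0.5 is an illustrative choice, not a published value.
    """
    if score >= 0.5 + margin:
        return "vision-dominant"
    if score <= 0.5 - margin:
        return "language-dominant"
    return "cross-modal"


# Toy activations for three features (made-up numbers for illustration).
features = {
    "f_texture": ([0.9, 1.1, 1.0], [0.0, 0.1, 0.0]),
    "f_tone": ([0.0, 0.05, 0.0], [0.8, 1.2, 1.0]),
    "f_object": ([0.7, 0.9, 0.8], [0.8, 0.7, 0.9]),
}
labels = {
    name: categorize(modality_dominance(img, txt))
    for name, (img, txt) in features.items()
}
```

With these toy inputs, a feature that fires mostly on images lands in the vision-dominant class, one that fires mostly on text in the language-dominant class, and one that fires comparably on both in the cross-modal class. A symmetric margin keeps the cross-modal band centered on equal activation; a real implementation would follow the definition given in the paper itself.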
