Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models
Hanqi Yan, Xiangxiang Cui, Lu Yin, Jindong Gu, Paul Pu Liang, Yulan He, Yifei Wang

TL;DR
This paper introduces the Modality Dominance Score to measure modality gaps in vision-language models, proposes interpretability metrics, and demonstrates training-free model editing to improve downstream tasks.
Contribution
It presents a novel metric for quantifying modality-specific features and shows how training-free editing can enhance various multimodal applications.
Findings
The Modality Dominance Score (MDS) categorizes features as vision-dominant, language-dominant, or cross-modal.
Interpretability metrics enable scalable evaluation of modality-specific features.
Training-free model editing improves bias mitigation, adversarial robustness, and modality control.
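The findings above can be illustrated with a small sketch. This summary does not give the MDS formula or the editing procedure, so everything below is an assumption made for illustration: the score is taken to be a normalized difference between a feature's mean absolute activation on image inputs and on text inputs, the threshold of 0.5 is arbitrary, and the "edit" simply zeroes features of one category. The names `dominance_score`, `categorize`, and `suppress_features` are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def dominance_score(image_acts, text_acts, eps=1e-8):
    """Hypothetical score in [-1, 1] for a single feature.

    Near +1: the feature fires mostly on image inputs.
    Near -1: the feature fires mostly on text inputs.
    Near 0: the feature fires similarly on both.
    This is an assumed stand-in, not the paper's MDS definition.
    """
    v = fmean([abs(a) for a in image_acts])
    t = fmean([abs(a) for a in text_acts])
    return (v - t) / (v + t + eps)


def categorize(score, threshold=0.5):
    """Map a score to one of the three classes named in the summary.

    The threshold is an arbitrary illustrative choice.
    """
    if score > threshold:
        return "vision-dominant"
    if score < -threshold:
        return "language-dominant"
    return "cross-modal"


def suppress_features(features, labels, target):
    """Assumed training-free edit: zero every feature whose label is `target`.

    No weights are updated; only the feature vector is modified.
    """
    return [0.0 if lab == target else f for f, lab in zip(features, labels)]


# Toy activations for three features: (on image inputs, on text inputs).
toy = [
    ([1.0, 0.8], [0.0, 0.1]),  # fires mainly on images
    ([0.1, 0.0], [0.9, 1.0]),  # fires mainly on text
    ([0.5, 0.6], [0.5, 0.6]),  # fires on both
]
labels = [categorize(dominance_score(img, txt)) for img, txt in toy]
edited = suppress_features([0.3, 0.7, 0.1], labels, "language-dominant")
```

In this toy setup the three features fall into the three classes in order, and the edit removes only the second feature's contribution. A real pipeline would compute activations from the model's own feature space and would likely choose thresholds from the score distribution rather than fixing them by hand.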
Abstract
The success of vision-language models is primarily attributed to effective alignment across modalities such as vision and language. However, modality gaps persist in existing alignment algorithms and appear necessary for human perception, as evidenced by modality-specific phenomena like visual texture and linguistic tone. These observations motivate us to computationally measure and leverage modality gaps to improve downstream tasks. We first introduce the Modality Dominance Score (MDS), which attributes multimodal features to specific modalities by categorizing them into three classes: vision-dominant features, language-dominant features, and cross-modal features. We then propose automatic interpretability metrics to evaluate these modality-specific features in a scalable manner. Finally, we demonstrate that training-free model editing enhances multiple downstream tasks, including…
