TL;DR
This paper examines the geometric structure of the modality gap in multimodal models. It identifies anisotropic residuals, rather than a simple global shift, as the main obstacle to interchanging representations across modalities, and proposes AnisoAlign, a correction framework that uses geometric priors to improve modality alignment.
Contribution
The paper shows that the modality gap is anisotropic and introduces AnisoAlign, a geometric correction method aimed at improving unimodal training of multimodal models.
Findings
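Representations from different modalities already share compatible dominant semantic geometry.
The obstacle to modality interchangeability is not a simple global shift but an anisotropic residual concentrated along a small number of dominant directions.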
AnisoAlign improves geometric alignment and text-only multimodal training.
Abstract
Training multimodal large language models has long been limited by the scarcity of high-quality paired multimodal data. Recent studies show that the shared representation space of pretrained multimodal contrastive models can serve as a bridge, enabling models to perform multimodal training with unimodal data. However, the key premise of this paradigm remains insufficiently understood: can representations from different modalities be reliably interchanged? The core obstacle lies in the persistent Modality Gap in the shared space. In this work, we revisit the geometric nature of the modality gap. We find that modality representations already share compatible dominant semantic geometry. What truly hinders modality interchangeability is not a simple global shift, but an anisotropic residual structure concentrated along a small number of dominant directions. Based on this finding, we further…
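The distinction the abstract draws between a global shift and an anisotropic residual can be made concrete with a small diagnostic. The sketch below is not AnisoAlign and is not taken from the paper. It assumes paired image and text embeddings are available as NumPy arrays (the names `img`, `txt`, and `gap_diagnostics` are hypothetical), subtracts the mean gap vector, and measures how much of the remaining variance falls on the top few principal directions. The data are synthetic and exist only to exercise the function.

```python
import numpy as np

def gap_diagnostics(img, txt, k=3):
    """Split the per-pair gap into a global shift and a residual.

    img, txt: (n, d) arrays of paired embeddings (hypothetical inputs).
    Returns the mean shift vector and the fraction of residual variance
    captured by the top-k principal directions of the residual.
    """
    diff = img - txt                     # per-pair gap vectors
    mean_shift = diff.mean(axis=0)       # global shift component
    resid = diff - mean_shift            # what is left after removing the shift
    # Squared singular values are proportional to variance per direction.
    s = np.linalg.svd(resid, compute_uv=False)
    var = s ** 2
    top_k_share = var[:k].sum() / var.sum()
    return mean_shift, top_k_share

# Synthetic demo (assumed data, not results from the paper):
# a constant shift plus a residual living mostly in 3 directions.
rng = np.random.default_rng(0)
n, d = 500, 32
txt = rng.normal(size=(n, d))
basis, _ = np.linalg.qr(rng.normal(size=(d, 3)))    # 3 orthonormal directions
aniso = (2.0 * rng.normal(size=(n, 3))) @ basis.T   # strong, low-rank residual
iso = 0.05 * rng.normal(size=(n, d))                # weak isotropic noise
img = txt + 0.5 + aniso + iso

shift_est, share = gap_diagnostics(img, txt, k=3)
print(f"mean shift (avg over dims): {shift_est.mean():.3f}")
print(f"top-3 share of residual variance: {share:.3f}")
```

A top-k share close to 1 for small k means that removing the mean shift alone leaves a residual concentrated in a few directions, which is the kind of structure the abstract describes. A share near k/d would instead indicate a roughly isotropic residual.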
