The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment
Hongyuan Liu, Qinli Yang, Wen Li, Zhong Zhang, Jiaming Liu, Wei Han, Zhili Qin, Jinxia Guo, Junming Shao

TL;DR
This paper analyzes the geometric nature of the modality gap in vision-language models and introduces a curriculum-based fine-tuning method to significantly improve cross-modal alignment and task performance.
Contribution
It decomposes the modality gap into centroid and distribution components and proposes TPC-CMA, a curriculum fine-tuning framework that effectively reduces both components.
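To make the two components concrete, here is a minimal NumPy sketch. The paper's exact definitions are not given on this page, so both metrics are assumptions chosen for illustration: `centroid_gap` is the Euclidean distance between the modality means, and `distribution_gap` is a hypothetical proxy, the Frobenius norm of the difference between the two mean-centered covariance matrices. The embeddings are synthetic stand-ins, not CLIP outputs.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def centroid_gap(img, txt):
    """Assumed definition: distance between the two modality centroids."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))


def distribution_gap(img, txt):
    """Hypothetical proxy: covariance mismatch after removing each centroid."""
    img_c = img - img.mean(axis=0)
    txt_c = txt - txt.mean(axis=0)
    cov_img = img_c.T @ img_c / len(img_c)
    cov_txt = txt_c.T @ txt_c / len(txt_c)
    return float(np.linalg.norm(cov_img - cov_txt, ord="fro"))


# Synthetic embeddings: different centroids and different per-axis spread.
rng = np.random.default_rng(0)
img = l2_normalize(rng.normal(size=(200, 8)) + 2.0)
txt = l2_normalize(rng.normal(size=(200, 8)) * np.linspace(0.2, 2.0, 8) - 2.0)

# Mean-shift post-processing: translate text embeddings onto the image centroid.
shifted_txt = txt + (img.mean(axis=0) - txt.mean(axis=0))

print("centroid gap before/after shift:",
      centroid_gap(img, txt), centroid_gap(img, shifted_txt))
print("distribution gap before/after shift:",
      distribution_gap(img, txt), distribution_gap(img, shifted_txt))
```

Under these assumed definitions, a pure translation drives the centroid gap to zero but cannot change the covariance mismatch. This mirrors the abstract's observation that post-processing mainly removes the global centroid offset.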
Findings
Distribution gap is the main predictor of cross-modal task quality ($R^2=0.986$).
TPC-CMA reduces the modality gap by up to 82.3%.
Clustering ARI and captioning metrics improve significantly after fine-tuning.
Abstract
Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^2=0.986$), whereas the commonly used Raw Gap is misleading. Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a…
