The Geometry of Compromise: Unlocking Generative Capabilities via Controllable Modality Alignment

Hongyuan Liu; Qinli Yang; Wen Li; Zhong Zhang; Jiaming Liu; Wei Han; Zhili Qin; Jinxia Guo; Junming Shao

arXiv:2604.00279 · cs.CV · April 2, 2026

TL;DR

This paper analyzes the geometric nature of the modality gap in vision-language models and introduces a curriculum-based fine-tuning method to significantly improve cross-modal alignment and task performance.

Contribution

It decomposes the modality gap into centroid and distribution components and proposes TPC-CMA, a curriculum fine-tuning framework that effectively reduces both components.
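
As a rough illustration of the decomposition, the sketch below splits the gap between paired image and text embeddings into a centroid term and a residual term. This summary does not give the paper's exact definitions, so the function name `gap_components` and all three formulas are assumptions made for illustration, not the authors' metrics: the raw gap is taken as the mean paired distance, the centroid gap as the distance between the two modality means, and the distribution gap as the mean paired distance after each modality is centered on its own mean.

```python
import numpy as np


def gap_components(img: np.ndarray, txt: np.ndarray) -> dict:
    """Split the gap between paired embeddings into illustrative components.

    img, txt: arrays of shape (n_pairs, dim); row k of each forms a pair.
    All three definitions below are assumptions for illustration only.
    """
    mu_img = img.mean(axis=0)
    mu_txt = txt.mean(axis=0)

    # Assumed raw gap: mean Euclidean distance between paired embeddings.
    raw_gap = np.linalg.norm(img - txt, axis=1).mean()

    # Assumed centroid gap: distance between the two modality means.
    centroid_gap = np.linalg.norm(mu_img - mu_txt)

    # Assumed distribution gap: paired distance that remains once each
    # modality has been centered, i.e. after the global offset is removed.
    residual = (img - mu_img) - (txt - mu_txt)
    distribution_gap = np.linalg.norm(residual, axis=1).mean()

    return {
        "raw_gap": float(raw_gap),
        "centroid_gap": float(centroid_gap),
        "distribution_gap": float(distribution_gap),
    }
```

Under these assumed definitions, text embeddings that are an exact translated copy of the image embeddings give a zero distribution gap, while the raw and centroid gaps both equal the length of the translation. This mirrors the summary's point that removing a global offset alone does not address any remaining distributional mismatch.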

Findings

01

The Distribution Gap is the main predictor of cross-modal task quality ($R^2 = 0.986$).

02

TPC-CMA reduces the modality gap by up to 82.3%.

03

Clustering ARI and captioning metrics improve significantly after fine-tuning.

Abstract

Vision-Language Models (VLMs) such as CLIP learn a shared embedding space for images and text, yet their representations remain geometrically separated, a phenomenon known as the modality gap. This gap limits tasks requiring cross-modal interchangeability, such as captioning and joint clustering. Existing post-processing approaches can partially improve cross-modal compatibility; however, we show through geometric analysis that they primarily reduce the global centroid offset while leaving the underlying distributional mismatch intact. We decompose the modality gap into a Centroid Gap and a Distribution Gap, and demonstrate that the Distribution Gap is the true predictor of cross-modal task quality ($R^{2} = 0.986$), whereas the commonly used Raw Gap is misleading ($R^{2} = 0.691$). Motivated by this observation, we propose TPC-CMA (Three-Phase Curriculum for Cross-Modal Alignment), a…
