The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence
Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

TL;DR
This paper develops a measure-theoretic framework for analyzing contrastive learning, distinguishing unimodal and multimodal geometric regimes and clarifying the role of entropy in representation alignment and modality gaps, supported by experiments on synthetic data and pretrained models.
Contribution
It introduces a measure-theoretic approach to contrastive learning, characterizing geometric regimes and the interplay of alignment, entropy, and cross-modal divergence.
Findings
The unimodal regime has a strictly convex intrinsic energy with a unique Gibbs equilibrium.
The symmetric multimodal regime exhibits a cross-coupled geometry with persistent modality gaps.
Entropy shapes the energy landscape; in the unimodal case it acts as a tie-breaker within the aligned basin.
Abstract
While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment-uniformity decomposition. We develop a measure-theoretic framework in which learning evolves representation measures on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality's marginal reshapes the effective landscape of the other, allowing…
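For readers unfamiliar with the alignment-uniformity decomposition the abstract refers to, the following is a minimal NumPy sketch of the standard batch InfoNCE loss split into an alignment term (negative mean positive-pair similarity) and a dispersion term (mean log-sum-exp over the batch). It is an illustration under assumptions, not the paper's formulation: the function names `normalize` and `info_nce_terms`, the temperature `tau`, the batch size, the embedding dimension, and the synthetic data are all hypothetical choices made here.

```python
import numpy as np

def normalize(x):
    # Project each row onto the unit sphere (a common choice of embedding manifold).
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce_terms(z, z_pos, tau=0.5):
    # Pairwise similarities scaled by temperature; row i holds anchor i against all candidates.
    sims = z @ z_pos.T / tau
    # Alignment term: negative mean similarity of the matched (diagonal) pairs.
    alignment = -np.mean(np.diag(sims))
    # Dispersion term: mean log-sum-exp over each row, computed stably.
    row_max = sims.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(sims - row_max).sum(axis=1))
    dispersion = np.mean(lse)
    # Standard InfoNCE is the sum of the two terms.
    return alignment, dispersion, alignment + dispersion

# Hypothetical synthetic data: anchors and slightly perturbed positives.
rng = np.random.default_rng(0)
z = normalize(rng.normal(size=(256, 8)))
z_pos = normalize(z + 0.1 * rng.normal(size=z.shape))

alignment, dispersion, loss = info_nce_terms(z, z_pos)
print(f"alignment={alignment:.4f} dispersion={dispersion:.4f} loss={loss:.4f}")
```

The loss equals the mean cross-entropy of a softmax over each row with the matched pair as the target, so it is non-negative. Increasing the batch size in this sketch is one informal way to watch the stochastic objective stabilize, in the spirit of the large-batch limit the abstract mentions.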
