Towards Uniformity and Alignment for Multimodal Representation Learning
Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves

TL;DR
This paper identifies conflicts in multimodal representation learning caused by InfoNCE objectives and proposes a decoupled approach to alignment and uniformity that improves performance on cross-modal tasks.
Contribution
It introduces a novel decoupling method for alignment and uniformity in multimodal learning, backed by theoretical guarantees and extensive experimental validation.
Findings
Reduces distribution gaps across modalities.
Improves retrieval and generation performance.
Provides theoretical guarantees for the method.
Abstract
Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then…
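To make the two quantities named in the abstract concrete, the sketch below computes the standard alignment and uniformity metrics on the unit hypersphere, as commonly defined in the contrastive-learning literature (Wang and Isola, 2020). It is an illustration of the general notions only, not the paper's decoupled objective; the toy "image" and "text" embeddings, the function names, and the temperature `t=2.0` are assumptions chosen for the example.

```python
import math
import random

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(za, zb):
    """Mean squared distance between paired embeddings (lower = better aligned)."""
    return sum(sq_dist(u, v) for u, v in zip(za, zb)) / len(za)

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform)."""
    n = len(z)
    total = sum(
        math.exp(-t * sq_dist(z[i], z[j]))
        for i in range(n)
        for j in range(i + 1, n)
    )
    return math.log(total / (n * (n - 1) / 2))

random.seed(0)
# Hypothetical toy data: 64 paired embeddings in 8 dimensions.
# "txt" is a noisy copy of "img", so the pairs are nearly aligned.
img = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(64)]
txt = [normalize([x + random.gauss(0, 0.1) for x in v]) for v in img]
# A fully collapsed modality: every sample maps to the same point.
collapsed = [img[0]] * 64

print("alignment(img, txt):  ", alignment(img, txt))
print("uniformity(img):      ", uniformity(img))
print("uniformity(collapsed):", uniformity(collapsed))
```

A collapsed set of embeddings is trivially easy to align but scores the worst possible uniformity (zero), while well-spread embeddings give a negative value. That is the tension the abstract refers to when both terms act on the same cross-modal pairs.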
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The work addresses a potential roadblock in contrastive learning, i.e., the possibility that the alignment of positive tuples is interfered with by the uniformity criterion. 2. The authors do a good job of conveying the intuition behind the problem that they are tackling. 3. For the kind of alignment-uniformity conflict presented in their Assumption 1, the authors theoretically prove that it increases as the number of modalities grows.
1. Although the authors argue for the existence of a cross-modality uniformity conflict that opposes alignment, there is no clear evidence for this, either in the existing literature or in this work. For instance, in Line 104, the authors refer to Yin et al. (2025), mentioning that they "clearly demonstrate that uniformity across modalities (“inter-uniformity”) conflicts with the alignment term". However, I found no such results upon going through that work. The claim can be described as
1. The paper is clearly written, presenting the motivation, conflicts, and proposed solution in a well-structured manner. 2. The proposed method is theoretically grounded and experimentally validated, demonstrating robust performance across multimodal retrieval and generation tasks.
1. There are some writing errors, such as “InfoNEC” in Section 2.2. 2. Although anchor-based alignment eliminates cross-modal rejection, it introduces modal bias. I have a question: Could the selection of different anchor modalities lead to representation imbalance? 3. Your approach employs intra-modal consistency to prevent representation collapse. Was modal collapse considered during training? 4. The proposed global Hölder divergence is defined over multiple modality distributions. Is this diverge
1. The paper tackles a relevant and well-known problem — the modality gap in multimodal contrastive learning — and extends previous analyses from the bimodal to the general multimodal case. 2. The proposed framework is straightforward to implement, requiring no architectural modifications or additional modules. 3. The empirical results are encouraging, showing that the proposed method consistently outperforms existing alternatives.
## Major
1. **Theoretical clarity.** The theoretical section is difficult to follow, and several key definitions are vague or insufficiently motivated. In particular:
   - In Eq. (3), it is unclear why $V_a$ represents the alignment force and $\Phi_a$ the uniformity force; this connection should be explained more carefully.
   - Assumption 1 is introduced without justification; it is not obvious why it should hold or be meaningful in practice.
   - In the boxed text of Section 3.1, the
The idea is sound and wisely combines two crucial topics: scaling multimodal learning to more modalities and reducing the modality gap between modality representations. I appreciate the colored summary box on page 4, as it helps fix the main concepts. The results in the generation task are convincing and interesting. Overall, the paper is interesting!
W1) In the box and throughout the paper, the authors say that it is crucial to have intra-modality uniformity and conflict-free alignment. Subsequently, they introduce the uniformity loss U(Z) and L_align. However, they also add the centroid uniformity to the total loss. Why? If I understood correctly, the total loss then has two uniformity terms (U(Z) and U(C)), plus the alignment term and the Gram volume. I am afraid that the contribution of the uniformity losses may be too strong
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
