Closing the Modality Gap Aligns Group-Wise Semantics

Eleonora Grassucci; Giordano Cicchetti; Emanuele Frasca; Aurelio Uncini; Danilo Comminiello

arXiv:2601.18525·cs.LG·January 27, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the modality gap in multimodal learning, demonstrating that reducing this gap significantly improves group-wise semantic tasks like clustering, while having limited effect on instance-wise tasks such as retrieval.

Contribution

The paper introduces a novel method to reduce the modality gap in multimodal spaces and shows its effectiveness in enhancing group-wise semantic tasks.

Findings

01

Reducing the modality gap improves group-wise tasks significantly.

02

The gap has limited impact on instance-wise tasks like retrieval.

03

The method extends to multiple modalities beyond two.
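The contrast between findings 01 and 02 can be illustrated with a toy experiment. The sketch below is not the paper's method or analysis. It assumes synthetic Gaussian class clusters, a modality gap modeled as a constant offset vector `gap`, and retrieval scored by unnormalized dot product; all function and variable names are hypothetical. Under those assumptions, the offset adds the same constant to every score of a given query, so per-query rankings stay the same (up to floating-point rounding), while the within-class scatter of the pooled embeddings grows with the squared norm of the offset.

```python
import numpy as np

def retrieval_ranks(queries, gallery):
    """For each query row, gallery indices sorted by descending dot-product score."""
    scores = queries @ gallery.T
    return np.argsort(-scores, axis=1, kind="stable")

def pooled_within_class_scatter(a, b, labels):
    """Sum of squared distances to the class mean, pooling both modalities per class."""
    total = 0.0
    for k in np.unique(labels):
        pooled = np.concatenate([a[labels == k], b[labels == k]], axis=0)
        total += np.sum((pooled - pooled.mean(axis=0)) ** 2)
    return float(total)

# Synthetic setup (assumption): 3 classes, 20 paired samples each, 8 dimensions.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 3, 20, 8
labels = np.repeat(np.arange(n_classes), per_class)
centers = 3.0 * rng.normal(size=(n_classes, dim))

a = centers[labels] + 0.3 * rng.normal(size=(labels.size, dim))  # modality A
b = a + 0.1 * rng.normal(size=a.shape)                           # modality B, no gap
gap = np.full(dim, 2.0)                                          # constant offset (assumption)
b_gap = b + gap                                                  # modality B with a gap

# Instance-wise view: compare the top-1 retrieved item per query, with and without the gap.
same_top1 = np.array_equal(retrieval_ranks(a, b)[:, 0],
                           retrieval_ranks(a, b_gap)[:, 0])

# Group-wise view: compare pooled within-class scatter, with and without the gap.
scatter_no_gap = pooled_within_class_scatter(a, b, labels)
scatter_gap = pooled_within_class_scatter(a, b_gap, labels)

print("top-1 retrieval unchanged:", same_top1)
print("within-class scatter without gap:", scatter_no_gap)
print("within-class scatter with gap:   ", scatter_gap)
```

With equal sample counts per modality, the pooled scatter of a class is the sum of the two per-modality scatters plus half the per-modality sample count times the squared distance between the two modality means of that class. This is why a constant offset inflates it. With cosine similarity on re-normalized embeddings the ranking invariance is no longer exact, so the toy only conveys the intuition.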

Abstract

In multimodal learning, CLIP has been recognized as the de facto method for learning a shared latent space across multiple modalities, placing similar representations close to each other and moving them away from dissimilar ones. Although CLIP-based losses effectively align modalities at the semantic level, the resulting latent spaces often remain only partially shared, revealing a structural mismatch known as the modality gap. While the necessity of addressing this phenomenon remains debated, particularly given its limited impact on instance-wise tasks (e.g., retrieval), we prove that its influence is instead strongly pronounced in group-level tasks (e.g., clustering). To support this claim, we introduce a novel method designed to consistently reduce this discrepancy in two-modal settings, with a straightforward extension to the general $n$-modal case. Through our extensive…
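To make the two notions in the abstract concrete, the following sketch shows a CLIP-style symmetric contrastive loss and one common way of quantifying the modality gap, namely the distance between the centroids of the L2-normalized embeddings of each modality. It is written under stated assumptions: the function names, the temperature value of 0.07, and the centroid-distance measure are illustrative choices and are not taken from the paper, whose own loss and gap metric may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE loss; row i of x is the true pair of row i of y."""
    x, y = l2_normalize(x), l2_normalize(y)
    logits = x @ y.T / temperature          # pairwise cosine similarities, scaled
    idx = np.arange(len(x))
    loss_xy = -log_softmax(logits, axis=1)[idx, idx].mean()  # x -> y direction
    loss_yx = -log_softmax(logits, axis=0)[idx, idx].mean()  # y -> x direction
    return float(0.5 * (loss_xy + loss_yx))

def centroid_gap(x, y):
    """Euclidean distance between the modality centroids on the unit sphere."""
    return float(np.linalg.norm(l2_normalize(x).mean(axis=0)
                                - l2_normalize(y).mean(axis=0)))

# Toy usage with random embeddings (assumption: 16 pairs, 8 dimensions).
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
y_shifted = x + 1.5                        # same content, shifted by a constant offset

print("loss, matched pairs:   ", clip_loss(x, x))
print("loss, mismatched pairs:", clip_loss(x, np.roll(x, 1, axis=0)))
print("gap, identical spaces: ", centroid_gap(x, x))
print("gap, shifted modality: ", centroid_gap(x, y_shifted))
```

The contrastive loss only rewards true pairs for scoring higher than the other pairs in the batch, so it does not by itself force the two centroids to coincide. This is one intuition for why a gap can persist in a space that is semantically well aligned.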

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The observation that the modality gap does not necessarily affect fine-grained tasks like retrieval in some cases but could affect group-level tasks like clustering is a novel contribution.
2. The idea of applying cluster-based contrastive losses to close the modality gap in multimodal models is interesting.
3. The motivation of the paper is clear and the authors do a good job conveying intuitions for some of their claims.
4. Indeed, as the authors claim, their empirical results confirm that

Weaknesses

1. In Section 3.2, the authors mention that one of the reasons contributing to the modality gap is the use of networks with different random initializations for different modalities. What about the scenario where weights are shared across modalities (all modalities are encoded via the same network), since that is how several widely used multimodal models are trained? Furthermore, although the authors claim this to be one of the sources of the modality gap, their solution to resolving it only revolves

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The paper clearly revisits an important distinction between instance-wise and group-wise tasks in multimodal learning, providing a new perspective on the modality gap phenomenon.
2. The theoretical analysis offers a reasonable explanation for why the modality gap affects clustering performance but not retrieval, with the derivation showing how the gap inflates within-class scatter.
3. The paper provides thorough experimental validation across diverse datasets and modalities, with clear visua

Weaknesses

1. The core idea of combining alignment and uniformity losses is not fundamentally new. Some methods previously established that contrastive learning involves alignment and uniformity, and explored closing the modality gap with similar objectives. The specific combination of $\mathcal{L}_{\mathrm{ATP}}$ and $\mathcal{L}_{\mathrm{CU}}$ doesn't represent a significant methodological advance.
2. The paper states that "closing the modality gap" is necessary for group-wise tasks, but the marginal improvements on some datasets (like CIFAR-10) sugge

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper tackles a timely and underexplored problem in multimodal contrastive learning, the **modality gap**, and offers an insightful analysis of its effects.
2. The distinction between *instance-wise* and *group-wise* consequences of the modality gap is both novel and intuitively well-motivated.
3. The proposed loss functions are simple, differentiable, and easily integrated into existing CLIP-style frameworks.
4. The experimental validation is extensive, spanning multiple bi- and

Weaknesses

## Major

1. **Limited theoretical grounding.**
   - The main theoretical contribution, that the modality gap influences group-wise semantics, is persuasive but remains largely qualitative. A more formal treatment (e.g., linking it to the alignment–uniformity trade-offs in contrastive theory) would strengthen the argument.
   - The principal practical novelty appears to be Equation (8), as Equation (7) corresponds to the standard “alignment” term introduced by Wang & Isola (2020). Althoug

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling