TL;DR
This paper identifies a modality gap in contrastive vision-language models like CLIP for mixed modality search and introduces GR-CLIP, a calibration method that significantly improves retrieval performance with minimal additional computation.
Contribution
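The paper shows that the modality gap in CLIP's embedding space causes intra-modal ranking bias and inter-modal fusion failure in mixed modality search, and proposes GR-CLIP, a lightweight post-hoc calibration technique that removes the gap and improves retrieval quality at low compute cost.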
Findings
GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP.
GR-CLIP surpasses recent generative models by 4 percentage points.
GR-CLIP uses 75x less compute than comparable methods.
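For readers unfamiliar with the metric behind these numbers, NDCG@10 can be sketched as follows. This is a minimal illustration, assuming the common linear-gain form with a log2 rank discount; the paper's exact variant is not stated here, and the function names are hypothetical.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels, in ranked order."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of retrieved items, listed in the order the system ranked them.
score = ndcg_at_k([0, 1, 0, 0], k=10)
```

A gain reported in percentage points is an absolute difference on the 0-100 scale: with hypothetical scores of 0.40 and 0.66, the improvement would be 26 points.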
Abstract
Mixed modality search -- retrieving information across a heterogeneous corpus composed of images, texts, and multimodal documents -- is an important yet underexplored real-world application. In this work, we investigate how contrastive vision-language models, such as CLIP, perform on the mixed modality search task. Our analysis reveals a critical limitation: these models exhibit a pronounced modality gap in the embedding space, where image and text embeddings form distinct clusters, leading to intra-modal ranking bias and inter-modal fusion failure. To address this issue, we propose GR-CLIP, a lightweight post-hoc calibration method that removes the modality gap in CLIP's embedding space. Evaluated on MixBench -- the first benchmark specifically designed for mixed modality search -- GR-CLIP improves NDCG@10 by up to 26 percentage points over CLIP, surpasses recent vision-language…
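The idea of removing the modality gap can be illustrated with per-modality mean centering, the mean-shift technique quoted in the peer reviews (feeding x − E_x[x] instead of x). The sketch below is an assumption about how such a calibration might look, not the authors' implementation: the re-normalization step, the function names, and the toy 2-D embeddings are all hypothetical, and multimodal documents are not handled.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def center(vectors, mean):
    """Subtract a modality mean from each embedding, then re-normalize."""
    return [l2_normalize([x - m for x, m in zip(v, mean)]) for v in vectors]

def cosine(a, b):
    """Cosine similarity for unit-length inputs (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))

# Toy data (hypothetical): axis 0 encodes modality, axis 1 encodes meaning.
image_emb = [l2_normalize(v) for v in ([1.0, 0.2], [1.0, -0.2])]
text_emb = [l2_normalize(v) for v in ([-1.0, 0.2], [-1.0, -0.2])]

# Calibration: estimate one mean per modality and subtract it.
image_cal = center(image_emb, mean_vector(image_emb))
text_cal = center(text_emb, mean_vector(text_emb))

# Before: an image scores closer to an unrelated image than to matching text.
before_cross = cosine(image_emb[0], text_emb[0])
before_intra = cosine(image_emb[0], image_emb[1])

# After: the matching text outranks the unrelated image.
after_cross = cosine(image_cal[0], text_cal[0])
after_intra = cosine(image_cal[0], image_cal[1])
```

In this constructed example the shared offset along the modality axis dominates the similarity scores until it is subtracted, which mirrors the intra-modal ranking bias the abstract describes. In practice the modality means would have to be estimated from a sample of embeddings.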
Peer Reviews
Decision·Submitted to ICLR 2026
The paper makes valuable contributions through 1. formulating an important underexplored problem (characterization of ranking bias and fusion failure in a retrieval context); 2. creating a well-designed benchmark (introducing MixBench); 3. demonstrating a simple, effective, and reproducible method (application to mixed modality retrieval); 4. conducting systematic evaluation across models, datasets, and modalities (demonstrating NDCG@10 improvements).
1. The core method is not novel. It directly applies mean-shift calibration proposed by “Diagnosing and Rectifying Vision Models” (Zhang et al., ICLR 2023) and formalized in “Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data” (Zhang et al., ICLR 2024). The prior work explicitly states: "We propose a simple technique to close the modality gap... During training, instead of feeding x to the model h, we feed it with x − E_x[x]." and the paper inadequately acknowledges this.
1 - The paper addresses the known modality gap issue, clearly positions the contribution within mixed-modality search and gives a clear course of action to solve it. 2 - The proposed method, GR-CLIP, is both simple and effective, as it can be easily applied to various models and doesn’t require extensive resources. 3 - Introduction of a new benchmark for mixed-modality search, outlining the limitations of current models and the effectiveness of the proposed method. 4 - Comprehensive empiric
1 - The method presented in the paper lacks novelty, as the modality gap has already been studied in previous work. 2 - Without baselines to compare against, it is hard to assess how GR-CLIP performs relative to other calibration methods. 3 - The theoretical explanations are limited; the paper would benefit from an analysis of where the gap originates and on which datasets it might be absent.
1. The paper tackles a well-defined and important problem—mixed modality search. The identification of the modality gap and the proposed post-hoc calibration method are interesting contributions to the field of multimodal retrieval. 2. The authors perform extensive evaluations on several datasets and compare GR-CLIP with both CLIP and state-of-the-art generative methods, such as VLM2Vec, which adds credibility to their claims.
* The authors address the issue of image and text alignment in CLIP and attempt to overcome the gap between embeddings of different modalities. However, this is a classic problem, and the authors fail to compare their method with some important approaches. For instance, UniVL-DR [1] extends images with captions and jointly encodes both the image and caption to address the embedding gap. The authors should compare their method with these approaches. * While CLIP's extracted embeddings indeed exhi
