RankCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zhili Feng, Zenghui Ding, Yining Sun

TL;DR
RankCLIP introduces a ranking-based pretraining approach that captures complex many-to-many relationships in vision-language models, significantly improving zero-shot classification performance over existing methods.
Contribution
It extends CLIP's pairwise loss to list-wise ranking, enabling better modeling of nuanced relationships between images and texts.
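For reference, the pairwise objective being extended can be sketched as follows. This is a minimal NumPy illustration of a CLIP-style symmetric contrastive loss, not the authors' code: the function names and the temperature value of 0.07 are assumptions made for the example. Each image is scored against every text in the batch, and only the matched pair on the diagonal counts as a positive, which is the one-to-one assumption the paper argues against.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def clip_pairwise_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    img, txt: arrays of shape (batch, dim); row i of each forms a matched pair.
    temperature: assumed value for illustration only.
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature           # (batch, batch) similarities
    idx = np.arange(len(img))
    # Image-to-text: softmax over texts for each image (rows).
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: softmax over images for each text (columns).
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Every off-diagonal entry is treated as an equally wrong negative, so the loss carries no information about which unmatched texts are closer to a given image than others.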
Findings
Achieves significant zero-shot classification improvements
Effectively models complex many-to-many relationships
Outperforms state-of-the-art methods in downstream tasks
Abstract
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RankCLIP, a novel pre-training method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RankCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RankCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art…
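The abstract does not spell out the list-wise loss, so the following is only a sketch of one plausible reading. It assumes a Plackett-Luce (ListMLE-style) ranking likelihood, which is a common list-wise model; the function names, the temperature, and the choice of which similarity matrix serves as the reference ranking are all assumptions for illustration. The cross-modal term asks image-to-text rankings to agree with text-to-image rankings, and the in-modal term asks image-image rankings to agree with text-text rankings.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def listmle_nll(scores, reference):
    """Plackett-Luce negative log-likelihood, averaged over rows.

    For each row, the target ordering is the descending order of `reference`;
    the loss is low when `scores` ranks the items in that same order.
    """
    order = np.argsort(-reference, axis=1)            # target ordering per row
    s = np.take_along_axis(scores, order, axis=1)     # scores in target order
    # log(sum_{k >= j} exp(s_k)) for every position j, computed stably.
    log_denominator = np.logaddexp.accumulate(s[:, ::-1], axis=1)[:, ::-1]
    return (log_denominator - s).sum(axis=1).mean()


def rank_consistency_losses(img, txt, temperature=0.07):
    """Return (cross_modal, in_modal) ranking-consistency terms.

    Hypothetical formulation: in a real training loop the `reference`
    argument would be treated as a constant (no gradient).
    """
    img, txt = l2_normalize(img), l2_normalize(txt)
    cross = img @ txt.T / temperature       # image-to-text similarities
    img_img = img @ img.T / temperature     # image-image similarities
    txt_txt = txt @ txt.T / temperature     # text-text similarities
    cross_modal = 0.5 * (listmle_nll(cross, cross.T) + listmle_nll(cross.T, cross))
    in_modal = 0.5 * (listmle_nll(img_img, txt_txt) + listmle_nll(txt_txt, img_img))
    return cross_modal, in_modal
```

In a setup like this, the two terms would typically be added to the contrastive objective with scalar weights. This page does not state how the authors weight them, which is the balance one of the reviews below asks about.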
Peer Reviews
Decision · Submitted to ICLR 2025
- Reasonable Motivation: The paper identifies the limitations of existing models like CLIP, discussing the importance of capturing many-to-many relationships in multimodal data, which provides a reasonable motivation for the development of RANKCLIP.
- Extensive Evaluation: RANKCLIP is evaluated across a variety of downstream tasks, including zero-shot image classification, retrieval, and linear probe classification, showing its applicability in different contexts.
- Lack of Discussion on Related Works: The paper does not adequately discuss other works that also aim to construct many-to-many relationships in vision-language pretraining. For example, [1] proposed a progressive self-distillation method that uses image-to-text logits (and vice versa) as targets, while [2] introduced in-modal consistency.
- Lack of Novelty: RANKCLIP closely resembles the method described in [1], raising questions about its novelty.
- Misaligned Experiment Settings: The exper
1. RANKCLIP is designed to handle the tricky many-to-many relationships between images and text. Instead of just looking at pairs in isolation, it uses a ranking approach to enhance model performance.
2. When testing against data that’s a little different than what it was trained on, RANKCLIP still holds up well. It also has a knack for understanding semantic nuances, making it better at tasks like image-text retrieval.
1. Although it performs well on variants of ImageNet1K with natural distribution shifts, its top-3 and top-5 accuracy on CIFAR-10 is even lower than that of CLIP.
2. The comparison includes too few SOTA methods; additional methods such as CyCLIP and SoftCLIP should be included to convincingly demonstrate the superiority of the proposed method.
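The top-3 and top-5 figures this review refers to are top-k accuracies from zero-shot classification. As a rough sketch of how such a metric is usually computed (the function name and inputs are hypothetical, and this is not the paper's evaluation code): each image embedding is compared against one text embedding per class, and a prediction counts as correct if the true class is among the k most similar.

```python
import numpy as np


def zero_shot_topk_accuracy(image_emb, class_text_emb, labels, ks=(1, 3, 5)):
    """Top-k accuracy of nearest-class-text classification.

    image_emb: (n_images, dim) image embeddings.
    class_text_emb: (n_classes, dim), one text embedding per class prompt.
    labels: (n_images,) integer class indices.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    class_text_emb = class_text_emb / np.linalg.norm(
        class_text_emb, axis=1, keepdims=True
    )
    sims = image_emb @ class_text_emb.T          # cosine similarity to each class
    ranked = np.argsort(-sims, axis=1)           # classes sorted best-first
    labels = np.asarray(labels)
    # A hit at k means the true label appears in the first k ranked classes.
    return {
        k: float((ranked[:, :k] == labels[:, None]).any(axis=1).mean())
        for k in ks
    }
```

Top-k accuracy can only increase with k for a given model, so a model that trails on top-3 and top-5 is placing the true class lower in its ranking even when it is not ranked first.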
- **Novelty of Rank Consistency Objective**: The rank consistency objective appears to be novel, as existing methods like the CLIP objective rely on an instance discrimination task focused on distinguishing positive examples from negatives. While embedding-level correlations between similar samples are known to emerge naturally from this approach, no method to date has directly leveraged this correlation. The proposed rank consistency loss effectively utilizes this inherent similarity, potential
- **Integration with the Original CLIP Objective**: While the method improves experimental results, further analysis could clarify how the proposed rank consistency objective interacts with the original CLIP objective. For instance, it would be helpful to understand the balance between the two objectives, or if the rank consistency objective alone could effectively learn cross-modal alignment embeddings. This discussion is currently lacking.
- **Limited Ablation Study on Loss Components**: The
Taxonomy
Topics: Multimodal Machine Learning Applications · Cancer-related molecular mechanisms research · Domain Adaptation and Few-Shot Learning
Methods: Sparse Evolutionary Training · Contrastive Learning · Contrastive Language-Image Pre-training
