Ranking-aware adapter for text-driven image ordering with CLIP
Wei-Hsiang Yu, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai

TL;DR
This paper introduces a ranking-aware adapter for CLIP that improves text-guided image ranking by combining learnable prompts with text-conditioned visual differences. It outperforms fine-tuned CLIP and is competitive with task-specific models.
Contribution
It proposes a lightweight adapter with ranking-aware attention for CLIP, enabling effective learning-to-rank for image ordering with minimal task-specific prompting.
Findings
Outperforms fine-tuned CLIP on various ranking tasks
Achieves competitive results with task-specific models
Provides a generalized approach for image ranking with a single instruction
Abstract
Recent advances in vision-language models (VLMs) have made significant progress in downstream tasks that require quantitative concepts such as facial age estimation and image quality assessment, enabling VLMs to explore applications like image ranking and retrieval. However, existing studies typically focus on the reasoning based on a single image and heavily depend on text prompting, limiting their ability to learn comprehensive understanding from multiple images. To address this, we propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task and introduces a lightweight adapter to augment CLIP for text-guided image ranking. Specifically, our approach incorporates learnable prompts to adapt to new instructions for ranking purposes and an auxiliary branch with ranking-aware attention, leveraging text-conditioned visual differences for additional…
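As a rough illustration of the learning-to-rank framing described in the abstract, the sketch below scores images against a single text embedding by cosine similarity and trains against a RankNet-style pairwise logistic loss. This is not the authors' implementation: the choice of loss, the function names (`cosine`, `rank_images`, `pairwise_logistic_loss`), and the toy 2-D embeddings are assumptions made for illustration, and the adapter, learnable prompts, and ranking-aware attention are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_emb, image_embs):
    """Score each image against one text embedding; return (order, scores).

    `order` lists image indices from highest to lowest score.
    """
    scores = [cosine(text_emb, img) for img in image_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def pairwise_logistic_loss(scores, labels):
    """RankNet-style loss (an assumed stand-in, not the paper's objective).

    For every pair (i, j) with labels[i] > labels[j], add
    log(1 + exp(-(scores[i] - scores[j]))), then average over such pairs.
    """
    total, count = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                total += math.log1p(math.exp(-(scores[i] - scores[j])))
                count += 1
    return total / count if count else 0.0


# Hypothetical toy data: one text embedding, three image embeddings,
# and ground-truth ordinal labels (e.g. object counts).
text_emb = [1.0, 0.0]
image_embs = [[0.2, 1.0], [1.0, 0.1], [0.6, 0.6]]
labels = [1, 3, 2]

order, scores = rank_images(text_emb, image_embs)
loss = pairwise_logistic_loss(scores, labels)
print(order, round(loss, 4))
```

The loss shrinks as the score gap of each correctly ordered pair grows, so minimizing it pushes the scores toward the label ordering. A pairwise objective of this kind depends only on relative order, which is why it suits ranking tasks where absolute label values matter less than comparisons between images.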
Peer Reviews
Decision: ICLR 2025 Poster
- *Generalizable Across Ranking Tasks*: The ranking-aware adapter is designed to handle multiple text-driven ranking tasks (e.g., age estimation, object counting, image quality assessment) without requiring extensive task-specific tuning. This flexibility suggests that the approach could generalize across diverse ranking applications, potentially making it adaptable for other vision-language tasks where relative comparisons matter.
- *Improved Performance on Benchmarks*: the proposed method imp
- **Clarity and Reproducibility Issues**: The paper is challenging to follow, with several instances where the context is unclear, and symbols (e.g., Eq(5) and ΔO) are introduced without proper explanation. This lack of clarity makes it difficult to fully understand the method and poses challenges for reproducing the results. In particular, additional context is recommended regarding:
  - The specific role and application of the ranking score across different tasks.
  - Which of the two heads
1. The writing is clear and well-structured, making the content easy to follow. The logical flow helps readers grasp the key concepts without difficulty.
2. The motivation behind the study is explained exceptionally well. It highlights that existing research often centers on reasoning from a single image and relies heavily on text prompts. This approach restricts the ability to achieve a comprehensive understanding when multiple images are involved. By addressing these limitations, the study ai
1. Comparison with Existing Methods: There is a need to clearly delineate the core differences between OrdinalCLIP, L2RCLIP, and NumCLIP compared to existing methods, which might not be fully addressed.
2. State-of-the-Art Comparisons: The article does not adequately compare the proposed models to state-of-the-art multi-modal large language models (LLMs), which could provide a more comprehensive evaluation of their performance.
3. Performance on Complex Benchmarks: The performance on complex c
1. The motivation is clear: while previous methods require generating multiple captions for input images, this approach only needs a single rank-related text prompt.
2. The proposed ranking-aware adapter achieves superior performance over fine-tuned CLIP and is competitive with specialized models for tasks like facial age estimation and image quality assessment, offering a versatile solution for text-guided image ranking.
1. Although the motivation of this paper is clear, the technical contribution does not appear strong enough from my perspective. I will wait to see other reviewers’ comments on this aspect.
2. In Equation (2), the symbols $V_i$ and $V_j$ are not explained. Adding the shapes or dimensions of certain symbols in Equation (2) would enhance clarity.
3. Regarding the experiments: why does ranking-aware attention utilize three MLP blocks?
Minor Issues:
- Consider adding count numbers to the results in
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training · Focus · Adapter
