Finetuning CLIP to Reason about Pairwise Differences
Dylan Sam, Devin Willmott, Joao D. Semedo, J. Zico Kolter

TL;DR
This paper introduces a finetuning method for CLIP that trains it to reason about differences between images. By leveraging difference-based embeddings, the method improves image ranking and zero-shot classification and enables comparative inference.
Contribution
The authors propose a novel contrastive finetuning approach that trains CLIP to understand and utilize differences in image embeddings, enhancing its reasoning and classification capabilities.
Findings
Improved image ranking by attributes such as size.
Enhanced zero-shot classification performance.
Embeddings exhibit more geometric structure, such as analogies expressed through vector arithmetic.
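One way a ranking by an attribute such as size could be computed from difference-aware embeddings is sketched below. This is an illustrative assumption, not the authors' procedure: `rank_by_difference` is a hypothetical helper, the toy vectors stand in for CLIP image embeddings, and `larger_than` stands in for the text embedding of a comparative description.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_difference(
    image_embs: Sequence[Sequence[float]],
    diff_text_emb: Sequence[float],
) -> List[int]:
    """Rank images by counting pairwise comparisons they win.

    Image i "wins" against image j when the embedding difference
    (image_i - image_j) has a positive inner product with the text
    embedding of the comparative description (hypothetical setup).
    Returns image indices, strongest first.
    """
    n = len(image_embs)
    wins = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = [x - y for x, y in zip(image_embs[i], image_embs[j])]
            if dot(diff, diff_text_emb) > 0:
                wins[i] += 1
    return sorted(range(n), key=lambda i: wins[i], reverse=True)


# Toy stand-ins: the first coordinate plays the role of "size".
toy_images = [[0.2, 1.0], [0.9, 0.0], [0.5, 0.3]]
larger_than = [1.0, 0.0]
ranking = rank_by_difference(toy_images, larger_than)
print(ranking)
```

Because the comparison is linear in the embeddings, this pairwise vote orders images the same way as projecting each embedding onto the text direction; the pairwise form is kept only to mirror the difference-based framing.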
Abstract
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space using vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space, using synthetically generated data with large language models on image-caption paired datasets. We first demonstrate that our approach…
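The objective described in the abstract, matching the difference between two image embeddings to the text embedding of a description of that difference, can be sketched as a CLIP-style contrastive loss. The symmetric cross-entropy form, the L2 normalization of the difference vector, the temperature of 0.07, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def diagonal_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def difference_contrastive_loss(
    img_a: Sequence[Vector],
    img_b: Sequence[Vector],
    diff_text: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Contrast image-embedding differences against difference-text embeddings.

    Pair k consists of two image embeddings and the text embedding of a
    description of how image A differs from image B.
    """
    diffs = [normalize([x - y for x, y in zip(a, b)]) for a, b in zip(img_a, img_b)]
    texts = [normalize(t) for t in diff_text]
    logits = [[dot(d, t) / temperature for t in texts] for d in diffs]
    logits_t = [list(col) for col in zip(*logits)]
    # Symmetric loss: difference-to-text and text-to-difference directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: differences align with their own texts, so the loss is near zero.
img_a = [[1.0, 0.0], [0.0, 1.0]]
img_b = [[0.0, 0.0], [0.0, 0.0]]
aligned = difference_contrastive_loss(img_a, img_b, [[1.0, 0.0], [0.0, 1.0]])
swapped = difference_contrastive_loss(img_a, img_b, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

In an actual finetuning run the vectors would come from CLIP's image and text encoders and the loss would be backpropagated through both; the toy batch here only checks that matched pairs score a lower loss than mismatched ones.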
Peer Reviews
Decision · Submitted to ICLR 2025
The strengths of the paper are as follows:
* The proposed method is simple and can be easily extended as a main objective during CLIP pretraining (not just as a finetuning objective).
* The proposed method shows significant improvement on tasks involving difference-based classification.
The weaknesses of the paper are:
* The main contribution of the paper is limited. Finetuning CLIP on the proposed objective (aligning g(I_1) - g(I_2) with f(T_{1,2})) is a novel contribution, but it is not sufficient on its own. Similarly, the empirical studies in this paper are not very impressive. While the Table 1 results on difference-based classification are impressive, the rest of the results show only marginal improvements over the baselines.
* There seems to be a lot of repetition in the fo
- The paper proposes an effective method to solve the highlighted problem. Specifically, the synthetic data generation using LLMs for any pair of images is reasonable.
- The paper showcases the utility of their method using difference-based classification, zero-shot evaluation, and evaluating the quality of the learned method.
- The paper lacks a good motivation for why CLIP models should exhibit the structure of purely language-based text embeddings. CLIP pretraining never encourages such structure to emerge, so it is not unusual to observe such behavior. In particular, the motivation never justifies why difference understanding is a desirable property.
- The paper’s problem does not seem relevant in the context of large multimodal models. Specifically, we have several models such as LLaVA, Qwen-VL that should be a
- New capabilities: The method enables valuable new capabilities like difference-based classification and comparative prompting while maintaining or improving CLIP's core zero-shot performance.
- Comprehensive empirical validation: The work provides extensive experimental validation across multiple tasks, datasets, and evaluation metrics, including classification, embedding analysis, and generation. The baselines are also very reasonable (fine-tune CLIP on the COCO original and rewrite captions
- Scaling up: The experiments focus on a single CLIP model size (ViT-L/14); it is unclear whether the approach remains useful for larger models, and whether it becomes more or less useful.
- Limited improvements: While the method shows consistent improvements on standard classification tasks, many of the gains are relatively small (1-2% absolute improvement). It is unclear whether the cost is worth such small improvements.
- LLM dependency: The method's core dependency on LLM-gen
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
