FOR: Finetuning for Object Level Open Vocabulary Image Retrieval
Hila Levi, Guy Heller, Dan Levi

TL;DR
This paper introduces FOR, a finetuning method for object-level open vocabulary image retrieval. It adapts a pre-trained CLIP model to a target dataset using a specialized decoder variant of the CLIP head and multi-objective training, and it outperforms state-of-the-art methods in accuracy.
Contribution
The paper presents a novel finetuning approach for CLIP models that improves open vocabulary image retrieval accuracy by using a specialized decoder and multi-objective training.
Findings
Up to 8 mAP@50 points of improvement over the state of the art (SOTA)
Effective semi-supervised performance with limited labels
Significant accuracy gains across three datasets
Abstract
As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8…
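The retrieval setting and the mAP@50 metric mentioned above can be illustrated with a minimal sketch. This is not the FOR method or its decoder: it only shows the generic idea of ranking images by cosine similarity to a text-query embedding and scoring the ranking with average precision over the top k results. The toy 2-D embeddings, the function names, and the normalization by min(k, number of relevant images) are all assumptions made for illustration; the paper's exact metric definition may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_images(text_emb, image_embs):
    """Return image indices sorted by descending similarity to the query."""
    scores = [cosine(text_emb, e) for e in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)

def average_precision_at_k(ranked, relevant, k=50):
    """AP over the top-k results.

    Assumption: normalized by min(k, number of relevant images);
    other normalizations are also in common use.
    """
    hits, total = 0, 0.0
    for rank, idx in enumerate(ranked[:k], start=1):
        if idx in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    denom = min(k, len(relevant))
    return total / denom if denom else 0.0

# Hypothetical toy embeddings standing in for CLIP text/image features.
text_emb = [1.0, 0.0]
image_embs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]

ranked = rank_images(text_emb, image_embs)
ap = average_precision_at_k(ranked, relevant={0, 1}, k=50)
print(ranked, round(ap, 4))
```

Averaging this per-query value over all text queries would give a mean average precision; in a real pipeline the embeddings would come from the (finetuned) vision-language model rather than hand-written vectors.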
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
