FOR: Finetuning for Object Level Open Vocabulary Image Retrieval

Hila Levi; Guy Heller; Dan Levi

arXiv:2412.18806·cs.CV·December 30, 2024

Open Access

TL;DR

This paper introduces FOR, a finetuning method for object-level open vocabulary image retrieval that enhances accuracy by adapting pre-trained CLIP models with specialized decoders and multi-objective training, outperforming state-of-the-art methods.

Contribution

The paper presents a novel finetuning approach for CLIP models that improves open vocabulary image retrieval accuracy by using a specialized decoder and multi-objective training.

Findings

01 Up to 8 mAP@50 improvement over SOTA

02 Effective semi-supervised performance with limited labels

03 Significant accuracy gains across three datasets

Abstract

As working with large datasets becomes standard, the task of accurately retrieving images containing objects of interest by an open set textual query gains practical importance. The current leading approach utilizes a pre-trained CLIP model without any adaptation to the target domain, balancing accuracy and efficiency through additional post-processing. In this work, we propose FOR: Finetuning for Object-centric Open-vocabulary Image Retrieval, which allows finetuning on a target dataset using closed-set labels while keeping the visual-language association crucial for open vocabulary retrieval. FOR is based on two design elements: a specialized decoder variant of the CLIP head customized for the intended task, and its coupling within a multi-objective training framework. Together, these design choices result in a significant increase in accuracy, showcasing improvements of up to 8…

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training