CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval
Christian Lülf, Denis Mayr Lima Martins, Marcos Antonio Vaz Salles, Yongluan Zhou, Fabian Gieseke

TL;DR
CLIP-Branches introduces an interactive fine-tuning method for text-image retrieval that improves search relevance by incorporating user feedback, leveraging efficient indexing to maintain fast response times.
Contribution
It presents a novel interactive fine-tuning approach for CLIP-based search engines, enhancing accuracy without sacrificing speed through efficient indexing.
Findings
Improved relevance and accuracy in search results after fine-tuning
Maintains swift response times with efficient index structures
Enhances traditional CLIP-based retrieval with user-guided refinement
Abstract
The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive…
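The joint-embedding search mentioned in the abstract can be illustrated with a minimal sketch. It assumes that text and images have already been mapped to vectors in a shared space, and it ranks images by cosine similarity to a text query. The three-dimensional vectors, the file names, and the helper names `cosine_similarity` and `top_k` are hypothetical placeholders, not part of CLIP-Branches. A real system would obtain embeddings from a CLIP model and would likely use an index structure rather than the exhaustive scan shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Return the k item names whose embeddings are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda name: cosine_similarity(query, index[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for image embeddings (assumption: real ones come from a CLIP model).
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.7, 0.6, 0.1],
    "car.jpg": [0.0, 0.1, 0.9],
}

# Hypothetical embedding of a text query in the same space.
text_query = [1.0, 0.2, 0.0]

results = top_k(text_query, image_index, k=2)
print(results)
```

The exhaustive scan compares the query against every stored vector, which is exact but scales linearly with the collection size. That cost is why approximate nearest neighbor search and index structures matter for large image collections.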
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
