General Image Descriptors for Open World Image Retrieval using ViT CLIP
Marcos V. Conde, Ivan Aerlic, Simon Jégou

TL;DR
This paper presents a fine-tuning approach for Vision Transformers (ViT) pre-trained with CLIP, aimed at improving open-world image retrieval across diverse domains and demonstrated on the GUIE Challenge.
Contribution
The work introduces a set of techniques for effectively fine-tuning pre-trained ViT models for multi-domain image retrieval tasks.
Findings
Achieved 4th place in the GUIE Challenge.
Developed effective fine-tuning tricks for ViT CLIP models.
Enhanced performance in open-world image retrieval.
Abstract
The Google Universal Image Embedding (GUIE) Challenge is one of the first competitions in multi-domain image representations in the wild, covering a wide distribution of objects: landmarks, artwork, food, etc. This is a fundamental computer vision problem with notable applications in image retrieval, search engines and e-commerce. In this work, we explain our 4th place solution to the GUIE Challenge, and our "bag of tricks" to fine-tune zero-shot Vision Transformers (ViT) pre-trained using CLIP.
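To make the retrieval setting concrete, here is a minimal sketch of how image embeddings are typically used for retrieval: embeddings are L2-normalized and gallery items are ranked by cosine similarity to the query. The three-dimensional vectors, the `index` dictionary, and the `retrieve` helper are hypothetical stand-ins for illustration; in practice the embeddings would come from the fine-tuned ViT, and the abstract does not state which similarity measure the authors use.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def retrieve(query, index, k=2):
    """Return the labels of the k gallery embeddings most similar to the query.

    `index` maps a label to its embedding (hypothetical toy data here).
    """
    q = l2_normalize(query)
    scored = []
    for label, emb in index.items():
        e = l2_normalize(emb)
        similarity = sum(a * b for a, b in zip(q, e))
        scored.append((similarity, label))
    # Highest cosine similarity first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy gallery: one made-up embedding per domain named in the abstract.
index = {
    "landmark": [0.9, 0.1, 0.0],
    "artwork": [0.1, 0.9, 0.1],
    "food": [0.0, 0.2, 0.9],
}

top = retrieve([0.8, 0.3, 0.0], index, k=2)
print(top)
```

A single embedding space shared across domains is what lets one index serve landmark, artwork, and food queries alike; the toy gallery above only imitates that with hand-picked vectors.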
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Medical Image Segmentation Techniques
Methods: Cosine Annealing · Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Dropout · Weight Decay · Byte Pair Encoding · Softmax · Dense Connections
