Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models
Masayuki Kawarada, Kosuke Yamada, Antonio Tejero-de-Pablos, Naoto Inoue

TL;DR
DIOR is a training-free framework that uses large vision-language models to generate conditional image embeddings based on textual conditions, outperforming existing methods without additional training.
Contribution
We introduce DIOR, a novel training-free approach that leverages LVLMs to produce conditional image embeddings by prompting for a single-word description of the image under a given condition, eliminating the need for task-specific training.
Findings
DIOR outperforms existing training-free baselines in conditional image similarity tasks.
DIOR surpasses methods requiring additional training across various settings.
The approach is versatile and applicable to any image and condition without extra training.
Abstract
Conditional image embeddings are feature representations that focus on specific aspects of an image indicated by a given textual condition (e.g., color, genre). Producing such embeddings has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional…
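The pipeline described in the abstract (prompt for one word, take the last token's hidden state, compare embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation. The prompt wording in `build_prompt`, the `encode` callback signature, the choice of cosine similarity, and the `toy_encode` stand-in are all assumptions made for illustration; a real setup would replace `toy_encode` with a call to an actual LVLM that returns per-token hidden states.

```python
import math
from typing import Callable, List, Sequence

# Assumed type: an encoder takes (image, prompt) and returns one hidden-state
# vector per token. A real LVLM wrapper would go here; this is hypothetical.
Encoder = Callable[[object, str], Sequence[Sequence[float]]]


def build_prompt(condition: str) -> str:
    # Hypothetical template: the exact prompt used by DIOR is not given in
    # this summary, only that it asks for a single word about the condition.
    return f"Describe the image in one word in terms of {condition}:"


def last_token_embedding(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    # The abstract states that the last token's hidden state is the embedding.
    return list(hidden_states[-1])


def conditional_embedding(image: object, condition: str, encode: Encoder) -> List[float]:
    return last_token_embedding(encode(image, build_prompt(condition)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed similarity measure for comparing two conditional embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_encode(image: dict, prompt: str) -> List[List[float]]:
    # Toy stand-in for an LVLM: the "image" is a dict mapping a one-word
    # condition to a vector, and the last token's state is that vector.
    words = prompt.rstrip(":").split()
    condition = words[-1]
    rows = [[0.0, 0.0] for _ in words[:-1]]
    rows.append(list(image[condition]))
    return rows


# Two toy "images" that share a color but differ in shape.
red_ball = {"color": [1.0, 0.0], "shape": [0.0, 1.0]}
red_box = {"color": [1.0, 0.0], "shape": [1.0, 0.0]}

same_color = cosine_similarity(
    conditional_embedding(red_ball, "color", toy_encode),
    conditional_embedding(red_box, "color", toy_encode),
)
same_shape = cosine_similarity(
    conditional_embedding(red_ball, "shape", toy_encode),
    conditional_embedding(red_box, "shape", toy_encode),
)
```

With the toy encoder, the two images are maximally similar under the "color" condition and dissimilar under "shape", which is the behavior a conditional embedding is meant to provide: the same pair of images can be close or far depending on the condition supplied.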
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
