Training-free Conditional Image Embedding Framework Leveraging Large Vision Language Models

Masayuki Kawarada; Kosuke Yamada; Antonio Tejero-de-Pablos; Naoto Inoue

arXiv:2512.21860 · cs.CV · December 29, 2025

TL;DR

DIOR is a training-free framework that uses large vision-language models to generate conditional image embeddings based on textual conditions, outperforming existing methods without additional training.

Contribution

We introduce DIOR, a novel training-free approach that prompts an LVLM to describe an image in a single word related to a given condition and uses the resulting hidden state as a conditional image embedding, eliminating the need for task-specific training.

Findings

01. DIOR outperforms existing training-free baselines in conditional image similarity tasks.

02. DIOR surpasses methods requiring additional training across various settings.

03. The approach is versatile and applicable to any image and condition without extra training.

Abstract

Conditional image embeddings are feature representations that focus on the specific aspect of an image indicated by a given textual condition (e.g., color, genre). Producing such embeddings has been a challenging problem. Although recent vision foundation models, such as CLIP, offer rich representations of images, they are not designed to focus on a specified condition. In this paper, we propose DIOR, a method that leverages a large vision-language model (LVLM) to generate conditional image embeddings. DIOR is a training-free approach that prompts the LVLM to describe an image with a single word related to a given condition. The hidden state vector of the LVLM's last token is then extracted as the conditional image embedding. DIOR provides a versatile solution that can be applied to any image and condition without additional training or task-specific priors. Comprehensive experimental results on conditional…
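The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes the LVLM has already been run on the image plus prompt and that its final-layer hidden states are available as a list of per-token vectors. The prompt wording in `build_prompt`, the function names, and the use of cosine similarity for comparing embeddings are all illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def build_prompt(condition: str) -> str:
    # Hypothetical template: the paper's exact prompt wording is not given here.
    return f"Describe the image in one word in terms of {condition}:"


def last_token_embedding(
    hidden_states: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Return the hidden state at the last non-padded token position.

    hidden_states: one vector per token (assumed to come from the LVLM's final layer).
    attention_mask: 1 for real tokens, 0 for padding.
    """
    if len(hidden_states) != len(attention_mask):
        raise ValueError("hidden_states and attention_mask must have equal length")
    real_positions = [i for i, m in enumerate(attention_mask) if m == 1]
    if not real_positions:
        raise ValueError("attention_mask contains no real tokens")
    return list(hidden_states[real_positions[-1]])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed comparison metric for two conditional embeddings.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


# Toy hidden states standing in for real LVLM outputs (3 tokens, last one is padding).
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
embedding = last_token_embedding(toy_hidden, toy_mask)
```

In a real pipeline, the toy vectors would be replaced by hidden states produced by an actual LVLM for the image and the condition-specific prompt. Two images would then be compared under the same condition by passing their embeddings to `cosine_similarity`.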

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning