CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space
Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh

TL;DR
CLAY introduces a method to adaptively modulate visual similarity in pretrained vision-language models, enabling multi-conditioned image retrieval without retraining.
Contribution
It reframes the embedding space of a pretrained VLM as text-conditional, which allows flexible, efficient retrieval over fixed visual features. It also introduces a synthetic evaluation dataset, CLAY-EVAL.
Findings
CLAY achieves high retrieval accuracy on standard datasets.
CLAY demonstrates notable computational efficiency.
The method supports multiple simultaneous conditions in retrieval.
Abstract
Human perception of visual similarity is inherently adaptive and subjective, depending on the user's interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process from visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset, CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. Experiments on standard datasets and our proposed dataset show that CLAY…
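The abstract does not spell out how the text conditioning is computed, so the following is only a toy sketch of the general idea of separating textual conditioning from fixed visual embeddings. It is not the authors' algorithm. It assumes, purely for illustration, that a condition is represented by a few text embeddings and that similarity is measured after projecting the precomputed image embeddings onto the subspace those text embeddings span. All names (condition_basis, conditional_similarity, rank_gallery) and the 4-dimensional toy vectors are hypothetical stand-ins for real VLM outputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def condition_basis(text_embeddings, eps=1e-9):
    """Gram-Schmidt: orthonormal basis of the span of the condition's text embeddings.
    Assumption: a condition is described by a handful of text embeddings."""
    basis = []
    for v in text_embeddings:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        n = math.sqrt(dot(w, w))
        if n > eps:  # skip vectors already inside the current span
            basis.append([wi / n for wi in w])
    return basis

def conditional_similarity(img_a, img_b, basis):
    """Cosine similarity of two image embeddings restricted to the condition subspace."""
    pa = [dot(img_a, b) for b in basis]
    pb = [dot(img_b, b) for b in basis]
    na, nb = math.sqrt(dot(pa, pa)), math.sqrt(dot(pb, pb))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot(pa, pb) / (na * nb)

def rank_gallery(query, gallery, basis):
    """Gallery names sorted from most to least similar under the condition.
    The gallery embeddings are fixed; only the basis changes per condition."""
    return sorted(gallery,
                  key=lambda name: conditional_similarity(query, gallery[name], basis),
                  reverse=True)

# Toy embeddings (hypothetical): axes 0-1 carry color, axes 2-3 carry shape.
red_circle = [1.0, 0.0, 1.0, 0.0]
gallery = {
    "red_square": [1.0, 0.0, 0.0, 1.0],
    "blue_circle": [0.0, 1.0, 1.0, 0.0],
}

# Hypothetical text embeddings for the prompts of each condition.
color_texts = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
shape_texts = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

color_basis = condition_basis(color_texts)
shape_basis = condition_basis(shape_texts)
# Multiple simultaneous conditions, sketched here as the joint span of both.
both_basis = condition_basis(color_texts + shape_texts)

by_color = rank_gallery(red_circle, gallery, color_basis)
by_shape = rank_gallery(red_circle, gallery, shape_basis)
```

In this toy setup the same fixed image embeddings yield different rankings: under the color condition the red square ranks first for a red-circle query, while under the shape condition the blue circle ranks first. Only the small basis is recomputed when the condition changes, which is one way the efficiency argument in the abstract could play out.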
