Embedding Arithmetic of Multimodal Queries for Image Retrieval
Guillaume Couairon, Matthieu Cord, Matthijs Douze, Holger Schwenk

TL;DR
This paper investigates the geometric properties of multimodal embeddings for image retrieval using text transformations, introduces the SIMAT dataset for evaluation, and demonstrates that finetuning improves transformation capabilities.
Contribution
It introduces the SIMAT dataset for evaluating multimodal image retrieval with text transformations and analyzes how finetuning enhances embedding space properties.
Findings
Vanilla CLIP embeddings have limited ability to support transformations applied as delta vectors.
Finetuning on the COCO dataset significantly improves transformation performance.
Pretrained sentence encoders have varying impacts on embedding quality.
Abstract
Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations were not demonstrated on image representations. Recent works aiming at bridging this semantic gap embed images and text into a multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of Image Retrieval with Multimodal queries. SIMAT contains 6k images and 18k textual transformation queries that aim at either replacing scene elements or changing pairwise relationships between scene elements. The goal is to retrieve an image consistent with the (source image, text transformation) query. We use an image/text matching oracle (OSCAR) to assess whether the image transformation is successful. The SIMAT dataset will be publicly available. We use…
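The retrieval task described in the abstract can be illustrated with a small sketch of delta-vector arithmetic in a shared image/text space. This is a toy illustration, not the authors' implementation. The function names, the scaling factor `lam`, and the hand-built 4-dimensional embeddings are all assumptions made for the example. In practice the embeddings would come from a multimodal encoder such as CLIP.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def transform_query(image_emb, source_text_emb, target_text_emb, lam=1.0):
    """Shift an image embedding by a text-defined delta vector.

    `lam` is a hypothetical scaling factor for the delta, not a value
    taken from the paper.
    """
    delta = target_text_emb - source_text_emb
    return l2_normalize(image_emb + lam * delta)

def retrieve(query_emb, database_embs, k=1):
    """Return indices of the k database items most similar to the query."""
    scores = l2_normalize(database_embs) @ query_emb
    return np.argsort(-scores)[:k]

# Toy concept vectors standing in for encoder outputs (assumption).
cat, dog, sofa, grass = np.eye(4)

# Toy "image" database: each image embedding is the sum of its scene elements.
labels = ["cat on sofa", "dog on sofa", "cat on grass", "dog on grass"]
database = l2_normalize(np.stack([cat + sofa, dog + sofa, cat + grass, dog + grass]))

# Query: source image "cat on sofa" with the text transformation cat -> dog.
source_image = database[0]
query = transform_query(source_image, cat, dog, lam=1.0)
top = retrieve(query, database, k=1)
print(labels[top[0]])
```

In this toy space the query lands closest to the "dog on sofa" image, because the delta removes the cat direction and adds the dog direction while leaving the sofa component untouched. Real embedding spaces are not this cleanly disentangled, which is the limitation of vanilla CLIP embeddings that the summary above points to.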
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Contrastive Language-Image Pre-training
