Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval
Tong Wang, Yunhan Zhao, Shu Kong

TL;DR
This paper introduces Paracosm, a training-free zero-shot method for composed image retrieval that generates a 'mental image' directly from multimodal queries to improve matching accuracy, outperforming existing methods.
Contribution
Proposes a novel zero-shot CIR approach that directly generates 'mental images' from multimodal queries, bypassing traditional textual descriptions and domain gaps.
Findings
Achieves state-of-the-art zero-shot CIR performance.
Outperforms existing methods on challenging benchmarks.
Demonstrates effectiveness of direct mental image generation.
Abstract
Composed Image Retrieval (CIR) is the task of retrieving a target image from a database using a multimodal query, which consists of a reference image and a modification text. The text specifies how to alter the reference image to form a "mental image", based on which CIR should find the target image in the database. The fundamental challenge of CIR is that this "mental image" is not physically available and is only implicitly defined by the query. The contemporary literature pursues zero-shot methods and uses a Large Multimodal Model (LMM) to generate a textual description for a given multimodal query, and then employs a Vision-Language Model (VLM) for textual-visual matching to search for the target image. In contrast, we address CIR from first principles by directly generating the "mental image" for more accurate matching. Particularly, we prompt an LMM to generate a "mental…
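The matching step described in the abstract can be illustrated with a minimal sketch. It assumes that the generated "mental image" and every database image have already been mapped to embedding vectors by some encoder, and that candidates are ranked by cosine similarity. The function names, the toy vectors, and the choice of cosine similarity are illustrative assumptions, not details confirmed by the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_database(query_embedding, database_embeddings):
    """Return database indices ordered from most to least similar to the query."""
    scored = [(cosine(query_embedding, emb), idx)
              for idx, emb in enumerate(database_embeddings)]
    scored.sort(key=lambda pair: -pair[0])  # highest similarity first
    return [idx for _, idx in scored]

# Hypothetical embeddings: in practice these would come from an image encoder
# applied to the generated "mental image" and to each database image.
mental_image_embedding = [0.9, 0.1, 0.0]
database = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.7],
]

ranking = rank_database(mental_image_embedding, database)
print(ranking)
```

In the text-based pipeline the abstract contrasts against, the query vector would instead come from a text encoder applied to the LMM-generated description, so the comparison would cross modalities rather than stay image-to-image.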
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
