TL;DR
G-MIXER is a training-free method that enhances zero-shot composed image retrieval by generating diverse candidate features through geodesic mixup and re-ranking with explicit semantics, achieving state-of-the-art results.
Contribution
It introduces a novel geodesic mixup technique for implicit semantic expansion and a re-ranking strategy using explicit semantics, all without additional training.
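As a rough illustration of what a geodesic mixup between two features might look like, the sketch below uses spherical linear interpolation (slerp) between L2-normalized embeddings. This is an assumption, not the paper's confirmed formulation: the summary does not give the exact equation, and the function names `geodesic_mixup` and `mixup_candidates`, the mixing ratios, and the toy vectors are all hypothetical.

```python
import numpy as np


def geodesic_mixup(u, v, lam):
    """Interpolate between u and v along the great-circle arc on the unit sphere.

    Assumed form (standard slerp); lam=0 returns normalized u, lam=1 normalized v.
    """
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    # Angle between the two unit vectors; clip guards against rounding error.
    theta = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if theta < 1e-6:
        # Nearly identical directions: the arc has no length, return u.
        return u
    # Note: for exactly antipodal vectors the geodesic is not unique;
    # this sketch does not handle that case.
    return (np.sin((1.0 - lam) * theta) * u + np.sin(lam * theta) * v) / np.sin(theta)


def mixup_candidates(image_feat, text_feat, ratios):
    """Build one mixed feature per ratio (hypothetical helper for illustration)."""
    return np.stack([geodesic_mixup(image_feat, text_feat, r) for r in ratios])


# Toy 2-D features standing in for image and text embeddings.
image_feat = np.array([1.0, 0.0])
text_feat = np.array([0.0, 2.0])
candidates = mixup_candidates(image_feat, text_feat, [0.25, 0.5, 0.75])
print(candidates)
```

Unlike plain linear mixing, every interpolated feature stays on the unit sphere, so cosine-similarity scores against gallery images remain comparable across mixing ratios.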
Findings
Achieves state-of-the-art performance on multiple ZS-CIR benchmarks.
Effectively balances retrieval diversity and accuracy.
Does not require additional training or fine-tuning.
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images by integrating a reference image with a corresponding modification text. CIR requires jointly considering the explicit semantics specified in the query and the implicit semantics embedded within its bi-modal composition. Recent training-free Zero-Shot CIR (ZS-CIR) methods leverage Multimodal Large Language Models (MLLMs) to generate detailed target descriptions, converting the implicit information into explicit textual expressions. However, these methods rely heavily on the textual modality and fail to capture the fuzzy retrieval nature that requires considering diverse combinations of candidates. This leads to reduced diversity and accuracy in retrieval results. To address this limitation, we propose a novel training-free method, Geodesic Mixup-based Implicit semantic eXpansion and Explicit semantic Re-ranking for ZS-CIR…
