STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
Miaoge Li, Dongsheng Wang, Zening Sun, Jinsen Zhang, Wenhan Luo, Jingcai Guo

TL;DR
This paper introduces STiTch, a training-free framework for zero-shot composed image retrieval (CIR) that refines LLM-generated captions via semantic transition and models set-to-set alignment to improve multimodal retrieval accuracy.
Contribution
The paper proposes a framework that combines semantic transition (refining the composed caption) with transportation (set-to-set alignment in embedding space), improving zero-shot composed image retrieval without any training.
Findings
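Refines LLM-generated captions so that they focus on the core modification rather than on incidental details of the reference image.
Models set-to-set alignment as a distribution matching problem, replacing point-to-point alignment at the retrieval stage.
Reports superior performance on multiple CIR benchmarks.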
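To make the "distribution matching" finding concrete, here is a minimal sketch of one common way to compare two sets of embeddings: entropic optimal transport solved with Sinkhorn iterations. This is an illustration under stated assumptions, not the paper's implementation. The function name `sinkhorn_cost`, the cosine cost, the uniform weights, and the values of `eps` and `n_iters` are all hypothetical choices made for this example.

```python
import numpy as np

def sinkhorn_cost(X, Y, eps=0.1, n_iters=200):
    """Entropic OT cost between two embedding sets (illustrative sketch).

    Assumptions (not taken from the paper): cosine cost, uniform weights
    on both sets, fixed regularization eps.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # L2-normalize rows so that the dot product is cosine similarity.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - X @ Y.T                      # pairwise cosine cost, in [0, 2]
    a = np.full(X.shape[0], 1.0 / X.shape[0])  # uniform mass on set X
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])  # uniform mass on set Y
    K = np.exp(-C / eps)                   # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # scale columns toward marginal b
        u = a / (K @ v)                    # scale rows toward marginal a
    P = u[:, None] * K * v[None, :]        # transport plan
    return float((P * C).sum()), P

# Hypothetical usage: score each candidate's set of embeddings against
# the query's set, then rank candidates by ascending transport cost.
query_set = np.array([[1.0, 0.0], [0.0, 1.0]])
candidate_set = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, plan = sinkhorn_cost(query_set, candidate_set)
```

Unlike a single dot product between one query vector and one candidate vector, the transport plan lets every element of one set be matched softly to elements of the other, which is the general idea behind set-to-set alignment.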
Abstract
Training-free zero-shot composed image retrieval (CIR) models have recently attracted growing research interest due to their generalizability and flexibility in unseen multimodal retrieval. Recent LLM-based advances focus on generating the expected target caption by exploiting the compositional ability of LLMs. Although efficient, we find that 1) the generated captions tend to introduce unexpected features from the reference image due to the semantic gap between the input image and the text modification, since the image contains far more detail than the text; and 2) the point-to-point alignment used during the retrieval stage fails to capture diverse compositions. To address these challenges, we introduce a novel Semantic Transition and Transportation in collaboration framework for training-free zero-shot CIR tasks. Specifically, given the composed caption inferred by an LLM, we aim to refine it…
