Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval
Mingyu Liu, Sihan Huang, Yijia Fan, Yinlin Yan, Quan Zhang, Jian-Fang Hu, Jianhuang Lai

TL;DR
This paper introduces DeCIR, a method that decouples endpoint learning from semantic transition learning in zero-shot composed image retrieval (ZS-CIR), improving retrieval performance without adding inference complexity.
Contribution
DeCIR proposes a novel decoupling approach with separate training of endpoint and transition modules, addressing semantic transition bottlenecks in projection-based ZS-CIR.
Findings
DeCIR consistently outperforms existing projection-based ZS-CIR methods.
The method improves retrieval accuracy on multiple datasets.
DeCIR maintains low inference complexity while enhancing performance.
Abstract
Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples…
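To make the distinction between endpoint-level matching and a source-conditioned transition concrete, here is a minimal toy sketch. It is not DeCIR's formulation, which the abstract excerpt above does not specify. The function names, the three-dimensional embeddings, and the choice of cosine similarity over a difference vector are all illustrative assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def subtract(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def endpoint_score(query, candidate):
    """Endpoint-level matching: compare the query with the candidate only."""
    return cosine(query, candidate)

def transition_score(source, candidate, edit_direction):
    """Transition-level matching (assumed form): compare the displacement
    from source to candidate with the direction implied by the edit text."""
    return cosine(subtract(candidate, source), edit_direction)

# Hypothetical toy embeddings, chosen by hand for illustration.
source = [1.0, 0.0, 0.0]          # reference image
target = [1.0, 1.0, 0.0]          # reference content plus the edit
distractor = [0.0, 1.0, 0.0]      # edit attribute without the reference content
edit_direction = [0.0, 1.0, 0.0]  # embedding of the edit text

# A query dominated by the edit text behaves like a target-side attribute cue.
attribute_query = edit_direction

endpoint_prefers_distractor = (
    endpoint_score(attribute_query, distractor)
    > endpoint_score(attribute_query, target)
)
transition_prefers_target = (
    transition_score(source, target, edit_direction)
    > transition_score(source, distractor, edit_direction)
)
```

In this toy setup, the attribute-only query ranks the distractor above the true target under endpoint matching, while the transition score, which conditions on the source, ranks the true target first.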
