Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

Mingyu Liu; Sihan Huang; Yijia Fan; Yinlin Yan; Quan Zhang; Jian-Fang Hu; Jianhuang Lai

arXiv:2605.08389·cs.CV·May 22, 2026

TL;DR

This paper introduces DeCIR, a method that decouples endpoint learning from semantic transition learning in zero-shot composed image retrieval (ZS-CIR), improving on existing projection-based methods without adding inference complexity.

Contribution

DeCIR trains endpoint and transition modules separately, addressing the semantic transition bottleneck in projection-based ZS-CIR.

Findings

1. DeCIR consistently outperforms existing projection-based ZS-CIR methods.
2. The method improves retrieval accuracy on multiple datasets.
3. DeCIR maintains low inference complexity while enhancing performance.

Abstract

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples…
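
To make the retrieval setting concrete, here is a minimal sketch of the generic final step in composed image retrieval: ranking gallery images by cosine similarity to a composed query embedding. It is not DeCIR's method. How the composed query is built from the reference image and the edit text is method-specific and is not shown. The function names (`l2_normalize`, `cosine`, `rank_gallery`) and the toy 3-dimensional vectors are hypothetical stand-ins for real encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; return a copy unchanged if it is all zeros."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def rank_gallery(query_emb, gallery_embs):
    """Return gallery indices sorted by descending cosine similarity to the query."""
    scores = [cosine(query_emb, g) for g in gallery_embs]
    return sorted(range(len(gallery_embs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumed values, for illustration only).
# In a real system the query would encode "reference image + edit text"
# and each gallery row would be an image embedding.
composed_query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]

ranking = rank_gallery(composed_query, gallery)
print(ranking)
```

The top-ranked index is the retrieved target candidate; metrics such as recall at k would be computed from this ordering.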
