Enhancing Supervised Composed Image Retrieval via Reasoning-Augmented Representation Engineering
Jun Li, Hongjian Dou, Zhenyu Zhang, Kai Li, Shaoguo Liu, Tingting Gao

TL;DR
This paper introduces PMTFR, a framework that enhances supervised Composed Image Retrieval by combining a pyramid matching model with reasoning-inspired representations, improving retrieval results through a refinement step that requires no additional training.
Contribution
It proposes a training-free refinement method using representation engineering and a Pyramid Patcher to improve visual understanding in CIR models.
Findings
Outperforms state-of-the-art on CIR benchmarks
Effective in supervised CIR without extra training
Enhances visual understanding through pyramid matching
Abstract
Composed Image Retrieval (CIR) is challenging because it requires jointly understanding a reference image and a textual modification instruction to find relevant target images. Some existing methods use a two-stage approach to further refine retrieval results, but this often requires additional training of a ranking model. Despite the success of Chain-of-Thought (CoT) techniques in reducing training costs for language models, their application to CIR tasks remains limited: existing approaches either compress visual information into text or rely on elaborate prompt designs. Moreover, existing works apply CoT only to zero-shot CIR, as it is challenging to achieve satisfactory results in supervised CIR with a well-trained model. In this work, we propose a framework that includes the Pyramid Matching Model with Training-Free Refinement (PMTFR) to address these challenges. Through a…
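The generic two-stage idea mentioned in the abstract (coarse retrieval followed by refinement of the top candidates) can be illustrated with a minimal sketch. This is not the PMTFR method: the embeddings are toy vectors, and `refine_score` is a hypothetical placeholder for whatever second-stage scorer is used; the function and parameter names are assumptions made for illustration only.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_refine(
    query: Vector,
    gallery: List[Vector],
    refine_score: Callable[[int], float],  # hypothetical second-stage scorer
    k: int = 3,
) -> List[int]:
    """Return gallery indices: top-k by cosine similarity, then re-ordered."""
    # Stage 1: coarse retrieval over the whole gallery.
    coarse = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )[:k]
    # Stage 2: refinement only re-orders the k shortlisted candidates.
    return sorted(coarse, key=refine_score, reverse=True)


# Toy data: a composed-query embedding and four candidate image embeddings.
query = [1.0, 0.0]
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
# Assumed second-stage scores, keyed by gallery index.
toy_scores = {0: 0.2, 1: 0.9, 2: 1.0, 3: 0.5}

ranked = retrieve_then_refine(query, gallery, toy_scores.__getitem__, k=3)
print(ranked)
```

In this toy setup, candidate 2 has the highest second-stage score but is never re-ranked because it does not survive the first stage, which shows why the quality of the initial retrieval bounds what any refinement step can recover.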
Taxonomy
Topics: Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
