Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval
Zelong Sun, Dong Jing, Guoxing Yang, Nanyi Fei, Zhiwu Lu

TL;DR
This paper introduces CIR-LVLM, a novel framework that uses large vision-language models as user intent-aware encoders to improve composed image retrieval by better understanding user intent and visual information, achieving state-of-the-art results.
Contribution
The paper proposes a new framework leveraging LVLMs with a hybrid intent instruction module for improved comprehension of user intent in CIR tasks.
Findings
Achieves state-of-the-art performance on three benchmarks.
Effectively captures user intent with hybrid instruction guidance.
Demonstrates the potential of LVLMs in CIR applications.
Abstract
Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLMs for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent…
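The retrieval step that the abstract describes (one composed query scored against a candidate set) can be illustrated with a minimal sketch. This is not the CIR-LVLM method: the embeddings below are hand-written toy vectors standing in for whatever an encoder would produce, cosine similarity is an assumed scoring function, and the names `cosine` and `rank_candidates` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine(query_embedding, c) for c in candidate_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-in for the embedding of (reference image + relative caption).
query = [1.0, 0.0, 1.0]

# Toy stand-ins for the embeddings of three candidate images.
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.9],  # nearly parallel to the query
    [1.0, 1.0, 1.0],  # partially aligned with the query
]

ranking = rank_candidates(query, candidates)
print(ranking)
```

In a real system the toy vectors would be replaced by the outputs of a query encoder and an image encoder; the ranking logic itself would stay the same under the cosine-similarity assumption.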
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
