Leveraging Large Vision-Language Model as User Intent-aware Encoder for Composed Image Retrieval

Zelong Sun; Dong Jing; Guoxing Yang; Nanyi Fei; Zhiwu Lu

arXiv:2412.11087·cs.IR·December 17, 2024

TL;DR

This paper introduces CIR-LVLM, a novel framework that uses large vision-language models as user intent-aware encoders to improve composed image retrieval by better understanding user intent and visual information, achieving state-of-the-art results.

Contribution

The paper proposes a new framework leveraging LVLMs with a hybrid intent instruction module for improved comprehension of user intent in CIR tasks.

Findings

01 Achieves state-of-the-art performance on three benchmarks.

02 Effectively captures user intent with hybrid instruction guidance.

03 Demonstrates the potential of LVLMs in CIR applications.

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images from a candidate set using a hybrid-modality query consisting of a reference image and a relative caption that describes the user intent. Recent studies attempt to utilize Vision-Language Pre-training Models (VLPMs) with various fusion strategies to address the task. However, these methods typically fail to simultaneously meet two key requirements of CIR: comprehensively extracting visual information and faithfully following the user intent. In this work, we propose CIR-LVLM, a novel framework that leverages a large vision-language model (LVLM) as a powerful user intent-aware encoder to better meet these requirements. Our motivation is to explore the advanced reasoning and instruction-following capabilities of LVLMs for accurately understanding and responding to the user intent. Furthermore, we design a novel hybrid intent…
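
To make the retrieval setup concrete, the following is a minimal sketch of the ranking step only, not the CIR-LVLM method itself. It assumes that the composed query (reference image plus relative caption) and every candidate image have already been encoded into fixed-length vectors, and that candidates are ranked by cosine similarity; the abstract does not state the similarity measure, so that choice is an assumption. The names `cosine`, `rank_candidates`, and the toy embeddings are hypothetical and purely illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate ids sorted from most to least similar to the query."""
    scores = {
        cid: cosine(query_embedding, emb)
        for cid, emb in candidate_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Placeholder vectors standing in for encoder outputs (made-up values).
# In a real system the query vector would come from encoding the
# reference image together with the relative caption.
query = [1.0, 0.0, 1.0]
candidates = {
    "img_a": [1.0, 0.0, 0.9],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.5],
}

ranking = rank_candidates(query, candidates)
print(ranking)
```

In this sketch the quality of the ranking depends entirely on how well the query vector reflects both the visual content of the reference image and the modification requested in the caption, which are the two requirements the abstract highlights.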

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques

Methods: Sparse Evolutionary Training