MLLM-Guided VLM Fine-Tuning with Joint Inference for Zero-Shot Composed Image Retrieval

Rong-Cheng Tu; Zhao Jin; Jingyi Liao; Xiao Luo; Yingjie Wang; Li Shen; Dacheng Tao

arXiv:2505.19707·cs.CV·May 27, 2025

Open Access

TL;DR

This paper introduces MVFT-JI, a fine-tuning method for zero-shot composed image retrieval (ZS-CIR) that uses a pretrained multimodal large language model (MLLM) to guide the fine-tuning of a vision-language model (VLM), with the aim of improving semantic understanding of composed queries and retrieval accuracy.

Contribution

The paper proposes a joint training approach that uses MLLM-guided tasks to strengthen a VLM's compositional retrieval capabilities without requiring labeled data.

Findings

01. Improved retrieval performance on complex visual transformations.
02. Effective use of unlabeled images for training.
03. Enhanced semantic alignment between queries and images.

Abstract

Existing Zero-Shot Composed Image Retrieval (ZS-CIR) methods typically train adapters that convert reference images into pseudo-text tokens, which are concatenated with the modifying text and processed by frozen text encoders in pretrained VLMs or LLMs. While this design leverages the strengths of large pretrained models, it only supervises the adapter to produce encoder-compatible tokens that loosely preserve visual semantics. Crucially, it does not directly optimize the composed query representation to capture the full intent of the composition or to align with the target semantics, thereby limiting retrieval performance, particularly in cases involving fine-grained or complex visual transformations. To address this problem, we propose MLLM-Guided VLM Fine-Tuning with Joint Inference (MVFT-JI), a novel approach that leverages a pretrained multimodal large language model (MLLM) to…
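The adapter-based design that the abstract attributes to existing ZS-CIR methods can be illustrated with a toy sketch. This is not the paper's MVFT-JI method and not any specific published baseline: the linear adapter, the mean-pooling stand-in for a frozen text encoder, the embedding width, and every function name below are assumptions chosen only to show the data flow (image feature, then pseudo-text token, then concatenation with the modifier text, then frozen encoding, then similarity ranking).

```python
import math
import random

random.seed(0)
DIM = 8  # toy embedding width (assumption, not from the paper)


def rand_vec(dim):
    """Random vector standing in for a learned or extracted feature."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def normalize(v):
    """L2-normalize a vector; guard against a zero norm."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def adapter(image_feature, weight):
    """Hypothetical trainable adapter: a linear map from an image
    feature to a single pseudo-text token embedding."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weight]


def frozen_text_encoder(token_embeddings):
    """Stand-in for a frozen text encoder: mean-pool the tokens,
    then L2-normalize. A real encoder would be a transformer."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    pooled = [sum(tok[i] for tok in token_embeddings) / n for i in range(dim)]
    return normalize(pooled)


def compose_query(image_feature, text_token_embeddings, weight):
    """Concatenate the pseudo-token with the modifier-text tokens and
    encode the sequence with the frozen encoder."""
    pseudo_token = adapter(image_feature, weight)
    return frozen_text_encoder([pseudo_token] + text_token_embeddings)


def rank(query, gallery):
    """Return gallery indices sorted by descending dot-product score."""
    scores = [sum(q * g for q, g in zip(query, cand)) for cand in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


weight = [rand_vec(DIM) for _ in range(DIM)]       # adapter parameters
reference_image = rand_vec(DIM)                    # reference image feature
modifier_tokens = [rand_vec(DIM) for _ in range(3)]  # modifying-text tokens
gallery = [normalize(rand_vec(DIM)) for _ in range(5)]  # candidate images

query = compose_query(reference_image, modifier_tokens, weight)
ranking = rank(query, gallery)
print(ranking)
```

In this sketch only `weight` would receive gradients during training, which mirrors the limitation the abstract points out: the adapter is supervised to emit encoder-compatible tokens, while the composed query embedding itself is never directly optimized against the target image.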

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Enhancement Techniques · Advanced Image Fusion Techniques

Methods: Adapter · ALIGN