FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data
Peng Yuan, Bingyin Mei, Hui Zhang

TL;DR
FashionMV introduces a large-scale multi-view fashion dataset and a novel product-level image retrieval framework that leverages multimodal large language models with multi-view reasoning capabilities.
Contribution
The paper presents FashionMV, the first large-scale multi-view fashion dataset for product-level composed image retrieval, and ProCIR, a modeling framework that improves product-level image retrieval using multimodal large language models.
Findings
Alignment is the most critical mechanism for performance.
A two-stage dialogue architecture is essential for effective alignment.
Supervised fine-tuning and chain-of-thought are partially redundant for knowledge injection.
Abstract
Composed Image Retrieval (CIR) retrieves target images using a reference image paired with modification text. Despite rapid advances, all existing methods and datasets operate at the image level -- a single reference image plus modification text in, a single target image out -- while real e-commerce users reason about products shown from multiple viewpoints. We term this mismatch View Incompleteness and formally define a new Multi-View CIR task that generalizes standard CIR from image-level to product-level retrieval. To support this task, we construct FashionMV, the first large-scale multi-view fashion dataset for product-level CIR, comprising 127K products, 472K multi-view images, and over 220K CIR triplets, built through a fully automated pipeline leveraging large multimodal models. We further propose ProCIR (Product-level Composed Image Retrieval), a modeling framework built upon a…
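The shift from image-level to product-level retrieval described in the abstract can be illustrated with a minimal sketch. It is not ProCIR: the `Product` and `MultiViewQuery` types, the mean-pooling of view embeddings, and the cosine ranking are all assumptions chosen for illustration, and the composed query embedding is treated as a given input rather than computed by any model.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List

Vector = List[float]


@dataclass
class Product:
    """A product shown from several viewpoints (hypothetical type)."""
    product_id: str
    view_embeddings: List[Vector]  # one embedding per view image


@dataclass
class MultiViewQuery:
    """Product-level query: a multi-view reference plus modification text."""
    reference: Product
    modification_text: str


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average the view embeddings into one product-level vector.

    Mean pooling is an illustrative assumption, not the paper's method.
    """
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_products(query_embedding: Vector, gallery: List[Product]) -> List[str]:
    """Return product ids ordered from most to least similar to the query.

    `query_embedding` stands in for the output of some composed-query
    encoder applied to a MultiViewQuery; that encoder is out of scope here.
    """
    scored = [
        (cosine(query_embedding, mean_pool(p.view_embeddings)), p.product_id)
        for p in gallery
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [product_id for _, product_id in scored]
```

The point of the sketch is the unit of retrieval: both the reference and each gallery entry are products with several views, so the ranking returns product ids rather than individual images.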
