FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval
François Gardères, Shizhe Chen, Camille-Sovanneary Gauthier, Jean Ponce

TL;DR
FACap introduces a large-scale fashion dataset and a specialized retrieval model, significantly enhancing fine-grained fashion image retrieval performance, especially for e-commerce applications.
Contribution
The paper presents FACap, a novel large-scale fashion dataset created with automated annotation, and FashionBLIP-2, a fine-tuned model that improves fashion-specific composed image retrieval.
Findings
Enhanced retrieval accuracy on Fashion IQ and enhFashionIQ datasets.
Significant performance gains for fine-grained modification retrieval.
Demonstrated effectiveness of dataset and model in e-commerce scenarios.
Abstract
The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate…
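To make the task definition concrete, here is a minimal sketch of composed image retrieval on toy vectors. It is an illustration under stated assumptions, not the FashionBLIP-2 method: the embeddings are hand-written 3-dimensional vectors standing in for encoder outputs, and the fusion step (summing the normalized image and text embeddings) is a deliberately simple baseline chosen for clarity. The names `compose_query` and `rank_gallery` are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def compose_query(ref_image_emb, text_emb):
    """Fuse reference-image and modification-text embeddings.

    Assumption: simple additive fusion of unit vectors. Real CIR models
    learn this fusion; the sum is only a stand-in for illustration.
    """
    r = normalize(ref_image_emb)
    t = normalize(text_emb)
    return normalize([a + b for a, b in zip(r, t)])


def rank_gallery(query, gallery):
    """Return gallery image names sorted by cosine similarity to the query."""
    scores = {
        name: sum(q * g for q, g in zip(query, normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical): axis 0 ~ "dress", axis 1 ~ "long sleeves".
reference = [1.0, 0.0, 0.0]     # reference image: a dress
modification = [0.0, 1.0, 0.0]  # modification text: "with long sleeves"
gallery = {
    "same_as_reference": [1.0, 0.0, 0.0],
    "target": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

ranking = rank_gallery(compose_query(reference, modification), gallery)
print(ranking)
```

In this toy setup the image that combines the reference content with the requested modification ranks first, ahead of the unmodified reference. Fine-grained fashion vocabulary is exactly where such a naive fusion would be expected to fall short, which is the gap the abstract describes.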
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
