FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval

François Gardères; Shizhe Chen; Camille-Sovanneary Gauthier; Jean Ponce

arXiv:2507.07135 · cs.LG · July 11, 2025

TL;DR

The paper introduces FACap, a large-scale fashion dataset, and FashionBLIP-2, a specialized retrieval model, which together improve fine-grained composed image retrieval for fashion, particularly in e-commerce applications.

Contribution

The paper presents FACap, a novel large-scale fashion dataset created with automated annotation, and FashionBLIP-2, a fine-tuned model that improves fashion-specific composed image retrieval.

Findings

01. Enhanced retrieval accuracy on Fashion IQ and enhFashionIQ datasets.

02. Significant performance gains for fine-grained modification retrieval.

03. Demonstrated effectiveness of dataset and model in e-commerce scenarios.

Abstract

The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate…
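To make the task definition concrete, here is a minimal sketch of CIR as a ranking problem: a query built from a reference image and a modification text is scored against a gallery of candidate images. It is not the method from the paper. The fusion rule (summing normalized embeddings), the function names, and the 3-dimensional toy embeddings are all assumptions made for illustration; a real system would obtain embeddings from a pretrained VLM.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def compose_query(image_emb, text_emb):
    """Hypothetical fusion: add the unit-length image and text embeddings.

    This simple late-fusion baseline is an assumption for illustration,
    not the composition used by any particular model.
    """
    img = normalize(image_emb)
    txt = normalize(text_emb)
    return normalize([i + t for i, t in zip(img, txt)])


def rank_gallery(query_emb, gallery):
    """Return gallery item names sorted by cosine similarity to the query."""
    scores = {
        name: sum(q * g for q, g in zip(query_emb, normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (made up): axis 0 ~ "garment identity", axis 1 ~ "requested change".
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "same_dress": [1.0, 0.0, 0.0],
    "dress_with_change": [1.0, 1.0, 0.0],
    "unrelated_item": [0.0, 0.0, 1.0],
}

ranking = rank_gallery(compose_query(reference_image, modification_text), gallery)
print(ranking)
```

In this toy setup the candidate that keeps the reference garment and also reflects the requested change ranks first, which is the behavior a CIR system aims for.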

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques