FashionFAE: Fine-grained Attributes Enhanced Fashion Vision-Language Pre-training

Jiale Huang; Dehong Gao; Jinxia Zhang; Zechao Zhan; Yang Hu; Xin Wang

arXiv:2412.19997 · cs.CV · January 14, 2025

TL;DR

FashionFAE introduces a vision-language pre-training approach that emphasizes fine-grained attributes such as texture and material, significantly improving performance on fashion item retrieval and recognition tasks.

Contribution

It proposes attribute-focused text prediction and image reconstruction tasks to enhance fine-grained understanding in fashion vision-language models.
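This listing does not say how the attribute-focused text prediction task is constructed. As a rough illustration only, one common way to emphasize attributes in a masked-prediction objective is to mask attribute words more often than other words, so the model is asked to recover them more frequently. The sketch below assumes that setup; the function name `mask_attribute_tokens`, the probabilities, and the attribute vocabulary are hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # placeholder token; real tokenizers define their own


def mask_attribute_tokens(tokens, attribute_vocab, attr_prob=0.8,
                          base_prob=0.15, seed=None):
    """Mask attribute words with a higher probability than other words.

    Returns (masked_tokens, labels). labels[i] holds the original token
    where position i was masked, and None elsewhere.
    Assumption: this is an illustrative masking scheme, not the paper's.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        # Attribute words (e.g. materials, textures) get the higher rate.
        p = attr_prob if tok.lower() in attribute_vocab else base_prob
        if rng.random() < p:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


caption = "long sleeve cotton shirt with striped pattern".split()
attributes = {"cotton", "striped"}  # hypothetical attribute vocabulary

# With attr_prob=1.0 and base_prob=0.0 only attribute words are masked.
masked, labels = mask_attribute_tokens(
    caption, attributes, attr_prob=1.0, base_prob=0.0, seed=0
)
print(masked)
# ['long', 'sleeve', '[MASK]', 'shirt', 'with', '[MASK]', 'pattern']
```

In a real pipeline the masked positions would feed a prediction loss over the vocabulary; the extreme probabilities here are chosen only to make the behavior deterministic and easy to inspect.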

Findings

01 Achieves 2.9% and 5.2% improvements in retrieval accuracy.

02 Attains 1.6% average improvement in recognition tasks.

03 Outperforms state-of-the-art methods on fashion datasets.

Abstract

Large-scale Vision-Language Pre-training (VLP) has demonstrated remarkable success in the general domain. However, in the fashion domain, items are distinguished by fine-grained attributes like texture and material, which are crucial for tasks such as retrieval. Existing models often fail to leverage these fine-grained attributes from both text and image modalities. To address the above issues, we propose a novel approach for the fashion domain, Fine-grained Attributes Enhanced VLP (FashionFAE), which focuses on the detailed characteristics of fashion data. An attribute-emphasized text prediction task is proposed to predict fine-grained attributes of the items. This forces the model to focus on the salient attributes from the text modality. Additionally, a novel attribute-promoted image reconstruction task is proposed, which further enhances the fine-grained ability of the model by…

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Focus