Masked Vision-Language Transformer in Fashion

Ge-Peng Ji; Mingcheng Zhuge; Dehong Gao; Deng-Ping Fan; Christos Sakaridis; Luc Van Gool

arXiv:2210.15110·cs.CV·June 6, 2023

TL;DR

This paper introduces MVLT, a novel end-to-end masked vision-language transformer tailored for fashion, enabling fine-grained multi-modal understanding and versatile task generalization without extra pre-processing models.

Contribution

It presents the first end-to-end fashion-specific vision-language transformer using vision transformers, with masked image reconstruction for detailed understanding and broad applicability.

Findings

01 Improved retrieval performance over Kaleido-BERT (rank@5: 17%)

02 Enhanced recognition accuracy over Kaleido-BERT (3%)

03 Effective multi-modal modeling without extra pre-processing models
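
The retrieval finding is reported as rank@5. As a rough illustration of how such a metric is commonly computed, the sketch below assumes that each query comes with a list of candidate scores, exactly one of which belongs to the ground-truth match, and that rank@k is the fraction of queries whose match lands in the top k. The function name `rank_at_k` and the toy scores are hypothetical; the paper's exact evaluation protocol is not described on this page.

```python
from typing import Sequence


def rank_at_k(score_lists: Sequence[Sequence[float]],
              positive_indices: Sequence[int],
              k: int) -> float:
    """Fraction of queries whose ground-truth candidate is ranked in the top k.

    Assumption: higher score means a better match, and each query has
    exactly one ground-truth candidate at positive_indices[i].
    """
    if not score_lists or len(score_lists) != len(positive_indices):
        raise ValueError("need one positive index per non-empty list of queries")
    hits = 0
    for scores, pos in zip(score_lists, positive_indices):
        # Rank is 1 plus the number of candidates scoring strictly higher.
        rank = 1 + sum(1 for s in scores if s > scores[pos])
        if rank <= k:
            hits += 1
    return hits / len(score_lists)


# Toy example: two queries, three candidates each, ground truth at index 0.
scores = [
    [0.9, 0.1, 0.3],  # ground truth is ranked 1st
    [0.2, 0.8, 0.5],  # ground truth is ranked 3rd
]
print(rank_at_k(scores, [0, 0], k=1))  # 0.5
print(rank_at_k(scores, [0, 0], k=3))  # 1.0
```
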

Abstract

We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT in the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
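
To make the idea of masked image reconstruction more concrete, here is a minimal sketch in plain Python of the general recipe: split an image into patches, hide a random subset, and score a reconstruction only on the hidden patches. It is not the authors' implementation. The helper names (`patchify`, `mask_patches`, `reconstruction_loss`), the zero fill value, the 50% mask ratio, and the mean-absolute-error loss are all assumptions chosen for illustration; the official code is at https://github.com/GewelsJI/MVLT.

```python
import random
from typing import List, Tuple

Patch = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Patch]:
    """Split an H x W grayscale image into flattened, non-overlapping patches."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def mask_patches(patches: List[Patch], ratio: float, rng: random.Random,
                 fill: float = 0.0) -> Tuple[List[Patch], List[int]]:
    """Replace a random subset of patches with a constant fill value.

    Assumption: masking is uniform at random with a fixed ratio.
    Returns the corrupted patches and the sorted indices that were masked.
    """
    n_masked = int(len(patches) * ratio)
    masked_idx = sorted(rng.sample(range(len(patches)), n_masked))
    masked = set(masked_idx)
    corrupted = [[fill] * len(p) if i in masked else list(p)
                 for i, p in enumerate(patches)]
    return corrupted, masked_idx


def reconstruction_loss(pred: List[Patch], target: List[Patch],
                        masked_idx: List[int]) -> float:
    """Mean absolute error, computed over the masked patches only."""
    if not masked_idx:
        return 0.0
    total, count = 0.0, 0
    for i in masked_idx:
        total += sum(abs(a - b) for a, b in zip(pred[i], target[i]))
        count += len(target[i])
    return total / count


# Toy 4 x 4 image split into four 2 x 2 patches, half of them masked.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)
corrupted, masked_idx = mask_patches(patches, ratio=0.5, rng=random.Random(0))

# A model would take `corrupted` (plus text tokens) and predict the originals;
# here the untouched corrupted input stands in for a poor prediction.
print(masked_idx)
print(reconstruction_loss(corrupted, patches, masked_idx))
```

In a real training loop the prediction would come from the transformer's output for the masked positions, and the loss would be back-propagated alongside the other pre-training objectives.
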

Code & Models

Repositories

gewelsji/mvlt (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Residual Connection · Dropout · Weight Decay · Adam · Softmax · Kaleido-BERT