Masked Vision-Language Transformer in Fashion
Ge-Peng Ji, Mingcheng Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool

TL;DR
This paper introduces MVLT, a novel end-to-end masked vision-language transformer tailored for fashion, enabling fine-grained multi-modal understanding and versatile task generalization without extra pre-processing models.
Contribution
It presents the first end-to-end fashion-specific vision-language transformer using vision transformers, with masked image reconstruction for detailed understanding and broad applicability.
Findings
Retrieval improves over Kaleido-BERT (rank@5: 17%)
Recognition improves over Kaleido-BERT (accuracy: 3%)
Effective multi-modal modeling without extra pre-processing
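
For readers unfamiliar with the rank@5 figure above, the following is a minimal sketch of how a rank@K retrieval score is commonly computed. It assumes each query has exactly one ground-truth match among its candidates; the function name `rank_at_k` and the data layout are hypothetical and are not taken from the MVLT code.

```python
def rank_at_k(scores, gt_index, k):
    """Fraction of queries whose ground-truth candidate is in the top k.

    scores:   one list of similarity scores per query (higher = better match)
    gt_index: position of the ground-truth candidate for each query
    """
    hits = 0
    for row, gt in zip(scores, gt_index):
        # Indices of the k highest-scoring candidates for this query.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += gt in top_k
    return hits / len(scores)


# Two queries, three candidates each (illustrative numbers only).
scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
gt_index = [0, 2]
print(rank_at_k(scores, gt_index, k=1))
print(rank_at_k(scores, gt_index, k=2))
```

The same function covers both image-to-text and text-to-image retrieval, since only the direction in which the scores are computed changes.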
Abstract
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize vision transformer architecture for replacing the BERT in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we designed masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at https://github.com/GewelsJI/MVLT.
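
The masked image reconstruction (MIR) idea in the abstract can be illustrated with a small sketch: hide a random subset of image patches, then score a reconstruction only on the hidden ones. This is not the authors' implementation. The mask ratio, the L1 loss, the restriction of the loss to masked patches, and the names `random_patch_mask` and `masked_l1_loss` are all assumptions made for illustration; the summary above does not specify them.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Return a list of booleans, True where a patch is masked (hypothetical helper)."""
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_l1_loss(pred_patches, target_patches, mask):
    """Mean absolute error over the pixels of masked patches only (assumed loss)."""
    total, count = 0.0, 0
    for pred, target, is_masked in zip(pred_patches, target_patches, mask):
        if not is_masked:
            continue  # visible patches do not contribute to the loss
        total += sum(abs(p - t) for p, t in zip(pred, target))
        count += len(target)
    return total / count if count else 0.0


# Toy example: 16 patches, a quarter of them masked.
mask = random_patch_mask(num_patches=16, mask_ratio=0.25, seed=0)
print(sum(mask), "of", len(mask), "patches masked")
```

In a real model the predicted patches would come from the transformer's output for the masked positions; here they are plain lists of pixel values so the sketch stays self-contained.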
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Residual Connection · Dropout · Weight Decay · Adam · Softmax · Kaleido-BERT
