An Empirical Study of Training End-to-End Vision-and-Language Transformers

Zi-Yi Dou; Yichong Xu; Zhe Gan; Jianfeng Wang; Shuohang Wang; Lijuan Wang; Chenguang Zhu; Pengchuan Zhang; Lu Yuan; Nanyun Peng; Zicheng Liu; Michael Zeng

arXiv:2111.02387·cs.CV·March 21, 2022·22 cites

TL;DR

This paper investigates how to design and pre-train fully transformer-based vision-and-language models end-to-end, and reports 77.64% accuracy on VQAv2 with pre-training on only 4M images.

Contribution

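The paper introduces METER (Multimodal End-to-end TransformER), a framework for training VL transformers end-to-end, and analyzes how the choice of vision encoder, text encoder, multimodal fusion module, architecture, and pre-training objectives affects downstream performance.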

Findings

01. METER achieves 77.64% accuracy on VQAv2 with pre-training on only 4M images.

02. It surpasses the previous region-feature-based models by 1.04%.

03. Further scaling improves accuracy to 80.54%.

Abstract

Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and…
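The contrast between the two fusion modules named in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the paper's implementation: attention is single-head with no learned projections, and residual connections, layer normalization, and feed-forward layers are omitted. The function names `attend`, `merged_attention`, and `co_attention`, along with the toy feature shapes, are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention (assumption: no learned projections)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values

def merged_attention(text, image):
    """Concatenate both modalities and run one self-attention over the joint sequence."""
    joint = np.concatenate([text, image], axis=0)
    out = attend(joint, joint)
    return out[: len(text)], out[len(text):]

def co_attention(text, image):
    """Keep two streams: self-attention per modality, then cross-attention to the other."""
    text_self = attend(text, text)
    image_self = attend(image, image)
    # Each stream queries the other modality's input features.
    text_out = attend(text_self, image)
    image_out = attend(image_self, text)
    return text_out, image_out

# Toy inputs: 5 text tokens and 9 image patches, both with 8-dimensional features.
rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))
image = rng.normal(size=(9, 8))

text_merged, image_merged = merged_attention(text, image)
text_co, image_co = co_attention(text, image)
```

In the merged variant, one set of attention weights spans both modalities, whereas the co-attention variant keeps separate streams that exchange information only through cross-attention.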

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · WordPiece · Layer Normalization · Residual Connection · Dense Connections · Attention Dropout · Softmax