An Empirical Study of Training End-to-End Vision-and-Language Transformers
Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan, Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu,, Michael Zeng

TL;DR
This paper investigates how to design and pre-train fully transformer-based vision-and-language models end-to-end, achieving state-of-the-art results on VQA tasks through comprehensive experiments and insights.
Contribution
It introduces METER, a framework for end-to-end training of VL transformers, and provides detailed analysis of model components and training strategies for improved performance.
Findings
METER achieves 77.64% accuracy on VQAv2 with 4M images pre-training.
It surpasses previous region-feature-based models by 1.04%.
Further scaling improves accuracy to 80.54%.
Abstract
Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Refunds@Expedia|||How do I get a full refund from Expedia? · WordPiece · Layer Normalization · Residual Connection · Dense Connections · Attention Dropout · Softmax
