Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs

Emanuele Bugliarello; Ryan Cotterell; Naoaki Okazaki; Desmond Elliott

arXiv:2011.15124·cs.CL·June 1, 2021

TL;DR

This paper analyses vision-and-language BERTs: it unifies single-stream and dual-stream architectures under a single framework, then runs controlled experiments on five models to measure the impact of training data, hyperparameters, and the embedding layer.

Contribution

It introduces a unified theoretical framework for vision-and-language BERTs and empirically investigates the factors influencing their performance.

Findings

01 Training data and hyperparameters significantly affect model performance.

02 Embedding layer plays a crucial role in V&L BERTs.

03 Differences between models are largely due to training setup, not architecture.

Abstract

Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
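
One way to picture the unification mentioned in the abstract is to treat the intra-modal and inter-modal attention of a dual-stream encoder as restrictions of single-stream attention over a concatenated [text; image] sequence. The sketch below is only an illustration of that idea under simplifying assumptions, not the paper's formulation: the function names `attention_mask` and `masked_attention` are hypothetical, there are no learned projections, and both modalities are assumed to contribute at least one token. Real dual-stream models also keep separate parameters per modality, which this sketch ignores.

```python
import math


def attention_mask(n_text, n_image, mode):
    """Build a 0/1 mask over a concatenated [text; image] sequence.

    mode (illustrative names, not from the paper):
      "single" - every token attends to every token
      "intra"  - tokens attend only within their own modality
      "inter"  - tokens attend only to the other modality
    """
    n = n_text + n_image
    is_text = [i < n_text for i in range(n)]
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            same_modality = is_text[q] == is_text[k]
            if mode == "single":
                allowed = True
            elif mode == "intra":
                allowed = same_modality
            elif mode == "inter":
                allowed = not same_modality
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention where masked-out pairs get zero weight."""
    dim = len(queries[0])
    outputs = []
    for query, mask_row in zip(queries, mask):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if allowed else -math.inf
            for key, allowed in zip(keys, mask_row)
        ]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two text tokens followed by one image token (toy features).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
inter = attention_mask(2, 1, "inter")
cross_modal = masked_attention(tokens, tokens, tokens, inter)
```

In this toy setup the "intra" and "inter" masks partition the "single" mask, so single-stream attention covers every token pair that either kind of dual-stream layer would use.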
