TL;DR
This paper provides a comprehensive analysis of vision-and-language BERT models, unifying different architectures under a single framework and empirically examining their differences to understand the impact of training data, hyperparameters, and embedding layers.
Contribution
It introduces a unified theoretical framework for vision-and-language BERTs and empirically investigates the factors influencing their performance.
Findings
Training data and hyperparameters account for most of the differences between reported results.
The embedding layer plays a crucial role in V&L BERTs.
Differences between models are largely due to training setup, not architecture.
Abstract
Large-scale pretraining and task-specific fine-tuning is now the standard methodology for many tasks in computer vision and natural language processing. Recently, a multitude of methods have been proposed for pretraining vision and language BERTs to tackle challenges at the intersection of these two key areas of AI. These models can be categorised into either single-stream or dual-stream encoders. We study the differences between these two categories, and show how they can be unified under a single theoretical framework. We then conduct controlled experiments to discern the empirical differences between five V&L BERTs. Our experiments show that training data and hyperparameters are responsible for most of the differences between the reported results, but they also reveal that the embedding layer plays a crucial role in these massive models.
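As an illustration of how single-stream and dual-stream encoders might be related, the sketch below treats both as attention over one concatenated text-plus-vision sequence and differs only in which query/key pairs are allowed. This is a simplified picture under stated assumptions, not the paper's actual framework: the function name `attention_mask`, the mode names, and the reduction of dual-stream layers to "intra-modal" and "inter-modal" masks are all hypothetical choices made for this example.

```python
def attention_mask(n_text, n_vision, mode):
    """Return an n x n 0/1 matrix over a concatenated [text; vision] sequence.

    Entry [i][j] == 1 means query position i may attend to key position j.
    The mode names are illustrative assumptions, not terms from the paper.
    """
    n = n_text + n_vision
    # The first n_text positions are text tokens, the rest are visual tokens.
    is_text = [i < n_text for i in range(n)]
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            same_modality = is_text[i] == is_text[j]
            if mode == "single_stream":
                allowed = True               # every token sees every token
            elif mode == "intra_modal":
                allowed = same_modality      # text->text and vision->vision only
            elif mode == "inter_modal":
                allowed = not same_modality  # text->vision and vision->text only
            else:
                raise ValueError(f"unknown mode: {mode}")
            row.append(1 if allowed else 0)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for mode in ("single_stream", "intra_modal", "inter_modal"):
        print(mode)
        for row in attention_mask(2, 3, mode):
            print(" ", row)
```

In this toy view, the intra-modal and inter-modal masks partition the single-stream mask: every query/key pair allowed in a single-stream layer is allowed in exactly one of the two restricted layer types. That is one intuition for why the two categories can be described within a common formulation, although the paper's own unification may differ in its details.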
