Are we pretraining it right? Digging deeper into visio-linguistic pretraining
Amanpreet Singh, Vedanuj Goswami, Devi Parikh

TL;DR
This paper investigates how the choice of pretraining datasets impacts the performance of visio-linguistic models, revealing that domain similarity and data generation methods significantly influence downstream task results.
Contribution
It systematically studies the effects of dataset domain similarity and data generation on pretraining effectiveness, providing insights for better dataset selection.
Findings
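Automatically generated data from a domain close to the downstream task (e.g., VQA v2) can be a better pretraining choice than "natural" data from a slightly different domain (e.g., Conceptual Captions).
Some seemingly reasonable pretraining datasets are entirely ineffective for certain downstream tasks.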
Simple pretraining design choices can achieve near state-of-the-art results.
Abstract
Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data but of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
