Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields, Casey Kennington

TL;DR
This paper introduces Renaissance, a flexible framework for pretraining and evaluating vision-language (VL) encoders, and shows that freezing large parts of a model during pretraining saves significant compute at little to no cost to downstream performance.
Contribution
The paper contributes Renaissance, a VL evaluation framework, together with experimental insights into pretraining strategies for VL encoders, covering model freezing and the choice of a vision-based versus text-based architecture.
Findings
Freezing large parts of VL models during pretraining saves significant compute at little to no cost to downstream performance.
Basing VL transformers on vision or text models affects training dynamics and outcomes.
Renaissance enables flexible creation, training, and evaluation of VL transformers.
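The freezing idea in the findings above can be illustrated with a minimal PyTorch sketch. It is not taken from Renaissance: the `TinyVLEncoder` class, its layer sizes, and the choice to freeze both unimodal towers while training only the fusion layer are assumptions made for illustration. The summary does not say which parts the authors freeze.

```python
import torch
from torch import nn


class TinyVLEncoder(nn.Module):
    """Hypothetical toy two-tower encoder; NOT the Renaissance architecture."""

    def __init__(self, dim: int = 32):
        super().__init__()
        # Stand-ins for a vision tower and a text tower (sizes are arbitrary).
        self.vision = nn.Sequential(nn.Linear(48, dim), nn.GELU(), nn.Linear(dim, dim))
        self.text = nn.Sequential(nn.Embedding(100, dim), nn.Linear(dim, dim))
        # A single cross-modal layer applied to the concatenated sequence.
        self.fusion = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        v = self.vision(patches)    # (batch, n_patches, dim)
        t = self.text(token_ids)    # (batch, n_tokens, dim)
        return self.fusion(torch.cat([v, t], dim=1))


def count_params(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total


model = TinyVLEncoder()

# Assumption: freeze both unimodal towers, leave only the fusion layer trainable.
model.vision.requires_grad_(False)
model.text.requires_grad_(False)

# Hand the optimizer only the parameters that still require gradients.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One dummy training step on random data with a placeholder loss.
patches = torch.randn(2, 5, 48)
token_ids = torch.randint(0, 100, (2, 7))
loss = model(patches, token_ids).pow(2).mean()
loss.backward()
optimizer.step()

trainable, total = count_params(model)
print(f"trainable parameters: {trainable} of {total}")
```

Frozen parameters receive no gradients and need no optimizer state, which is where the compute and memory savings come from; the forward pass through the frozen towers still runs.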
Abstract
In the past several years there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. Additionally, the limited programming tools available for modeling make conducting VL research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that we can save significant compute at little to no cost to downstream performance, by freezing large parts of VL models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a…
Taxonomy
Topics: Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
