Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields; Casey Kennington

arXiv:2411.06657·cs.CV·February 26, 2026

TL;DR

This paper introduces Renaissance, a flexible framework for creating, training, and evaluating vision-language (VL) encoders, and shows that freezing large parts of a VL model during pretraining saves significant compute at little to no cost to downstream performance.

Contribution

The paper contributes a VL evaluation framework, Renaissance, together with experimental insight into two pretraining choices for VL encoders: freezing large parts of the model during pretraining, and basing the VL transformer on a vision model versus a text model.

Findings

01 Freezing large parts of VL models reduces compute with minimal performance loss.

02 Basing VL transformers on vision or text models affects training dynamics and outcomes.

03 Renaissance enables flexible creation, training, and evaluation of VL transformers.

Abstract

In the past several years, there has been an explosion of available models for vision-language (VL) tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. Additionally, the limited programming tools available for modeling make conducting VL research more difficult than necessary. In this paper, we seek to answer several questions related to the pretraining of VL encoders through meta-analysis. To conduct these experiments, we introduce a VL evaluation framework called Renaissance. In our first set of experiments, we show that we can save significant compute at little to no cost to downstream performance by freezing large parts of VL models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Renaissance offers a…
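The freezing idea in the first set of experiments can be illustrated with a minimal PyTorch sketch. Everything here is an assumption made for illustration: `TinyVLEncoder`, its layer sizes, the choice of which submodules to freeze, and the placeholder loss are hypothetical and are not taken from the Renaissance codebase or the paper's setup. The sketch only shows the general mechanism: parameters with `requires_grad` set to `False` receive no gradients and are left out of the optimizer, so less work is done in the backward pass and optimizer step.

```python
import torch
from torch import nn


class TinyVLEncoder(nn.Module):
    """Toy two-stream encoder (hypothetical; not the Renaissance architecture)."""

    def __init__(self, dim: int = 32) -> None:
        super().__init__()
        # Stand-ins for a pretrained vision module and a pretrained text module.
        self.vision = nn.Sequential(nn.Linear(48, dim), nn.GELU(), nn.Linear(dim, dim))
        self.text = nn.Embedding(100, dim)
        # Stand-in for the layers that fuse the two modalities.
        self.cross = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, dim_feedforward=64, batch_first=True
        )

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        v = self.vision(patches)      # (batch, n_patches, dim)
        t = self.text(token_ids)      # (batch, n_tokens, dim)
        return self.cross(torch.cat([v, t], dim=1))


def freeze(module: nn.Module) -> None:
    """Exclude every parameter of `module` from gradient computation."""
    for p in module.parameters():
        p.requires_grad_(False)


def trainable_fraction(model: nn.Module) -> float:
    """Share of parameters that will still be updated during training."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total


model = TinyVLEncoder()
freeze(model.vision)  # assumption: which parts to freeze is chosen arbitrarily here
freeze(model.text)

# Hand only the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

# One training step on random data with a placeholder loss.
patches = torch.randn(2, 5, 48)
token_ids = torch.randint(0, 100, (2, 7))
out = model(patches, token_ids)
loss = out.pow(2).mean()
loss.backward()
optimizer.step()

print(f"trainable fraction: {trainable_fraction(model):.2f}")
```

After `loss.backward()`, the frozen `vision` and `text` parameters have no `.grad` attribute populated, while the `cross` parameters do. In a real setup the frozen modules would typically be loaded from pretrained checkpoints rather than randomly initialized as they are in this toy example.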

Code & Models

Repositories

bsu-slim/renaissance (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Sparse Evolutionary Training