Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets

Marianna Nezhurina; Tomer Porian; Giovanni Puccetti; Tommie Kerssies; Romain Beaumont; Mehdi Cherti; Jenia Jitsev

arXiv:2506.04598·cs.LG·June 6, 2025

TL;DR

This paper develops scaling laws for language-vision models to enable systematic comparison of models and datasets, demonstrating MaMMUT's superior scaling and efficiency over CLIP across multiple tasks and datasets.

Contribution

It introduces the first comprehensive scaling laws for CLIP and MaMMUT models, allowing for accurate model and dataset comparison across scales and tasks.

Findings

01. MaMMUT shows stronger improvement with scale than CLIP.

02. Scaling laws are consistent across different downstream tasks.

03. Deriving scaling laws with a constant learning rate schedule reduces compute cost.

Abstract

In studies of transferable learning, scaling laws are obtained for various important foundation models to predict their properties and performance at larger scales. We show here how scaling law derivation can also be used for model and dataset comparison, making it possible to decide which procedure is preferable for pre-training. For the first time, full scaling laws based on dense measurements across a wide span of model scales and samples-seen scales are derived for two important language-vision learning procedures, CLIP and MaMMUT, which use either a contrastive-only loss or a combined contrastive and captioning (text-generative) loss. After ensuring sufficient prediction accuracy on held-out points, we use the derived scaling laws to compare both models, obtaining evidence for MaMMUT's stronger improvement with scale and better sample efficiency than standard CLIP. To strengthen validity of the comparison, we show scaling…
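The comparison idea in the abstract can be illustrated with a minimal sketch. It assumes a plain power law, error ≈ a · C^(-b), fitted by linear regression in log-log space; the functional form actually used in the paper may differ. The function names, the two procedures "A" and "B", and all numbers are hypothetical and synthetic, not measurements from the paper.

```python
import numpy as np


def fit_power_law(compute, error):
    """Fit error ~= a * compute**(-b) by least squares in log-log space.

    Assumed functional form for illustration only.
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict(a, b, compute):
    """Evaluate the fitted power law at the given compute scale(s)."""
    return a * np.asarray(compute, dtype=float) ** (-b)


def crossover(a1, b1, a2, b2):
    """Compute scale at which the two fitted curves intersect (needs b1 != b2)."""
    return (a2 / a1) ** (1.0 / (b2 - b1))


# Synthetic, noise-free "measurements" for two hypothetical procedures.
compute = np.array([1e9, 1e10, 1e11, 1e12])
error_a = 2.0 * compute ** (-0.05)  # procedure A: lower error at small scale
error_b = 4.0 * compute ** (-0.08)  # procedure B: steeper improvement with scale

a_a, b_a = fit_power_law(compute, error_a)
a_b, b_b = fit_power_law(compute, error_b)

# Extrapolate both fits to a larger, held-out compute scale and compare.
held_out = 1e13
pred_a = float(predict(a_a, b_a, held_out))
pred_b = float(predict(a_b, b_b, held_out))
c_star = crossover(a_a, b_a, a_b, b_b)

print(f"A: a={a_a:.3f}, b={b_a:.3f}, predicted error at 1e13: {pred_a:.3f}")
print(f"B: a={a_b:.3f}, b={b_b:.3f}, predicted error at 1e13: {pred_b:.3f}")
print(f"Fitted curves cross near compute = {c_star:.3g}")
```

In this toy setting the procedure with the larger exponent b overtakes the other beyond the crossover scale, which is the kind of conclusion a fitted scaling law supports and a single-scale comparison would miss. With real measurements, the fit should first be validated on held-out points, as the abstract describes.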

Videos

Scaling Laws for Robust Comparison of Open Foundation Language-Vision Models and Datasets · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI

Methods: Contrastive Language-Image Pre-training