Meta CLIP 2: A Worldwide Scaling Recipe

Yung-Sung Chuang; Yang Li; Dong Wang; Ching-Feng Yeh; Kehan Lyu; Ramya Raghavendra; James Glass; Lifei Huang; Jason Weston; Luke Zettlemoyer; Xinlei Chen; Zhuang Liu; Saining Xie; Wen-tau Yih; Shang-Wen Li; Hu Xu

arXiv:2507.22062 · cs.CV · August 4, 2025

TL;DR

Meta CLIP 2 introduces a scalable training recipe for CLIP models on worldwide web data, effectively handling multilingual data and improving performance across diverse benchmarks without specialized architecture changes.

Contribution

It presents the first comprehensive recipe for training CLIP from scratch on global web-scale data, addressing multilingual challenges and surpassing previous models in zero-shot tasks.

Findings

01 Outperforms English-only CLIP in zero-shot ImageNet classification by 0.8%.

02 Achieves new state-of-the-art on multilingual benchmarks like CVQA and Babel-ImageNet.

03 Demonstrates effective handling of non-English data without architecture modifications.
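
For readers unfamiliar with the zero-shot classification setting named in the findings, the following is a minimal sketch of how a CLIP-style model is typically used for it: class names are embedded as text, the image is embedded separately, and the class whose text embedding has the highest cosine similarity to the image embedding is predicted. This is not the Meta CLIP 2 code. The function names, the `temperature` value, and the hand-written 3-dimensional embeddings are all assumptions chosen for illustration; a real model would produce the embeddings with its image and text encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_emb, class_text_embs, temperature=100.0):
    """Pick the class whose text embedding is closest to the image embedding.

    image_emb:        shape (d,), hypothetical image-encoder output
    class_text_embs:  shape (num_classes, d), hypothetical text-encoder outputs
    temperature:      assumed logit scale, not a value taken from the paper
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(class_text_embs)
    logits = temperature * (txt @ img)      # cosine similarity per class
    logits = logits - logits.max()          # numerical stability for softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs


# Toy stand-ins for encoder outputs (assumptions, not real model embeddings).
class_names = ["cat", "dog", "car"]
class_text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.1, 0.9, 0.2])

pred, probs = zero_shot_classify(image_emb, class_text_embs)
print(class_names[pred])  # prints "dog"
```

No labeled training images for the target classes are involved, which is what makes the evaluation "zero-shot"; in a multilingual setting the class prompts could be written in any language the text encoder supports.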

Abstract

Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting applications from zero-shot classification and retrieval to encoders for multimodal large language models (MLLMs). Although CLIP is successfully trained on billion-scale image-text pairs from the English world, scaling CLIP's training further to learn from worldwide web data is still challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling…

Videos

Meta CLIP 2: A Worldwide Scaling Recipe · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling