Meta CLIP 2: A Worldwide Scaling Recipe
Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, Xinlei Chen, Zhuang Liu, Saining Xie, Wen-tau Yih, Shang-Wen Li, Hu Xu

TL;DR
Meta CLIP 2 introduces a scalable training recipe for CLIP models on worldwide web data, effectively handling multilingual data and improving performance across diverse benchmarks without specialized architecture changes.
Contribution
It presents the first comprehensive recipe for training CLIP from scratch on global web-scale data, addressing multilingual challenges and surpassing previous models in zero-shot tasks.
Findings
- Meta CLIP 2 outperforms its English-only counterpart on zero-shot ImageNet classification by 0.8%.
- It sets a new state of the art on multilingual benchmarks such as CVQA and Babel-ImageNet.
- It handles non-English data effectively without any architecture modifications.
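To make the zero-shot classification setting in the findings concrete, here is a minimal sketch of how a CLIP-style model assigns a label: the image embedding is compared with the text embedding of each class prompt, and the most similar class wins. The embeddings below are hand-written toy vectors, and the function names and the `temperature` value are assumptions for illustration only; a real run would take the embeddings from the model's image and text encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Return (predicted class index, class probabilities).

    image_emb: shape (d,), embedding of one image.
    text_embs: shape (num_classes, d), one embedding per class prompt.
    temperature: hypothetical softmax temperature, not taken from the paper.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    text_embs = l2_normalize(np.asarray(text_embs, dtype=np.float64))
    logits = text_embs @ image_emb / temperature  # cosine similarity per class
    logits = logits - logits.max()                # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return int(np.argmax(probs)), probs

# Toy stand-ins for encoder outputs (illustrative values only).
class_names = ["cat", "dog", "car"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.1, 0.9, 0.2])

pred, probs = zero_shot_classify(image_emb, text_embs)
print(class_names[pred])
```

Because the class prompts are plain text, the same procedure applies to prompts written in any language the text encoder supports, which is the setting that multilingual benchmarks such as Babel-ImageNet evaluate.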
Abstract
Contrastive Language-Image Pretraining (CLIP) is a popular foundation model, supporting tasks that range from zero-shot classification and retrieval to serving as an encoder for multimodal large language models (MLLMs). Although CLIP has been trained successfully on billion-scale image-text pairs from the English world, scaling its training further to learn from worldwide web data remains challenging: (1) no curation method is available to handle data points from the non-English world; (2) the English performance of existing multilingual CLIP models is worse than that of their English-only counterparts, i.e., the "curse of multilinguality" that is common in LLMs. Here, we present Meta CLIP 2, the first recipe for training CLIP from scratch on worldwide web-scale image-text pairs. To generalize our findings, we conduct rigorous ablations with minimal changes that are necessary to address the above challenges and present a recipe enabling…
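For readers unfamiliar with the contrastive pretraining the abstract refers to, the following is a minimal NumPy sketch of the symmetric image-text contrastive loss commonly associated with CLIP-style training. It is not taken from the Meta CLIP 2 code; the function names, the fixed `temperature` of 0.07, and the random toy embeddings are assumptions for illustration.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matched image-text pairs.

    Row i of image_embs and row i of text_embs are assumed to be a true pair;
    every other combination in the batch acts as a negative.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy check: perfectly aligned pairs should score a lower loss than mismatched ones.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(img, img)         # each "text" equals its image
shuffled = clip_contrastive_loss(img, img[::-1])  # pairs deliberately mismatched
print(aligned < shuffled)
```

The loss rewards embeddings in which each image is closest to its own caption and vice versa, independent of the caption's language.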
Code & Models
- 🤗 facebook/metaclip-2-worldwide-giant-378 · model · 8.8k dl · ♡ 12
- 🤗 facebook/metaclip-2-worldwide-huge-quickgelu · model · 48k dl · ♡ 17
- 🤗 facebook/metaclip-2-worldwide-huge-378 · model · 591 dl · ♡ 6
- 🤗 timm/vit_gigantic_patch14_clip_378.metaclip2_worldwide · model · 144 dl · ♡ 2
- 🤗 timm/vit_huge_patch14_clip_378.metaclip2_worldwide · model · 190 dl · ♡ 1
- 🤗 timm/vit_huge_patch14_clip_224.metaclip2_worldwide · model · 361 dl · ♡ 1
- 🤗 timm/vit_gigantic_patch14_clip_224.metaclip2_worldwide · model · 96 dl · ♡ 1
- 🤗 facebook/metaclip-2-worldwide-giant · model · 555 dl · ♡ 7
- 🤗 onnx-community/metaclip-2-worldwide-huge-378-ONNX · model · 5 dl
- 🤗 facebook/metaclip-2-mt5-worldwide-b32 · model · 2.9k dl · ♡ 6
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
