What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights
Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi

TL;DR
This study investigates why CLIP, pre-trained on heavily imbalanced web-scale datasets, is more robust to that imbalance than supervised learning. It finds that CLIP's dynamic classification task and descriptive language supervision both contribute to its generalizability.
Contribution
The paper provides controlled experiments uncovering mechanisms behind CLIP's robustness and offers transferable insights applicable to various learning paradigms.
Findings
CLIP's pretext task isolates bias from dominant classes.
Robustness improves with more descriptive language and larger data.
Applying these insights lets models trained on imbalanced data reach CLIP-level performance.
Abstract
Severe data imbalance naturally exists among web-scale vision-language datasets. Despite this, we find CLIP pre-trained thereupon exhibits notable robustness to the data imbalance compared to supervised learning, and demonstrates significant effectiveness in learning generalizable representations. With an aim to investigate the reasons behind this finding, we conduct controlled experiments to study various underlying factors, and reveal that CLIP's pretext task forms a dynamic classification problem wherein only a subset of classes is present in training. This isolates the bias from dominant classes and implicitly balances the learning signal. Furthermore, the robustness and discriminability of CLIP improve with more descriptive language supervision, larger data scale, and broader open-world concepts, which are inaccessible to supervised learning. Our study not only uncovers the…
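The "dynamic classification problem" described in the abstract can be illustrated with a toy loss in which the softmax is normalized over only a subset of classes rather than the full label set. The sketch below is an assumption-laden illustration, not the authors' implementation: the function names (`subset_cross_entropy`, `sample_candidates`), the toy logits, and the uniform sampling of candidate classes are all hypothetical choices made for this example.

```python
import math
import random


def log_softmax_subset(logits, candidates):
    """Log-softmax computed only over the class indices in `candidates`."""
    m = max(logits[c] for c in candidates)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(logits[c] - m) for c in candidates))
    return {c: logits[c] - log_z for c in candidates}


def subset_cross_entropy(logits, target, candidates):
    """Cross-entropy for one sample, restricted to a subset of classes.

    The target class is always added to the candidate set so the loss is defined.
    """
    cands = sorted(set(candidates) | {target})
    return -log_softmax_subset(logits, cands)[target]


def sample_candidates(num_classes, k, rng):
    """Hypothetical helper: draw k distinct class indices uniformly at random."""
    return rng.sample(range(num_classes), k)


# Toy setup (assumed values): class 0 plays the role of a dominant "head" class
# with an inflated logit; class 3 is the tail-class target.
logits = [5.0, 0.0, 0.0, 1.0]

# Full-vocabulary loss: the head class sits in the softmax denominator.
full_loss = subset_cross_entropy(logits, target=3, candidates=range(4))

# Subset loss: the head class happens to be absent from this step's candidates,
# so its inflated logit does not compete with the tail-class target.
sub_loss = subset_cross_entropy(logits, target=3, candidates=[1, 2])
```

In this toy case `sub_loss` is smaller than `full_loss` because the head class does not appear in the normalization. In a training loop, the candidate set would change from step to step (for example via `sample_candidates`), which is the sense in which the classification task is "dynamic" here.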
Taxonomy
Topics: Semantic Web and Ontologies · Imbalanced Data Classification Techniques · Data Mining Algorithms and Applications
Methods: Contrastive Language-Image Pre-training
