Pretraining Frequency Predicts Compositional Generalization of CLIP on Real-World Tasks
Thaddäus Wiedemer, Yash Sharma, Ameya Prabhu, Matthias Bethge, Wieland Brendel

TL;DR
This paper shows that CLIP's ability to generalize compositionally in real-world tasks depends on pretraining object frequencies, and that performance can be predicted from these frequencies, informing better data curation.
Contribution
It demonstrates that CLIP's compositional generalization correlates with pretraining object frequencies and introduces a method to predict performance based on these frequencies.
Findings
CLIP's performance on novel object combinations can be predicted from pretraining frequencies.
CLIP learns to disentangle and recompose objects observed during pretraining.
Balancing object occurrences in training data improves generalization.
Abstract
We investigate the success conditions for compositional generalization of CLIP models on real-world data through performance prediction. Prior work shows that CLIP requires exponentially more pretraining data for linear performance gains on individual concepts. This sample-inefficient scaling could be mitigated if CLIP systematically understood new inputs as compositions of learned components, allowing rare observations to be mapped to common concepts. To explore CLIP's compositional generalization ability, we filter retrieval corpora for samples with object combinations not present in the pretraining corpus. We show that CLIP's performance on these samples can be accurately predicted from the pretraining frequencies of individual objects. Our findings demonstrate that CLIP learns to disentangle objects observed in its pretraining data and can recompose them straightforwardly.…
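The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes each sample has already been reduced to a list of object labels, treats a "combination" as an unordered object pair, and uses the mean log frequency as a stand-in feature for a performance predictor. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def object_frequencies(pretraining_samples):
    """Count how many pretraining samples contain each object label."""
    counts = Counter()
    for objects in pretraining_samples:
        counts.update(set(objects))  # count each object once per sample
    return counts


def seen_pairs(pretraining_samples):
    """Collect every unordered object pair that co-occurs in pretraining."""
    pairs = set()
    for objects in pretraining_samples:
        for a, b in combinations(sorted(set(objects)), 2):
            pairs.add(frozenset((a, b)))
    return pairs


def novel_combination_samples(eval_samples, pairs):
    """Keep evaluation samples containing at least one unseen object pair.

    Assumption: "novel combination" means at least one pair that never
    co-occurs in pretraining; the paper may use a different criterion.
    """
    kept = []
    for objects in eval_samples:
        sample_pairs = [
            frozenset(p) for p in combinations(sorted(set(objects)), 2)
        ]
        if any(p not in pairs for p in sample_pairs):
            kept.append(objects)
    return kept


def mean_log_frequency(objects, freqs):
    """Illustrative feature: mean log(1 + frequency) of a sample's objects."""
    unique = set(objects)
    return sum(math.log(1 + freqs[o]) for o in unique) / len(unique)


# Toy data, invented for illustration only.
pretraining = [["dog", "ball"], ["cat", "sofa"], ["dog", "sofa"]]
evaluation = [["dog", "cat"], ["dog", "ball"], ["cat"]]

freqs = object_frequencies(pretraining)
novel = novel_combination_samples(evaluation, seen_pairs(pretraining))
features = [mean_log_frequency(objects, freqs) for objects in novel]
```

In this toy example only the sample pairing "dog" with "cat" is kept, since both objects occur in pretraining but never together. A regression from such per-sample features to retrieval accuracy would then play the role of the performance predictor.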
Taxonomy
Topics: Fuzzy Logic and Control Systems · Advanced Chemical Sensor Technologies · Neural dynamics and brain function
Methods: Contrastive Language-Image Pre-training
