Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models
Helen Qu, Sang Michael Xie

TL;DR
This paper investigates how word co-occurrence statistics in training data influence the compositional generalization of multimodal models like CLIP, revealing strong correlations between PMI and zero-shot accuracy across synthetic and natural images.
Contribution
It demonstrates that word co-occurrence, measured by PMI, significantly impacts model accuracy and transfers across different multimodal models, highlighting the need for improved compositional generalization methods.
Findings
Strong correlation (r=0.97) between PMI and CLIP accuracy on synthetic images.
Reproduced the PMI-accuracy correlation (r=0.75) in natural images via image editing.
Transfer of PMI-accuracy relationship observed in LMMs built on CLIP.
Abstract
CLIP and large multimodal models (LMMs) have better accuracy on examples involving concepts that are highly represented in the training data. However, the effect of concept combinations in the training data on compositional generalization is largely unclear -- for instance, how does accuracy vary when a common object appears in an uncommon pairing with another object? In this paper, we investigate how word co-occurrence statistics in the pretraining dataset (a proxy for co-occurrence of visual concepts) impact CLIP/LMM performance. To disentangle the effects of word co-occurrence frequencies from single-word frequencies, we measure co-occurrence with pointwise mutual information (PMI), which normalizes the joint probability of two words co-occurring by the probability of their co-occurring independently. Using synthetically generated images with a variety of concept pairs, we show a strong…
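To make the PMI measure in the abstract concrete, the standard definition is PMI(a, b) = log [ p(a, b) / (p(a) p(b)) ]: it is positive when two words co-occur more often than their individual frequencies would predict, and negative when they co-occur less often. The sketch below is only an illustration under stated assumptions. It counts co-occurrence at the caption level, tokenizes on whitespace, and uses a four-caption toy corpus; the function name `pmi_table` and the toy data are hypothetical, and the paper's actual counting procedure is not described in this summary.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(captions):
    """Caption-level PMI (log base 2) for every word pair seen together at least once.

    Assumption: a word "occurs" in a caption if it appears one or more times,
    so probabilities are fractions of captions, not of tokens.
    """
    n = len(captions)
    word_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        # Deduplicate and sort so each pair is stored under one canonical key.
        words = sorted(set(caption.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))

    pmi = {}
    for (a, b), joint in pair_counts.items():
        p_ab = joint / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy corpus: "dog" is frequent, but its pairing with "sofa" is rare.
toy_captions = ["dog frisbee", "dog frisbee", "dog sofa", "cat sofa"]
toy_pmi = pmi_table(toy_captions)
for pair, value in sorted(toy_pmi.items()):
    print(pair, round(value, 3))
```

In this toy corpus the frequent word "dog" has positive PMI with "frisbee" and negative PMI with "sofa", which is the kind of contrast the abstract describes: the normalization separates how common a pairing is from how common each word is on its own.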
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Language and cultural evolution
