The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
Chenwei Wu, Li Erran Li, Stefano Ermon, Patrick Haffner, Rong Ge, Zaiwei Zhang

TL;DR
This paper investigates how linguistic priors influence the measurement of compositional generalization in vision-language models, revealing that current improvements mainly rely on priors rather than visual information, and introduces a new metric to address this.
Contribution
The paper identifies the reliance on linguistic priors in current compositionality assessments and proposes a new metric that minimizes this bias.
Findings
Current methods depend heavily on linguistic priors.
A new metric for compositionality reduces prior bias.
The interplay between images and texts is a second source of visual-linguistic compositionality, alongside linguistic priors.
Abstract
Compositionality is a common property in many modalities including natural languages and images, but the compositional generalization of multi-modal models is not well understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image. We also propose a new metric for compositionality without such linguistic priors.
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
