The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models

Chenwei Wu; Li Erran Li; Stefano Ermon; Patrick Haffner; Rong Ge; Zaiwei Zhang

arXiv:2310.02777·cs.CL·October 5, 2023

Open Access

TL;DR

This paper investigates how linguistic priors influence the measurement of compositional generalization in vision-language models, revealing that current improvements mainly rely on priors rather than visual information, and introduces a new metric to address this.

Contribution

The paper identifies the reliance on linguistic priors in current compositionality assessments and proposes a new metric that minimizes this bias.

Findings

01

Current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image.

02

The proposed compositionality metric is designed to exclude linguistic priors.

03

The interplay between images and texts is identified as a second source of visual-linguistic compositionality, alongside linguistic priors.

Abstract

Compositionality is a common property in many modalities including natural languages and images, but the compositional generalization of multi-modal models is not well-understood. In this paper, we identify two sources of visual-linguistic compositionality: linguistic priors and the interplay between images and texts. We show that current attempts to improve compositional generalization rely on linguistic priors rather than on information in the image. We also propose a new metric for compositionality without such linguistic priors.

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling