An Examination of the Compositionality of Large Generative Vision-Language Models

Teli Ma; Rong Li; Junwei Liang

arXiv:2308.10509 · cs.CL · April 2, 2024

TL;DR

This paper critically evaluates the compositional reasoning abilities of large generative vision-language models (GVLMs), identifies syntactical bias in current benchmarks, and introduces SADE, a new benchmark built to mitigate that bias and better assess the models' true capabilities.

Contribution

It introduces a SyntaxBias Score to quantify syntactical bias in benchmarks, proposes SADE as a more robust benchmark, and provides insights into the limitations of current evaluation metrics for GVLMs.

Findings

01. Current benchmarks are biased towards syntactical correctness.
02. VisualGPTScore is insufficient for evaluating compositionality.
03. SADE offers an unbiased evaluation framework.

Abstract

With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a SyntaxBias Score, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose…
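
To make the idea of syntactical bias concrete, the following is a minimal sketch of a text-only check on a caption benchmark. It is not the paper's SyntaxBias Score, whose definition is not given here. The data layout (pairs of a positive caption and a hard negative), the function `text_only_preference_rate`, and the stand-in scorer `toy_fluency_score` are all hypothetical names introduced for illustration. The intuition it shows: if a scorer that never sees the image still prefers the positive caption well above chance, the benchmark can be solved on linguistic plausibility alone.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical data layout: (positive_caption, hard_negative_caption).
Pair = Tuple[str, str]


def text_only_preference_rate(
    pairs: Iterable[Pair], score: Callable[[str], float]
) -> float:
    """Fraction of pairs where a text-only scorer prefers the positive caption.

    A rate near 0.5 suggests the negatives are as plausible as the positives;
    a rate near 1.0 suggests the benchmark leaks the answer through the text.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one caption pair")
    wins = sum(1 for pos, neg in pairs if score(pos) > score(neg))
    return wins / len(pairs)


def toy_fluency_score(caption: str) -> float:
    """Stand-in for an LLM log-likelihood (an assumption, not a real model).

    It only penalizes a few hand-picked implausible word pairs.
    """
    implausible = {("grass", "eats"), ("ball", "kicks")}
    words = caption.lower().split()
    penalty = sum(1 for bigram in zip(words, words[1:]) if bigram in implausible)
    return -float(penalty)


pairs = [
    ("the horse eats the grass", "the grass eats the horse"),
    ("the boy kicks the ball", "the ball kicks the boy"),
    ("a dog to the left of a cat", "a cat to the left of a dog"),
]
rate = text_only_preference_rate(pairs, toy_fluency_score)
print(f"text-only preference rate: {rate:.2f}")
```

In this toy set, the first two negatives are implausible on their face, so the text-only scorer picks the positive without any image; the third pair is equally plausible either way and cannot be resolved from text. In practice the scorer would be replaced by a real language model's likelihood, which is an assumption about usage rather than something this page specifies.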

Code & Models

Repositories

teleema/sade (Official)

Videos

An Examination of the Compositionality of Large Generative Vision-Language Models · Underline

Taxonomy

Topics: Categorization, perception, and language

Methods: Focus · Contrastive Language-Image Pre-training