An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma, Rong Li, Junwei Liang

TL;DR
This paper critically evaluates the compositional reasoning abilities of large generative vision-language models (GVLMs), identifies a syntactical bias in current benchmarks, and introduces a new bias-mitigated benchmark, SADE, to better assess their true capabilities.
Contribution
The paper introduces a SyntaxBias Score that uses LLMs to quantify syntactical bias in benchmarks, proposes a new, more robust benchmark, SADE, and provides insights into the limitations of current evaluation metrics for GVLMs.
Findings
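Current compositionality benchmarks contain a syntactical bias that GVLMs can exploit through their linguistic capability.
Because of this bias, VisualGPTScore is an insufficient metric for evaluating the compositionality of GVLMs.
SADE, built from bias-mitigated datasets and a new robustness task, offers a less biased evaluation framework.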
Abstract
With the success of Large Language Models (LLMs), many Generative Vision-Language Models (GVLMs) have been constructed via multimodal instruction tuning. However, the performance of GVLMs in multimodal compositional reasoning remains under-explored. In this paper, we examine both the evaluation metrics (VisualGPTScore, etc.) and current benchmarks for evaluating the compositionality of GVLMs. We identify the syntactical bias in current benchmarks, which is exploited by the linguistic capability of GVLMs. The bias renders VisualGPTScore an insufficient metric for assessing GVLMs. To combat this, we first introduce a SyntaxBias Score, leveraging LLMs to quantify such bias for mitigation. A challenging new task is subsequently added to evaluate the robustness of GVLMs against inherent inclination toward syntactical correctness. Using the bias-mitigated datasets and the new task, we propose…
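The idea of quantifying syntactical bias with an LLM can be illustrated with a small sketch. This is not the paper's SyntaxBias Score, whose exact definition is not given in this summary. It assumes a simple proxy: the gap between text-only log-likelihoods of the correct caption and its hard negative. The names CaptionPair, text_only_gap, filter_biased, and blind_accuracy are hypothetical, and the log-probabilities are made-up toy values rather than outputs of any real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaptionPair:
    """One benchmark item (hypothetical structure).

    The log-probabilities are assumed to come from a text-only LLM that
    never sees the image; here they are hand-written toy numbers.
    """
    positive: str
    negative: str
    pos_text_logprob: float
    neg_text_logprob: float


def text_only_gap(pair: CaptionPair) -> float:
    """Assumed bias proxy: how much more fluent the correct caption looks
    to a text-only model than the hard negative does."""
    return pair.pos_text_logprob - pair.neg_text_logprob


def filter_biased(pairs: List[CaptionPair], max_gap: float) -> List[CaptionPair]:
    """Keep only items whose captions are about equally plausible as text."""
    return [p for p in pairs if abs(text_only_gap(p)) <= max_gap]


def blind_accuracy(pairs: List[CaptionPair]) -> float:
    """Accuracy of a 'blind' scorer that picks the more fluent caption
    without looking at the image."""
    if not pairs:
        return 0.0
    return sum(text_only_gap(p) > 0 for p in pairs) / len(pairs)


# Toy data: the first negative is ungrammatical, so text alone gives it away.
pairs = [
    CaptionPair("a dog chases a cat", "a dog a cat chases", -12.0, -25.0),
    CaptionPair("a man rides a horse", "a horse rides a man", -14.0, -14.5),
    CaptionPair("red cup on a blue table", "blue cup on a red table", -16.0, -15.8),
]
kept = filter_biased(pairs, max_gap=1.0)

print(f"blind accuracy before filtering: {blind_accuracy(pairs):.2f}")
print(f"blind accuracy after filtering:  {blind_accuracy(kept):.2f}")
```

In this toy setup, filtering removes the item that a text-only scorer can solve from grammar alone, so the blind scorer drops toward chance on what remains. That is the intuition behind mitigating the bias before measuring compositionality. The threshold max_gap is an arbitrary illustrative choice.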
Taxonomy
Topics: Categorization, perception, and language
Methods: Focus · Contrastive Language-Image Pre-training
