Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR

Zhenyang Li; Yangyang Guo; Kejie Wang; Xiaolin Chen; Liqiang Nie; Mohan Kankanhalli

arXiv:2405.16934·cs.CV·May 28, 2024

TL;DR

This paper critically evaluates Vision-Language Transformers in the context of Visual Commonsense Reasoning, revealing they lack true visual commonsense and highlighting key shortcomings in current models and datasets.

Contribution

It provides an empirical analysis showing that VL Transformers do not effectively exhibit visual commonsense and identifies specific limitations impacting their performance.

Findings

01

Small gains from pre-training for VL Transformers

02

Presence of unexpected language bias in models

03

Neglect of object-tag correlation in current architectures

Abstract

Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on some generic large-scale vision-text datasets, and then the learned representations are transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the…
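The coupling between the answer and rationale sub-tasks can be made concrete with a small scoring sketch. It assumes the usual VCR benchmark setup, which this abstract does not spell out: each example has a multiple-choice answer and a multiple-choice rationale, and a joint score counts an example only when both are correct. The function name `vcr_accuracies` and the keys `q2a`, `qa2r`, and `q2ar` are hypothetical labels chosen for illustration, not names taken from the paper.

```python
from typing import Dict, Sequence


def vcr_accuracies(
    answer_pred: Sequence[int],
    answer_gold: Sequence[int],
    rationale_pred: Sequence[int],
    rationale_gold: Sequence[int],
) -> Dict[str, float]:
    """Score answer selection, rationale selection, and their conjunction.

    Hypothetical helper: each sequence holds one chosen option index per example.
    """
    n = len(answer_gold)
    if n == 0:
        raise ValueError("need at least one example")
    if not (len(answer_pred) == len(rationale_pred) == len(rationale_gold) == n):
        raise ValueError("all inputs must have the same length")

    # Per-example correctness for each sub-task.
    answer_hits = [p == g for p, g in zip(answer_pred, answer_gold)]
    rationale_hits = [p == g for p, g in zip(rationale_pred, rationale_gold)]

    # The joint score requires both the answer and the rationale to be right.
    joint_hits = [a and r for a, r in zip(answer_hits, rationale_hits)]

    return {
        "q2a": sum(answer_hits) / n,
        "qa2r": sum(rationale_hits) / n,
        "q2ar": sum(joint_hits) / n,
    }


if __name__ == "__main__":
    scores = vcr_accuracies(
        answer_pred=[0, 1, 2, 3],
        answer_gold=[0, 1, 2, 0],
        rationale_pred=[1, 1, 0, 2],
        rationale_gold=[1, 0, 0, 2],
    )
    print(scores)  # {'q2a': 0.75, 'qa2r': 0.75, 'q2ar': 0.5}
```

In this toy run each sub-task is right on three of four examples, yet only two examples are right on both, so the joint score is lower than either individual score. That gap is one reason a model can look strong on each sub-task without reliably justifying its own answers.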

Taxonomy

Topics: Tactile and Sensory Interactions