Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR
Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, Mohan Kankanhalli

TL;DR
This paper critically evaluates Vision-Language Transformers in the context of Visual Commonsense Reasoning, revealing they lack true visual commonsense and highlighting key shortcomings in current models and datasets.
Contribution
It provides an empirical analysis showing that VL Transformers do not effectively exhibit visual commonsense and identifies specific limitations impacting their performance.
Findings
Limited gains from pre-training VL Transformers
Presence of unexpected language bias in models
Neglect of object-tag correlation in current architectures
Abstract
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes. To achieve this goal, a model is required to provide an acceptable rationale as the reason for the predicted answers. Progress on the benchmark dataset stems largely from the recent advancement of Vision-Language Transformers (VL Transformers). These models are first pre-trained on some generic large-scale vision-text datasets, and then the learned representations are transferred to the downstream VCR task. Despite their attractive performance, this paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR. In particular, our empirical results pinpoint several shortcomings of existing VL Transformers: small gains from pre-training, unexpected language bias, limited model architecture for the two inseparable sub-tasks, and neglect of the…
Taxonomy
Topics: Tactile and Sensory Interactions
