Broaden the Vision: Geo-Diverse Visual Commonsense Reasoning

Da Yin; Liunian Harold Li; Ziniu Hu; Nanyun Peng; Kai-Wei Chang

arXiv:2109.06860 · cs.CL · September 15, 2021

TL;DR

This paper introduces GD-VCR, a dataset to evaluate vision-and-language models' understanding of culturally and geographically specific commonsense, revealing significant performance gaps across regions and scenarios.

Contribution

The paper creates a new geo-diverse visual commonsense dataset and analyzes the limitations of existing models in understanding regional cultural differences.

Findings

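01 Models trained on VCR perform worse on images from non-Western regions than on images from Western regions.

02 The performance drop is larger on culture-related questions than on other questions.

03 Questions that require high-level geo-diverse commonsense reasoning are harder for models than questions that need only low-level perception and recognition.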
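
As a rough illustration of how a regional performance gap like the one in the findings could be measured, the sketch below computes per-region accuracy and the gap to a reference region. It is not the authors' evaluation code: the function names, the record layout, the region labels, and the toy predictions are all assumptions made up for this example.

```python
from collections import defaultdict


def accuracy_by_region(records):
    """Compute accuracy per region.

    `records` is an iterable of (region, predicted_index, gold_index)
    tuples. This layout is a hypothetical one chosen for the sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, predicted, gold in records:
        total[region] += 1
        correct[region] += int(predicted == gold)
    return {region: correct[region] / total[region] for region in total}


def gap_to_reference(accuracies, reference):
    """Return reference accuracy minus each other region's accuracy."""
    ref = accuracies[reference]
    return {r: ref - a for r, a in accuracies.items() if r != reference}


# Made-up toy predictions for illustration only, not results from the paper.
toy_records = [
    ("West", 0, 0), ("West", 1, 1), ("West", 2, 2), ("West", 3, 0),
    ("East Asia", 0, 0), ("East Asia", 1, 2),
    ("East Asia", 3, 3), ("East Asia", 2, 0),
]

acc = accuracy_by_region(toy_records)
gaps = gap_to_reference(acc, reference="West")
print(acc)
print(gaps)
```

A positive gap for a region means the model answered fewer questions correctly there than in the reference region. The same grouping could be applied to question categories instead of regions.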

Abstract

Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic locations and they are only shared locally. For example, the scenarios of wedding ceremonies vary across regions due to different customs influenced by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art Vision-and-Language models, VisualBERT and ViLBERT trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models can generalize to answering the…

Code & Models

Repositories

wadeyin9712/gd-vcr
PyTorch · Official

Datasets

CulTex-VLM/EC-VCR
dataset · 190 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: VisualBERT · Vision-and-Language BERT