An Empirical Study on the Language Modal in Visual Question Answering

Daowan Peng; Wei Wei; Xian-Ling Mao; Yuanyuan Fu; Dangyang Chen

arXiv:2305.10143 · cs.AI · September 6, 2023 · 1 cites

TL;DR

This empirical study investigates how the language modality influences VQA model performance. It identifies postfix-related bias in addition to question-type bias, and shows that training on word-sequence variants of questions improves out-of-distribution generalization without dedicated debiasing techniques.

Contribution

The paper offers new insights into language-related biases in VQA models and shows that training on word-sequence variants of questions improves out-of-distribution performance without dedicated debiasing methods.

Findings

01. Postfix-related bias significantly affects VQA performance.
02. Training with word-sequence variants improves out-of-distribution accuracy.
03. LXMERT achieved a 10-point gain on the out-of-distribution benchmark without debiasing methods.

Abstract

Generalization beyond in-domain experience to out-of-distribution data is of paramount importance in AI. State-of-the-art Visual Question Answering (VQA) models have recently shown impressive performance on in-domain data, partly because of language-prior bias, which in practice hinders generalization. This paper provides new insights into the influence of the language modality on VQA performance through an empirical study. To this end, we conducted a series of experiments on six models. The results reveal that 1) apart from the prior bias caused by question types, postfix-related bias also has a notable influence, and 2) training VQA models on word-sequence-related variant questions improves performance on the out-of-distribution benchmark, and LXMERT even achieved a…
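To make the notion of prefix (question-type) versus postfix bias concrete, the following is a minimal sketch of a question-only baseline that ignores the image and predicts the most frequent training answer for a given key. It is not the paper's protocol: the function names, the choice of the first or last two words as the key, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def _words(question):
    # Lowercase and drop the trailing question mark before splitting.
    return question.lower().rstrip("?").split()


def key_prefix(question, k=2):
    # Hypothetical "question type" key: the first k words.
    return " ".join(_words(question)[:k])


def key_postfix(question, k=2):
    # Hypothetical "postfix" key: the last k words.
    return " ".join(_words(question)[-k:])


def prior_accuracy(train, test, key_fn):
    """Accuracy of a blind baseline that answers with the most
    frequent training answer observed for each key."""
    counts = defaultdict(Counter)
    for question, answer in train:
        counts[key_fn(question)][answer] += 1
    # Fallback for unseen keys: the overall most frequent answer.
    fallback = Counter(a for _, a in train).most_common(1)[0][0]
    correct = 0
    for question, answer in test:
        seen = counts.get(key_fn(question))
        pred = seen.most_common(1)[0][0] if seen else fallback
        correct += pred == answer
    return correct / len(test)


# Toy data (invented for illustration only).
toy_train = [
    ("What color is the banana?", "yellow"),
    ("What color is the grass?", "green"),
    ("What color is the grass?", "green"),
    ("Is this a banana?", "yes"),
    ("Is this a dog?", "no"),
]
toy_test = [
    ("What color is the banana?", "yellow"),
    ("What color is the grass?", "green"),
]

prefix_acc = prior_accuracy(toy_train, toy_test, key_prefix)
postfix_acc = prior_accuracy(toy_train, toy_test, key_postfix)
```

On this toy data the postfix key predicts both test answers while the prefix key predicts only one, which illustrates how the end of a question can carry answer priors that a question-type key misses. A high blind accuracy on a real dataset would suggest that a model can exploit the corresponding language shortcut without looking at the image.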

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Learning Cross-Modality Encoder Representations from Transformers (LXMERT)