Can you even tell left from right? Presenting a new challenge for VQA
Sai Raam Venkatraman, Rishi Rao, S. Balasubramanian, Chandra Sekhar Vorugunti, R. Raghunatha Sarma

TL;DR
This paper introduces UOUC, a synthetic VQA dataset designed to evaluate and improve models' compositional generalisation, reasoning, and memorisation abilities, addressing limitations of existing datasets.
Contribution
The creation of UOUC, a large, well-separated synthetic dataset with diverse questions to challenge and evaluate VQA models' compositional and reasoning skills.
Findings
Current VQA models show poor compositional generalisation.
Models perform relatively worse on simple reasoning tasks.
UOUC is a strong benchmark for future VQA research.
Abstract
Visual Question Answering (VQA) needs a means of evaluating the strengths and weaknesses of models. One aspect of such an evaluation is compositional generalisation: the ability of a model to answer well on scenes whose setups differ from those in the training set. For this purpose, we need datasets whose train and test sets differ significantly in composition. In this work, we present several quantitative measures of compositional separation and find that popular datasets for VQA are not good evaluators. To solve this, we present Uncommon Objects in Unseen Configurations (UOUC), a synthetic dataset for VQA. UOUC is fairly complex while also being compositionally well-separated. The object classes of UOUC comprise 380 classes taken from 528 characters from the Dungeons and Dragons game. The train set of UOUC consists of 200,000 scenes;…
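To make the idea of compositional separation concrete, here is a minimal sketch of one possible measure: the fraction of object-class pairs co-occurring in test scenes that never co-occur in any training scene. This is an illustrative assumption, not one of the measures defined in the paper; the function names and the class names in the toy scenes are hypothetical.

```python
from itertools import combinations


def pair_set(scenes):
    """Collect every unordered pair of object classes that co-occurs in a scene."""
    pairs = set()
    for scene in scenes:
        # sorted(set(...)) gives each pair a canonical order, so (a, b) == (b, a)
        pairs.update(combinations(sorted(set(scene)), 2))
    return pairs


def unseen_pair_fraction(train_scenes, test_scenes):
    """Fraction of test co-occurrence pairs that never appear in training.

    Hypothetical separation score: 0.0 means every test pair was seen in
    training, 1.0 means no test pair was seen.
    """
    train_pairs = pair_set(train_scenes)
    test_pairs = pair_set(test_scenes)
    if not test_pairs:
        return 0.0  # no pairs to evaluate, so nothing is unseen
    return len(test_pairs - train_pairs) / len(test_pairs)


# Toy scenes, each a list of object-class labels (hypothetical names).
train = [["goblin", "dragon"], ["goblin", "elf", "orc"]]
test = [["dragon", "elf"], ["goblin", "elf"]]
print(unseen_pair_fraction(train, test))  # 0.5
```

A higher score under this sketch would indicate that the test set asks a model to handle object combinations it has not seen together, which is the property a well-separated benchmark is meant to have.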
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
