Knowing Earlier what Right Means to You: A Comprehensive VQA Dataset for Grounding Relative Directions via Multi-Task Learning
Kyra Ahrens, Matthias Kerzel, Jae Hee Lee, Cornelius Weber, Stefan Wermter

TL;DR
This paper introduces GRiD-A-3D, a new diagnostic VQA dataset for grounding relative directions. It enables fine-grained analysis of models' spatial reasoning, requires fewer computational resources for training than existing datasets, and is used to show that models learn the reasoning subtasks in an intuitive order.
Contribution
The paper presents GRiD-A-3D, a novel dataset for evaluating spatial reasoning in VQA, along with an analysis of model learning dynamics on this dataset.
Findings
Models quickly learn sub-tasks like object recognition and orientation estimation.
Training on GRiD-A-3D requires fewer computational resources than training on existing datasets.
Models develop an intuitive reasoning order over epochs.
Abstract
Spatial reasoning poses a particular challenge for intelligent agents and is at the same time a prerequisite for their successful interaction and communication in the physical world. One such reasoning task is to describe the position of a target object with respect to the intrinsic orientation of some reference object via relative directions. In this paper, we introduce GRiD-A-3D, a novel diagnostic visual question-answering (VQA) dataset based on abstract objects. Our dataset allows for a fine-grained analysis of end-to-end VQA models' capabilities to ground relative directions. At the same time, model training requires considerably fewer computational resources compared with existing datasets, yet yields a comparable or even higher performance. Along with the new dataset, we provide a thorough evaluation based on two widely known end-to-end VQA architectures trained on GRiD-A-3D. We…
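The task described in the abstract, locating a target relative to the intrinsic orientation of a reference object, can be illustrated with a small geometric sketch. This is not the dataset's generation procedure: it assumes a simplified 2D top-down view, and the function name `relative_direction`, the heading convention (radians, counterclockwise from the +x axis), and the four direction labels are all hypothetical choices made for illustration.

```python
import math

def relative_direction(ref_pos, ref_heading, target_pos):
    """Classify where the target lies in the reference object's own frame.

    Assumptions (illustrative only): 2D coordinates, and ref_heading is the
    direction the reference object faces, in radians, measured
    counterclockwise from the +x axis.
    """
    dx = target_pos[0] - ref_pos[0]
    dy = target_pos[1] - ref_pos[1]
    # Project the offset onto the reference's forward axis and left axis.
    forward = dx * math.cos(ref_heading) + dy * math.sin(ref_heading)
    left = -dx * math.sin(ref_heading) + dy * math.cos(ref_heading)
    # Pick the label of the dominant axis.
    if abs(forward) >= abs(left):
        return "in front of" if forward >= 0 else "behind"
    return "left of" if left > 0 else "right of"

# The same target gets a different answer as the reference object turns.
for heading in (0.0, math.pi / 2, math.pi, -math.pi / 2):
    print(round(heading, 2), relative_direction((0, 0), heading, (1, 0)))
```

The loop shows why the answer cannot be read off the image positions alone: the target stays at the same place, yet the correct relative direction changes with the reference object's orientation, which is why orientation estimation is one of the subtasks a model has to solve.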
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
