Weakly Supervised Relative Spatial Reasoning for Visual Question Answering
Pratyay Banerjee, Tejas Gokhale, Yezhou Yang, Chitta Baral

TL;DR
This paper evaluates the spatial reasoning abilities of vision-and-language models and introduces weakly supervised training objectives to improve their understanding of relative object locations, significantly enhancing VQA performance.
Contribution
It proposes weak supervision methods for spatial reasoning in V&L models, addressing their limitations in understanding relative object positions.
Findings
Improved accuracy on the GQA visual question answering benchmark.
Enhanced relative spatial reasoning capabilities.
State-of-the-art transformer models still lack spatial understanding.
Abstract
Vision-and-language (V&L) reasoning necessitates perception of visual concepts such as objects and actions, understanding semantics and language grounding, and reasoning about the interplay between the two modalities. One crucial aspect of visual reasoning is spatial understanding, which involves understanding relative locations of objects, i.e. implicitly learning the geometry of the scene. In this work, we evaluate the faithfulness of V&L models to such geometric understanding, by formulating the prediction of pair-wise relative locations of objects as a classification as well as a regression task. Our findings suggest that state-of-the-art transformer-based V&L models lack sufficient abilities to excel at this task. Motivated by this, we design two objectives as proxies for 3D spatial reasoning (SR): object centroid estimation, and relative position estimation, and train V&L…
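The abstract's framing of pair-wise relative location as both a regression and a classification task can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes 2D bounding boxes normalized to [0, 1] (the paper's objectives are described as proxies for 3D reasoning, so a depth coordinate would presumably be handled as well), and the names `centroid`, `relative_position`, `to_bin`, `pairwise_targets`, and the equal-width binning scheme are hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a box is (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def centroid(box: Box) -> Tuple[float, float]:
    """Center point of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def relative_position(a: Box, b: Box) -> Tuple[float, float]:
    """Regression target: centroid of b minus centroid of a, each in [-1, 1]."""
    ax, ay = centroid(a)
    bx, by = centroid(b)
    return (bx - ax, by - ay)


def to_bin(delta: float, num_bins: int) -> int:
    """Classification target: map a delta in [-1, 1] to one of num_bins
    equal-width bins (hypothetical binning, clamped at the edges)."""
    scaled = (delta + 1.0) / 2.0 * num_bins
    return min(max(int(scaled), 0), num_bins - 1)


def pairwise_targets(
    boxes: List[Box], num_bins: int = 3
) -> Dict[Tuple[int, int], Dict[str, tuple]]:
    """Build regression and classification targets for every ordered pair."""
    targets: Dict[Tuple[int, int], Dict[str, tuple]] = {}
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            dx, dy = relative_position(a, b)
            targets[(i, j)] = {
                "regression": (dx, dy),
                "classification": (to_bin(dx, num_bins), to_bin(dy, num_bins)),
            }
    return targets
```

With three bins per axis, the class indices can be read as coarse relations such as "left of / roughly aligned / right of", while the regression target keeps the continuous offset between the two object centroids.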
