What is needed for simple spatial language capabilities in VQA?
Alexander Kuhnle, Ann Copestake

TL;DR
This paper investigates the core components needed for simple spatial language understanding in visual question answering models, analyzing various models and modifications to improve spatial reasoning capabilities.
Contribution
It provides a comparative analysis of models on spatial relation tasks and identifies key factors for enhancing spatial language abilities in VQA systems.
Findings
Targeted model modifications substantially improve spatial reasoning over the CNN-LSTM baseline.
Model performance varies notably across different spatial relation types.
The analysis offers insight into the minimal requirements for effective spatial language understanding.
Abstract
Visual question answering (VQA) comprises a variety of language capabilities. The diagnostic benchmark dataset CLEVR has fueled progress by helping to better assess and distinguish models in basic abilities like counting, comparing and spatial reasoning in vitro. Following this approach, we focus on spatial language capabilities and investigate the question: what are the key ingredients to handle simple visual-spatial relations? We look at the SAN, RelNet, FiLM and MC models and evaluate their learning behavior on diagnostic data which is solely focused on spatial relations. Via comparative analysis and targeted model modification we identify what really is required to substantially improve upon the CNN-LSTM baseline.
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
