Things not Written in Text: Exploring Spatial Commonsense from Visual Signals
Xiao Liu, Da Yin, Yansong Feng, Dongyan Zhao

TL;DR
This paper investigates whether visual signals improve spatial commonsense reasoning in AI models, finding that image synthesis models outperform language models in learning and applying spatial knowledge.
Contribution
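The study introduces a new spatial commonsense benchmark, covering the relative scales of objects and the positional relationships between people and objects under different actions, and demonstrates that image synthesis models learn spatial relationships better than text-based models.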
Findings
Image synthesis models outperform pretrained language models (PLMs) in spatial commonsense reasoning.
Spatial knowledge from image synthesis models aids natural language understanding.
The proposed benchmark effectively evaluates spatial commonsense.
Abstract
Spatial commonsense, the knowledge about spatial position and relationship between objects (like the relative size of a lion and a girl, and the position of a boy relative to a bicycle when cycling), is an important part of commonsense knowledge. Although pretrained language models (PLMs) succeed in many NLP tasks, they are shown to be ineffective in spatial commonsense reasoning. Starting from the observation that images are more likely to exhibit spatial commonsense than texts, we explore whether models with visual signals learn more spatial commonsense than text-based PLMs. We propose a spatial commonsense benchmark that focuses on the relative scales of objects, and the positional relationship between people and objects under different actions. We probe PLMs and models with visual signals, including vision-language pretrained models and image synthesis models, on this benchmark, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
