TL;DR
This study evaluates how well large language models (LLMs) and vision-language models (VLMs) understand viewpoint rotation from text alone, without visual input. It reveals significant gaps and proposes targeted fine-tuning to improve their spatial reasoning.
Contribution
The paper introduces a novel dataset and analysis methods for assessing viewpoint rotation understanding from text-only inputs, and shows that fine-tuning selected attention heads improves performance.
Findings
Models encode viewpoint information but struggle to link it with the corresponding observations.
Models perform poorly compared to humans on viewpoint rotation tasks.
Selective fine-tuning of attention heads improves model performance.
Abstract
Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence in the absence of visual information, and how models perform the relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment, given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed…
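To make the task in the abstract concrete, here is a minimal sketch of what a text-only viewpoint rotation problem might look like when solved symbolically. It is an illustration only, not the paper's dataset format or evaluation code: the four cardinal directions, the restriction to 90-degree turns, the `scene` layout, and the function names `final_viewpoint` and `predict_observation` are all assumptions made for this example.

```python
# Hypothetical illustration of a viewpoint-rotation task.
# The directions, scene layout, and function names are assumptions,
# not taken from the paper.

DIRECTIONS = ["north", "east", "south", "west"]  # clockwise order


def final_viewpoint(start, rotations):
    """Apply a sequence of (turn, degrees) steps and return the final facing."""
    index = DIRECTIONS.index(start)
    for turn, degrees in rotations:
        if degrees % 90 != 0:
            raise ValueError("this sketch only supports multiples of 90 degrees")
        steps = degrees // 90
        if turn == "right":
            index += steps  # clockwise
        elif turn == "left":
            index -= steps  # counter-clockwise
        else:
            raise ValueError(f"unknown turn: {turn!r}")
        index %= len(DIRECTIONS)
    return DIRECTIONS[index]


def predict_observation(scene, start, rotations):
    """Return the object visible from the final viewpoint."""
    return scene[final_viewpoint(start, rotations)]


# Assumed scene: one object in each cardinal direction.
scene = {"north": "door", "east": "window", "south": "bookshelf", "west": "sofa"}
rotations = [("right", 90), ("right", 180), ("left", 90)]

print(final_viewpoint("north", rotations))             # south
print(predict_observation(scene, "north", rotations))  # bookshelf
```

The two functions mirror the two parts of the task: tracking the final viewpoint, then linking that viewpoint to an observation. The second step corresponds to the link that the findings above say models struggle with.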
