How Do LLMs and VLMs Understand Viewpoint Rotation Without Vision? An Interpretability Study

Zhen Yang; Ping Jian; Zhongbin Guo; Zuming Zhang; Chengzhi Li; Yonghong Deng; Xinyue Zhang; Wenpeng Lu

arXiv:2604.15294·cs.AI·April 17, 2026

TL;DR

This study evaluates how well large language and vision-language models understand viewpoint rotation without visual input, revealing significant gaps and proposing targeted fine-tuning to improve their spatial reasoning capabilities.

Contribution

The paper introduces a novel dataset and analysis methods for assessing spatial viewpoint understanding in text-only models, and demonstrates effective fine-tuning of attention heads to enhance performance.

Findings

01. Models encode viewpoint info but struggle to link it with observations.
02. Models perform poorly compared to humans on viewpoint rotation tasks.
03. Selective fine-tuning of attention heads improves model performance.
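
To make the third finding concrete, here is a minimal sketch of what "selective" fine-tuning can mean: one gradient step that updates only a chosen subset of attention heads and leaves the rest frozen. This is an illustration under stated assumptions, not the paper's procedure. The `(layer, head)` addressing, the flat per-head weight lists, the function name `masked_sgd_step`, and the plain SGD update are all hypothetical; this listing does not say how the authors select heads or which optimizer they use.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical addressing scheme: (layer index, head index).
HeadId = Tuple[int, int]


def masked_sgd_step(
    weights: Dict[HeadId, List[float]],
    grads: Dict[HeadId, List[float]],
    trainable: Set[HeadId],
    lr: float = 0.5,
) -> Dict[HeadId, List[float]]:
    """Apply one SGD step to the heads in `trainable`; copy all others unchanged."""
    updated: Dict[HeadId, List[float]] = {}
    for head, w in weights.items():
        if head in trainable:
            # Selected head: ordinary gradient descent update.
            updated[head] = [wi - lr * gi for wi, gi in zip(w, grads[head])]
        else:
            # Frozen head: its gradient is ignored.
            updated[head] = list(w)
    return updated


# Toy example with two heads in layer 0; only head (0, 1) is trainable.
toy_weights = {(0, 0): [1.0, 2.0], (0, 1): [3.0, 4.0]}
toy_grads = {(0, 0): [2.0, 2.0], (0, 1): [2.0, 2.0]}
toy_updated = masked_sgd_step(toy_weights, toy_grads, trainable={(0, 1)})
```

In a real framework the same effect is usually achieved by masking gradients or by giving the optimizer only the selected parameters. The dictionary form here just keeps the idea visible.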

Abstract

Over the past year, spatial intelligence has drawn increasing attention. Many prior works study it from the perspective of visual-spatial intelligence, where models have access to visuospatial information from visual inputs. However, whether linguistic intelligence alone is sufficient to endow models with spatial intelligence in the absence of visual information, and how models perform the relevant tasks with text-only inputs, remain unexplored. Therefore, in this paper, we focus on a fundamental and critical capability in spatial intelligence from a linguistic perspective: viewpoint rotation understanding (VRU). Specifically, LLMs and VLMs are asked to infer their final viewpoint and predict the corresponding observation in an environment, given textual descriptions of viewpoint rotations and observations over multiple steps. We find that both LLMs and VLMs perform poorly on our proposed…
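
The VRU task described in the abstract can be illustrated with a small reference solver: track the viewpoint through a sequence of rotations, then look up what is visible from the final heading. This is only a sketch under simplifying assumptions that do not come from the paper: rotations are multiples of 90 degrees, the scene has one object per compass direction, and the names `final_heading`, `final_observation`, and `scene` are hypothetical. The dataset's actual format is not described in this listing.

```python
from typing import Dict, List, Tuple

# Compass headings in clockwise order (an assumption of this sketch).
DIRECTIONS = ["north", "east", "south", "west"]


def final_heading(start: str, rotations: List[Tuple[str, int]]) -> str:
    """Return the heading after applying each (turn, degrees) rotation in order."""
    idx = DIRECTIONS.index(start)
    for turn, degrees in rotations:
        if degrees % 90 != 0:
            raise ValueError("this sketch only handles multiples of 90 degrees")
        steps = degrees // 90
        if turn == "right":
            idx += steps  # clockwise
        elif turn == "left":
            idx -= steps  # counter-clockwise
        else:
            raise ValueError(f"unknown turn direction: {turn!r}")
        idx %= len(DIRECTIONS)
    return DIRECTIONS[idx]


def final_observation(
    start: str, rotations: List[Tuple[str, int]], scene: Dict[str, str]
) -> str:
    """Return the object visible from the final heading."""
    return scene[final_heading(start, rotations)]


# Toy scene: one object in each direction around the observer.
scene = {"north": "door", "east": "window", "south": "sofa", "west": "bookshelf"}
rotations = [("right", 90), ("left", 180), ("right", 270)]

heading = final_heading("north", rotations)
observation = final_observation("north", rotations, scene)
print(heading, observation)  # prints "south sofa"
```

A symbolic solver like this handles the state tracking trivially, which is what makes the reported gap notable: the models must carry out the same bookkeeping implicitly from text.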

Code & Models

Repositories

Young-Zhen/VRU_Interpret (GitHub)
