MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Y. Tang; Pinxin Liu; Zhangyun Tan; Mingqian Feng; Rui Mao; Chao Huang; Jing Bi; Yunzhong Xiao; Susan Liang; Hang Hua; Ali Vosoughi; Luchuan Song; Zeliang Zhang; Chenliang Xu

arXiv:2505.20426·cs.CV·November 26, 2025



TL;DR

MMPerspective is a comprehensive benchmark designed to evaluate multimodal large language models' understanding of perspective, revealing their strengths and limitations in perception, reasoning, and robustness across diverse tasks.

Contribution

This work introduces the first systematic benchmark for assessing perspective understanding in MLLMs, covering perception, reasoning, and robustness with real-world and synthetic data.

Findings

01. Models perform well on perceptual tasks but struggle with reasoning.

02. Models show significant limitations in spatial reasoning and in invariance to perspective-preserving transformations.

03. Chain-of-thought prompting improves perspective reasoning.

Abstract

Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs' understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, and invariance to perspective-preserving transformations. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on…


Code & Models

Repositories

yunlong10/MMPerspective (Official)


Taxonomy

Topics: Natural Language Processing Techniques