EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs
Yuping He, Yifei Huang, Guo Chen, Baoqi Pei, Jilan Xu, Tong Lu, Jiangmiao Pang

TL;DR
EgoExoBench is a benchmark for evaluating how well multimodal large language models (MLLMs) understand and reason across first-person (egocentric) and third-person (exocentric) video, a capability the authors describe as previously unexplored.
Contribution
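This paper introduces EgoExoBench, which the authors present as the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, it comprises over 7,300 question-answer pairs across eleven sub-tasks and is used to evaluate the cross-view reasoning of 13 state-of-the-art MLLMs.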
Findings
Current models perform well on single-view tasks but struggle with cross-view semantic alignment.
Models have difficulty associating viewpoints accurately across egocentric and exocentric videos.
Temporal reasoning in cross-view contexts remains a significant challenge for existing models.
Abstract
Transferring and integrating knowledge across first-person (egocentric) and third-person (exocentric) viewpoints is intrinsic to human intelligence, enabling humans to learn from others and convey insights from their own experiences. Despite rapid progress in multimodal large language models (MLLMs), their ability to perform such cross-view reasoning remains unexplored. To address this, we introduce EgoExoBench, the first benchmark for egocentric-exocentric video understanding and reasoning. Built from publicly available datasets, EgoExoBench comprises over 7,300 question-answer pairs spanning eleven sub-tasks organized into three core challenges: semantic alignment, viewpoint association, and temporal reasoning. We evaluate 13 state-of-the-art MLLMs and find that while these models excel on single-view tasks, they struggle to align semantics across perspectives, accurately associate…
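To make the benchmark's organization concrete, the following is a minimal sketch of how per-challenge accuracy could be aggregated over question-answer pairs. It is not the authors' evaluation code. The `QAItem` record, the `accuracy_by_challenge` function, the sub-task labels, and the assumption that each answer is a single option letter are all hypothetical; only the three challenge names come from the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAItem:
    """One question-answer pair (hypothetical schema, not from the paper)."""
    challenge: str  # e.g. "semantic alignment", "viewpoint association", "temporal reasoning"
    subtask: str    # placeholder label; the abstract does not name the eleven sub-tasks
    answer: str     # ground-truth option, assumed here to be a single letter


def accuracy_by_challenge(items, predictions):
    """Return {challenge: fraction of items answered correctly}."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.challenge] += 1
        # Compare case-insensitively and ignore surrounding whitespace.
        if pred.strip().upper() == item.answer.strip().upper():
            correct[item.challenge] += 1

    return {challenge: correct[challenge] / total[challenge] for challenge in total}


if __name__ == "__main__":
    items = [
        QAItem("semantic alignment", "subtask_1", "A"),
        QAItem("semantic alignment", "subtask_1", "B"),
        QAItem("temporal reasoning", "subtask_2", "C"),
    ]
    predictions = ["a", "C", "c "]
    print(accuracy_by_challenge(items, predictions))
```

Reporting accuracy per challenge rather than as a single pooled number keeps a weak area, such as temporal reasoning, from being hidden by stronger results elsewhere.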
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Data Compression Techniques · Video Analysis and Summarization
