Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee; Wooje Park; Jaeyun Jang; Minyoung Noh; Kyuhong Shim; Byonghyo Shim

arXiv:2505.21955·cs.CV·October 27, 2025

Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

PDF

Open Access

TL;DR

This paper enhances scene understanding in vision-language models by integrating egocentric and exocentric views, introducing a new benchmark and a training-free prompting method that improves multi-view reasoning performance.

Contribution

It introduces E3VQA, a new multi-view question answering benchmark, and M3CoT, a novel prompting technique for better multi-view scene reasoning in LVLMs.

Findings

01

M3CoT improves reasoning accuracy by approximately 5% over baseline methods.

02

E3VQA provides high-quality multi-view question-answer pairs for benchmarking.

03

Multi-view integration enhances LVLMs' understanding of complex scenes.

Abstract

Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis