Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs
Insu Lee, Wooje Park, Jaeyun Jang, Minyoung Noh, Kyuhong Shim, Byonghyo Shim

TL;DR
This paper enhances scene understanding in vision-language models by integrating egocentric and exocentric views, introducing a new benchmark and a training-free prompting method that improves multi-view reasoning performance.
Contribution
The paper introduces E3VQA, a multi-view question answering benchmark, and M3CoT, a training-free prompting technique for multi-view scene reasoning in LVLMs.
Findings
M3CoT improves reasoning accuracy by approximately 5% over baseline methods.
E3VQA provides high-quality multi-view question-answer pairs for benchmarking.
Multi-view integration enhances LVLMs' understanding of complex scenes.
Abstract
Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where a first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a…
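To make the benchmark setup described in the abstract concrete, the following is a minimal sketch of how one question-answer item tied to a synchronized ego-exo image pair might be represented and scored. It is not the authors' data format or evaluation code. The class name `EgoExoQAItem`, its field names, the multiple-choice layout, and the `accuracy` helper are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class EgoExoQAItem:
    """Hypothetical record: one synchronized ego-exo image pair plus one QA pair."""
    ego_image_path: str        # first-person (head-mounted camera) frame
    exo_image_path: str        # third-person frame of the same moment
    question: str
    choices: Tuple[str, ...]   # assumed multiple-choice format
    answer_index: int          # index into `choices` of the correct answer


def accuracy(
    items: Sequence[EgoExoQAItem],
    predict: Callable[[EgoExoQAItem], int],
) -> float:
    """Fraction of items where the predicted choice index matches the answer."""
    if not items:
        return 0.0
    correct = sum(1 for item in items if predict(item) == item.answer_index)
    return correct / len(items)
```

Here `predict` stands in for any model call that receives both views and returns a choice index. Comparing the accuracy of two such callables on the same items is one simple way to measure the kind of gain reported under Findings.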
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
