Situational Awareness Matters in 3D Vision Language Reasoning
Yunze Man, Liang-Yan Gui, Yu-Xiong Wang

TL;DR
This paper introduces SIG3D, a novel model that enhances 3D vision language reasoning by incorporating situational awareness, enabling robots to better understand and interact within complex 3D environments.
Contribution
We propose SIG3D, an end-to-end model that grounds self-location and answers questions from the agent's perspective in 3D scenes, advancing the state-of-the-art in 3D vision language reasoning.
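Answering "from the agent's perspective" implies re-expressing scene coordinates relative to the estimated situation. The sketch below is only an illustration of that idea, not the paper's implementation: it assumes the situation is a 3D position plus a yaw angle about the vertical axis, and that the agent's forward direction is its local +x axis. The function name `to_agent_frame` is hypothetical.

```python
import numpy as np

def to_agent_frame(points, position, yaw):
    """Express world-frame points of shape (N, 3) in the agent's local frame.

    Assumptions (not from the paper): the agent pose is a position plus a
    yaw rotation about the z axis, and the agent looks along its local +x.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    # R maps agent-frame vectors to the world frame (rotation about z).
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    # p_agent = R^T (p_world - t); with row vectors this is (p - t) @ R.
    return (np.asarray(points, dtype=float) - np.asarray(position, dtype=float)) @ R

# An agent standing at (1, 0, 0) and facing the world +y direction.
agent_position = np.array([1.0, 0.0, 0.0])
agent_yaw = np.pi / 2
world_points = np.array([[1.0, 2.0, 0.0]])  # an object 2 m along world +y
local_points = to_agent_frame(world_points, agent_position, agent_yaw)
```

In this example the object ends up roughly 2 units along the agent's local +x axis, that is, directly in front of it. This is the kind of relation a situated question such as "what is in front of me?" depends on.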
Findings
SIG3D outperforms existing models in situation estimation accuracy by over 30%.
The model effectively integrates visual and textual tokens for improved reasoning.
Situational awareness significantly improves 3D question answering performance.
Abstract
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into a sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in…
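As a rough illustration of what turning a 3D scene into sparse voxels can look like, the sketch below groups points of a point cloud into occupied voxels and keeps one centroid per voxel. It is a generic NumPy voxelization under stated assumptions, not SIG3D's tokenizer: the function name `sparse_voxelize`, the voxel size, and the choice of centroids as per-voxel features are all hypothetical.

```python
import numpy as np

def sparse_voxelize(points, voxel_size=0.5):
    """Group points of shape (N, 3) into occupied voxels.

    Returns the integer grid coordinates of each occupied voxel and the
    centroid of the points that fall inside it. Empty voxels are never
    materialized, which is what makes the representation sparse.
    """
    points = np.asarray(points, dtype=float)
    # Integer grid index of the voxel containing each point.
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique occupied voxels, plus the voxel index of every point.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flatten; shape differs across NumPy versions
    # Accumulate point coordinates per voxel, then divide by the point count.
    sums = np.zeros((len(voxels), points.shape[1]))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=len(voxels))
    centroids = sums / counts[:, None]
    return voxels, centroids

cloud = np.array([[0.1, 0.1, 0.1],
                  [0.2, 0.2, 0.2],
                  [1.1, 0.1, 0.1]])
voxels, centroids = sparse_voxelize(cloud, voxel_size=0.5)
```

Here the first two points share a voxel and the third occupies its own, so three points reduce to two voxel entries. In a learned pipeline, each occupied voxel would typically carry a feature vector rather than a raw centroid.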
Taxonomy
Topics: Categorization, perception, and language
