Situational Awareness Matters in 3D Vision Language Reasoning

Yunze Man; Liang-Yan Gui; Yu-Xiong Wang

arXiv:2406.07544 · cs.CV · June 27, 2024

TL;DR

This paper introduces SIG3D, a model that improves 3D vision language reasoning by adding situational awareness: the agent first grounds its self-location from a language prompt, then answers questions about the 3D scene from that perspective.

Contribution

We propose SIG3D, an end-to-end model that grounds self-location and answers questions from the agent's perspective in 3D scenes, advancing the state-of-the-art in 3D vision language reasoning.

Findings

01. SIG3D outperforms existing models in situation estimation accuracy by over 30%.

02. The model effectively integrates visual and textual tokens for improved reasoning.

03. Situational awareness significantly improves 3D question answering performance.

Abstract

Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in…
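
The two ingredients named in the abstract, a sparse voxel representation of the scene and answering from the agent's own position, can be illustrated with a small sketch. This is not the SIG3D implementation. The function names `voxelize` and `to_agent_frame`, the yaw-only orientation, and the convention that the agent faces +x with +y to its left are all assumptions made for illustration. The sketch groups points into occupied voxels only, then re-expresses a world-frame point relative to a given agent pose.

```python
import math
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points into a sparse mapping: voxel index -> centroid.

    Only occupied voxels get an entry, which is what makes the
    representation sparse. (Hypothetical helper, not from the paper.)
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        buckets[key].append((x, y, z))
    # Average the points in each voxel, one axis at a time.
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }


def to_agent_frame(point, agent_pos, agent_yaw):
    """Express a world-frame point in the agent's frame.

    Assumed convention: the agent sits at the origin of its own frame,
    faces +x, has +y on its left, and agent_yaw is its heading in
    radians measured counter-clockwise from the world +x axis.
    """
    dx = point[0] - agent_pos[0]
    dy = point[1] - agent_pos[1]
    dz = point[2] - agent_pos[2]
    # Rotate by -yaw so the agent's heading aligns with +x.
    c, s = math.cos(-agent_yaw), math.sin(-agent_yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)


if __name__ == "__main__":
    scene = [(0.1, 0.1, 0.1), (0.3, 0.3, 0.3), (1.2, 0.0, 0.0)]
    tokens = voxelize(scene, voxel_size=1.0)
    # Agent stands at (1, 1, 0) and faces the world +y direction.
    for index, centroid in tokens.items():
        print(index, to_agent_frame(centroid, (1.0, 1.0, 0.0), math.pi / 2))
```

Under these assumptions, a question such as "what is on my left?" reduces to checking the sign of the second coordinate after the transform, which is why an error in the estimated pose propagates directly into the answer.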

Code & Models

Repositories

YunzeMan/Situation3D (PyTorch, official)

Taxonomy

Topics: Categorization, perception, and language