GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
Saumya Saxena, Blake Buchanan, Chris Paxton, Peiqi Liu, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer

TL;DR
GraphEQA introduces a real-time 3D semantic scene graph approach for embodied question answering, enabling robots to explore and understand unseen environments more effectively and efficiently.
Contribution
It is the first to integrate real-time 3D semantic scene graphs with vision-language models for EQA in unseen environments, improving planning and exploration.
Findings
Outperforms baselines in success rate and efficiency on benchmark datasets.
Effective in real-world home and office environments.
Utilizes hierarchical planning with 3D scene graphs for structured exploration.
Abstract
In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
