GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Saumya Saxena; Blake Buchanan; Chris Paxton; Peiqi Liu; Bingqing Chen; Narunas Vaskevicius; Luigi Palmieri; Jonathan Francis; Oliver Kroemer

arXiv:2412.14480 · cs.RO · September 25, 2025

TL;DR

GraphEQA introduces a real-time 3D semantic scene graph approach for embodied question answering, enabling robots to explore and understand unseen environments more effectively and efficiently.

Contribution

GraphEQA is presented as the first approach to integrate real-time 3D semantic scene graphs with vision-language models for EQA in unseen environments, improving planning and exploration.

Findings

01. Outperforms baselines in success rate and efficiency on benchmark datasets.

02. Effective in real-world home and office environments.

03. Utilizes hierarchical planning with 3D scene graphs for structured exploration.

Abstract

In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA,…
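The idea of planning over a hierarchical scene graph can be illustrated with a small sketch. This is not the authors' implementation. It assumes a two-layer hierarchy (rooms containing objects) and uses a hypothetical word-overlap `score` function as a stand-in for the relevance judgement that a VLM would make. All class and function names here (`SceneGraph`, `RoomNode`, `ObjectNode`, `score`, `plan`) are invented for illustration.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    position: tuple[float, float, float]  # metric position (assumed metres)


@dataclass
class RoomNode:
    name: str
    objects: list[ObjectNode] = field(default_factory=list)


@dataclass
class SceneGraph:
    rooms: dict[str, RoomNode] = field(default_factory=dict)

    def add_object(self, room: str, obj: ObjectNode) -> None:
        # Create the room node the first time it is seen, then attach the object.
        self.rooms.setdefault(room, RoomNode(room)).objects.append(obj)


def score(text: str, question: str) -> int:
    # Hypothetical stand-in for a VLM: count the words shared with the question.
    words = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    return len(words(text) & words(question))


def plan(graph: SceneGraph, question: str) -> tuple[str, str | None]:
    # Level 1: pick the room whose name and contents best match the question.
    # Assumes the graph already contains at least one room.
    room = max(
        graph.rooms.values(),
        key=lambda r: score(" ".join([r.name] + [o.name for o in r.objects]), question),
    )
    # Level 2: within that room, pick the best-matching object (None if empty).
    target = max(room.objects, key=lambda o: score(o.name, question), default=None)
    return room.name, target.name if target else None


graph = SceneGraph()
graph.add_object("kitchen", ObjectNode("stove", (1.0, 2.0, 0.0)))
graph.add_object("kitchen", ObjectNode("sink", (2.0, 2.0, 0.0)))
graph.add_object("bedroom", ObjectNode("bed", (5.0, 1.0, 0.0)))
print(plan(graph, "Is the stove in the kitchen turned on?"))
```

Selecting a room before selecting an object narrows the search at each level, which is the general benefit of a hierarchical representation. A real system would update the graph online as new observations arrive and would query a VLM rather than match words.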

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques