GraphPad: Inference-Time 3D Scene Graph Updates for Embodied Question Answering

Muhammad Qasim Ali; Saeejith Nair; Alexander Wong; Yuchen Cui; Yuhao Chen

arXiv:2506.01174 · cs.AI · June 3, 2025

Open Access

TL;DR

GraphPad introduces a dynamic, modifiable 3D scene graph memory for embodied agents, enabling real-time updates and task-specific refinement that improve scene understanding and question answering performance.

Contribution

The paper presents GraphPad, an API-driven, mutable scene graph memory that an embodied agent can adapt during a task without additional training.

Findings

01. Achieves 55.3% accuracy on OpenEQA, outperforming static baselines.

02. Operates with five times fewer input frames, reducing computational load.

03. Enables online, language-driven scene graph updates for better task alignment.

Abstract

Structured scene representations are a core component of embodied agents, helping to consolidate raw sensory streams into readable, modular, and searchable formats. Because they are computationally expensive to build, many approaches construct such representations before the task begins. However, when the task specification changes, these static representations become inadequate because they may miss key objects, spatial relations, and details. We introduce GraphPad, a modifiable structured memory that an agent can tailor to the needs of the task through API calls. It comprises a mutable scene graph representing the environment, a navigation log indexing frame-by-frame content, and a scratchpad for task-specific notes. Together, these components make GraphPad a dynamic workspace that remains complete, current, and aligned with the agent's immediate understanding of the scene and its task. On the OpenEQA benchmark, GraphPad…
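To make the three components named in the abstract more concrete, here is a minimal Python sketch of a structured memory holding a mutable scene graph, a navigation log, and a scratchpad. It is an illustration only, not the paper's API: the class name StructuredMemory, the methods add_object, add_relation, frames_with, and note, and all field layouts are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    """A scene graph node. The attribute layout is an illustrative assumption."""
    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class StructuredMemory:
    """Hypothetical container for the three components named in the abstract."""
    # Mutable scene graph: object nodes plus (subject, relation, object) edges.
    objects: dict = field(default_factory=dict)
    relations: set = field(default_factory=set)
    # Navigation log: frame index -> names of objects recorded for that frame.
    navigation_log: dict = field(default_factory=dict)
    # Scratchpad: free-form, task-specific notes.
    scratchpad: list = field(default_factory=list)

    def add_object(self, name, frame, **attributes):
        """Insert a node, or merge new attributes into an existing one."""
        obj = self.objects.setdefault(name, SceneObject(name))
        obj.attributes.update(attributes)
        seen = self.navigation_log.setdefault(frame, [])
        if name not in seen:
            seen.append(name)

    def add_relation(self, subject, relation, obj):
        """Add a directed edge between two nodes that already exist."""
        for endpoint in (subject, obj):
            if endpoint not in self.objects:
                raise KeyError(f"unknown object: {endpoint}")
        self.relations.add((subject, relation, obj))

    def frames_with(self, name):
        """Return the sorted frame indices whose log entry mentions `name`."""
        return sorted(
            frame for frame, names in self.navigation_log.items() if name in names
        )

    def note(self, text):
        """Append a task-specific note to the scratchpad."""
        self.scratchpad.append(text)


# Example: the memory is refined as new frames are inspected during a task.
memory = StructuredMemory()
memory.add_object("mug", frame=12, color="red")
memory.add_object("table", frame=12)
memory.add_object("mug", frame=30, state="empty")  # later update to the same node
memory.add_relation("mug", "on", "table")
memory.note("The question concerns the mug; frames 12 and 30 show it.")
```

In this sketch, updating an existing object merges attributes rather than replacing the node, which is one simple way to keep the graph current as the agent revisits frames. How GraphPad itself resolves updates is not specified in the text above.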

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition