MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

Yuanchen Ju; Yongyuan Liang; Yen-Jen Wang; Nandiraju Gireesh; Yuanliang Ju; Seungjae Lee; Qiao Gu; Elvis Hsieh; Furong Huang; Koushil Sreenath

arXiv:2512.16909 · cs.CV · February 10, 2026

Open Access

TL;DR

MomaGraph introduces a unified, task-driven scene graph representation for embodied agents, supported by a large dataset and a vision-language model that excels in zero-shot planning and understanding in household environments.

Contribution

The paper presents MomaGraph, a unified scene graph representation, together with a large-scale dataset and an evaluation suite, aimed at improving embodied task planning and scene understanding.

Findings

1. Achieved 71.6% accuracy on MomaGraph-Bench, surpassing baselines by 11.4%.
2. Demonstrated effective zero-shot task planning and generalization to real robots.
3. Provided the first large-scale, task-driven scene graph dataset for household environments.

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Reinforcement Learning in Robotics