GSR: Learning Structured Reasoning for Embodied Manipulation
Kewei Hu, Michael Zhang, Wei Ying, Tianhao Liu, Guoqiang Hao, Zimeng Li, Wanchan Yu, Jiajian Jing, Fangwen Chen, Hanwen Kang

TL;DR
GSR introduces a structured reasoning framework using scene graphs for embodied manipulation, enabling better generalization and long-horizon task success by explicitly modeling world states and their evolution.
Contribution
The paper proposes Grounded Scene-graph Reasoning (GSR), an explicit reasoning paradigm for embodied agents, and constructs Manip-Cognition-1.6M, a large-scale dataset for training and evaluation.
Findings
GSR outperforms prompting baselines in zero-shot generalization.
GSR achieves higher success rates in long-horizon tasks.
Explicit world-state modeling improves embodied reasoning.
Abstract
Despite rapid progress, embodied agents still struggle with long-horizon manipulation that requires maintaining spatial consistency, causal dependencies, and goal constraints. A key limitation of existing approaches is that task reasoning is implicitly embedded in high-dimensional latent representations, making it challenging to separate task structure from perceptual variability. We introduce Grounded Scene-graph Reasoning (GSR), a structured reasoning paradigm that explicitly models world-state evolution as transitions over semantically grounded scene graphs. By reasoning step-wise over object states and spatial relations, rather than directly mapping perception to actions, GSR enables explicit reasoning about action preconditions, consequences, and goal satisfaction in a physically grounded space. To support learning such reasoning, we construct Manip-Cognition-1.6M, a large-scale…
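To make the idea of "transitions over scene graphs" concrete, here is a minimal sketch, not the authors' implementation. It assumes a scene graph can be approximated as a set of (subject, predicate, object) triples and that each action carries symbolic preconditions and add/delete effects. The names `Action`, `applicable`, `apply_action`, `rollout`, and `satisfied`, as well as the cup-and-drawer task, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Assumption: a scene graph is a set of (subject, predicate, object) triples,
# e.g. ("cup", "on", "table"). The paper's actual representation may differ.
Relation = tuple[str, str, str]
SceneGraph = frozenset[Relation]


@dataclass(frozen=True)
class Action:
    """Hypothetical symbolic action with preconditions and add/delete effects."""
    name: str
    preconditions: SceneGraph
    add: SceneGraph
    delete: SceneGraph


def applicable(state: SceneGraph, action: Action) -> bool:
    # An action is applicable when every precondition holds in the current graph.
    return action.preconditions <= state


def apply_action(state: SceneGraph, action: Action) -> SceneGraph:
    # One world-state transition: remove deleted relations, then add new ones.
    if not applicable(state, action):
        raise ValueError(f"preconditions of {action.name!r} are not satisfied")
    return (state - action.delete) | action.add


def rollout(state: SceneGraph, plan: list[Action]) -> SceneGraph:
    # Step-wise reasoning: apply each action in order, checking preconditions.
    for action in plan:
        state = apply_action(state, action)
    return state


def satisfied(state: SceneGraph, goal: SceneGraph) -> bool:
    # The goal is met when all goal relations are present in the graph.
    return goal <= state


initial = frozenset({
    ("cup", "on", "table"),
    ("gripper", "is", "empty"),
    ("drawer", "is", "open"),
})
pick_cup = Action(
    name="pick_cup",
    preconditions=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
    add=frozenset({("gripper", "holding", "cup")}),
    delete=frozenset({("cup", "on", "table"), ("gripper", "is", "empty")}),
)
place_cup_in_drawer = Action(
    name="place_cup_in_drawer",
    preconditions=frozenset({("gripper", "holding", "cup"), ("drawer", "is", "open")}),
    add=frozenset({("cup", "in", "drawer"), ("gripper", "is", "empty")}),
    delete=frozenset({("gripper", "holding", "cup")}),
)
goal = frozenset({("cup", "in", "drawer")})

final = rollout(initial, [pick_cup, place_cup_in_drawer])
print(satisfied(final, goal))
```

In this sketch, attempting `place_cup_in_drawer` before `pick_cup` raises a `ValueError`, because the gripper is not yet holding the cup. That is the kind of precondition check that explicit world-state modeling makes possible and that a direct perception-to-action mapping leaves implicit.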
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Social Robot Interaction and HRI
