MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation

Chenxu Li; Zixuan Chen; Yetao Li; Jiapeng Xu; Hongyu Ding; Jieqi Shi; Jing Huo; Yang Gao

arXiv:2603.08383·cs.RO·March 10, 2026

Open Access

TL;DR

MoMaStage introduces a structured vision-language framework for long-horizon indoor mobile manipulation that ensures logical consistency and robustness without explicit scene mapping, significantly improving task success in complex environments.

Contribution

The paper proposes a Hierarchical Skill Library and a topology-aware Skill-State Graph that ground a vision-language model, enabling flexible, topologically valid planning and closed-loop execution without explicit scene mapping.

Findings

01. Outperforms state-of-the-art baselines in simulation and real-world tasks.

02. Achieves higher planning success and task completion rates.

03. Reduces token overhead and improves robustness in dynamic environments.

Abstract

Indoor mobile manipulation (MoMA) enables robots to translate natural language instructions into physical actions, yet long-horizon execution remains challenging due to cascading errors and limited generalization across diverse environments. Learning-based approaches often fail to maintain logical consistency over extended horizons, while methods relying on explicit scene representations impose rigid structural assumptions that reduce adaptability in dynamic settings. To address these limitations, we propose MoMaStage, a structured vision-language framework for long-horizon MoMA that eliminates the need for explicit scene mapping. MoMaStage grounds a Vision-Language Model (VLM) within a Hierarchical Skill Library and a topology-aware Skill-State Graph, constraining task decomposition and skill composition within a feasible transition space. This structured grounding ensures that…
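
To make the idea of constraining skill composition to a feasible transition space concrete, here is a minimal sketch of plan validation against a skill-state graph. It is not the paper's implementation: the abstract does not specify how the graph is built, so the skill names, the state labels, and the helper `first_invalid_step` are all hypothetical, and each skill is assumed to have exactly one precondition state and one resulting state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    pre: str   # state required before the skill can run
    post: str  # state reached after the skill succeeds


# Hypothetical skills and states, invented for illustration only.
SKILLS = {
    s.name: s
    for s in [
        Skill("navigate_to", "hand_empty", "hand_empty"),
        Skill("pick", "hand_empty", "holding"),
        Skill("carry_to", "holding", "holding"),
        Skill("place", "holding", "hand_empty"),
    ]
}


def first_invalid_step(plan, start_state="hand_empty"):
    """Return the index of the first step that violates the graph, or None."""
    state = start_state
    for i, name in enumerate(plan):
        skill = SKILLS.get(name)
        # Reject unknown skills and skills whose precondition is not met.
        if skill is None or skill.pre != state:
            return i
        state = skill.post
    return None


valid_plan = ["navigate_to", "pick", "carry_to", "place"]
broken_plan = ["navigate_to", "place"]  # tries to place with an empty hand

print(first_invalid_step(valid_plan))   # None
print(first_invalid_step(broken_plan))  # 1
```

In a setup like this, a plan proposed by a VLM could be checked before execution and the index of the offending step returned as feedback for replanning. Whether MoMaStage works this way is not stated in the abstract.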

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms