MoMaStage: Skill-State Graph Guided Planning and Closed-Loop Execution for Long-Horizon Indoor Mobile Manipulation
Chenxu Li, Zixuan Chen, Yetao Li, Jiapeng Xu, Hongyu Ding, Jieqi Shi, Jing Huo, Yang Gao

TL;DR
MoMaStage introduces a structured vision-language framework for long-horizon indoor mobile manipulation that ensures logical consistency and robustness without explicit scene mapping, significantly improving task success in complex environments.
Contribution
It proposes a Hierarchical Skill Library and a topology-aware Skill-State Graph that ground a vision-language model, constraining planning to topologically valid skill transitions and enabling closed-loop execution without explicit scene mapping.
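To make the idea of topologically valid planning concrete, the following is a minimal sketch of how a skill-state graph could reject an infeasible skill sequence. It is not the authors' implementation: the skill names, the state labels, and the function `first_invalid_step` are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical skill table (an assumption, not from the paper):
# skill name -> (state required before the skill, state after the skill).
SKILLS: Dict[str, Tuple[str, str]] = {
    "navigate_to_object": ("idle", "at_object"),
    "pick": ("at_object", "holding"),
    "navigate_to_target": ("holding", "at_target"),
    "place": ("at_target", "idle"),
}


def build_graph(skills: Dict[str, Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each state to the set of skills that may start from it."""
    graph: Dict[str, Set[str]] = {}
    for name, (pre_state, _post_state) in skills.items():
        graph.setdefault(pre_state, set()).add(name)
    return graph


def first_invalid_step(plan: List[str], start: str = "idle") -> int:
    """Return the index of the first skill that is not a valid
    transition from the current state, or -1 if the plan is valid."""
    graph = build_graph(SKILLS)
    state = start
    for index, skill in enumerate(plan):
        if skill not in graph.get(state, set()):
            return index
        state = SKILLS[skill][1]  # advance to the skill's resulting state
    return -1


if __name__ == "__main__":
    good_plan = ["navigate_to_object", "pick", "navigate_to_target", "place"]
    bad_plan = ["navigate_to_object", "place"]  # tries to place without picking
    print(first_invalid_step(good_plan))
    print(first_invalid_step(bad_plan))
```

In a setup like this, a plan proposed by a language model could be checked before execution, and the index of the first invalid step could be used to request a corrected plan. Whether MoMaStage performs validation in this way is an assumption of the sketch.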
Findings
Outperforms state-of-the-art baselines in simulation and real-world tasks.
Achieves higher planning success and task completion rates.
Reduces token overhead and improves robustness in dynamic environments.
Abstract
Indoor mobile manipulation (MoMA) enables robots to translate natural language instructions into physical actions, yet long-horizon execution remains challenging due to cascading errors and limited generalization across diverse environments. Learning-based approaches often fail to maintain logical consistency over extended horizons, while methods relying on explicit scene representations impose rigid structural assumptions that reduce adaptability in dynamic settings. To address these limitations, we propose MoMaStage, a structured vision-language framework for long-horizon MoMA that eliminates the need for explicit scene mapping. MoMaStage grounds a Vision-Language Model (VLM) within a Hierarchical Skill Library and a topology-aware Skill-State Graph, constraining task decomposition and skill composition within a feasible transition space. This structured grounding ensures that…
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms
