Scaling World Model for Hierarchical Manipulation Policies

Qian Long, Yueze Wang, Jiaxi Song, Junbo Zhang, Peiyan Li, Wenxuan Wang, Yuqi Wang, Haoyang Li, Shaoxuan Xie, Guocai Yao, Hanbo Zhang, Xinlong Wang, Zhongyuan Wang, Xuguang Lan, Huaping Liu, and Xinghang Li

arXiv:2602.10983·cs.RO·February 13, 2026

Open Access

TL;DR

This paper introduces a hierarchical vision-language-action framework that leverages a large-scale pre-trained world model to improve robot manipulation in out-of-distribution scenarios, significantly enhancing generalization and performance.

Contribution

The hierarchical framework pairs a world model as the high-level planner with a vision-language-action (VLA) policy as the low-level executor, improving generalization to unseen objects and scenarios.

Findings

01

Performance boost from 14% to 69% in OOD scenarios

02

Outperforms previous baselines in out-of-distribution tasks

03

Effective visual goal synthesis improves manipulation success

Abstract

Vision-Language-Action (VLA) models are promising for generalist robot manipulation but remain brittle in out-of-distribution (OOD) settings, especially with limited real-robot data. To resolve this generalization bottleneck, we introduce VISTA (VIsual Subgoal TAsk decomposition), a hierarchical Vision-Language-Action framework that leverages the generalization of a large-scale pre-trained world model for robust and generalizable visual subgoal task decomposition. VISTA consists of a world model as the high-level planner and a VLA as the low-level executor. The high-level world model first divides manipulation tasks into subtask sequences with goal images, and the low-level policy follows the textual and visual guidance to generate action sequences. Compared to raw textual goal specification, these synthesized goal images provide visually and physically grounded details for low-level…
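
The planner/executor split described in the abstract can be pictured as a simple control loop. The sketch below is illustrative only and is not the authors' implementation: `Subgoal`, `run_hierarchical_episode`, and the callables `plan_subgoals`, `act`, `step`, and `reached` are hypothetical names, and the subgoal-completion check is an assumption the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[List[float]]  # placeholder image type (assumption)
Action = List[float]       # placeholder action type (assumption)


@dataclass
class Subgoal:
    """One subtask: a text instruction plus a synthesized goal image."""
    instruction: str
    goal_image: Image


def run_hierarchical_episode(
    task: str,
    observation: Image,
    plan_subgoals: Callable[[str, Image], Sequence[Subgoal]],  # stands in for the world model
    act: Callable[[Image, Subgoal], Sequence[Action]],         # stands in for the VLA policy
    step: Callable[[Image, Action], Image],                    # environment transition
    reached: Callable[[Image, Subgoal], bool],                 # hypothetical completion check
    max_chunks_per_subgoal: int = 50,
) -> List[Action]:
    """Plan subgoals once, then let the low-level policy pursue each in turn."""
    executed: List[Action] = []
    # High level: divide the task into a sequence of (instruction, goal image) pairs.
    for subgoal in plan_subgoals(task, observation):
        for _ in range(max_chunks_per_subgoal):
            if reached(observation, subgoal):
                break
            # Low level: condition on both the text and the goal image.
            for action in act(observation, subgoal):
                observation = step(observation, action)
                executed.append(action)
    return executed
```

Passing the models in as plain callables keeps the sketch independent of any particular world model or VLA; a real system would also need replanning and failure handling, which the abstract excerpt does not describe.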


Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics