MIND-V: Hierarchical World Model for Long-Horizon Robotic Manipulation with RL-based Physical Alignment

Ruicheng Zhang; Mingyang Zhang; Jun Zhou; Zhangrui Guo; Zunnan Xu; Xiaofan Liu; Zhizhou Zhong; Puxin Yan; Haocheng Luo; Xiu Li

arXiv:2512.06628 · cs.RO · March 16, 2026

TL;DR

MIND-V is a hierarchical world model that synthesizes long-horizon, physically plausible videos of robotic manipulation by integrating high-level reasoning, domain-invariant representations, and physics-aware reinforcement learning.

Contribution

The paper introduces MIND-V, a hierarchical world model that combines components inspired by cognitive science with physics-aware reinforcement learning for scalable long-horizon robotic simulation.

Findings

01 Achieves state-of-the-art performance in long-horizon simulation tasks.

02 Enhances policy learning through realistic, physics-consistent data.

03 Demonstrates robustness with staged visual future rollouts.

Abstract

Scalable embodied intelligence is constrained by the scarcity of diverse, long-horizon robotic manipulation data. Existing video world models in this domain are limited to synthesizing short clips of simple actions and often rely on manually defined trajectories. To this end, we introduce MIND-V, a cognitive hierarchical world model designed to synthesize physically plausible and logically coherent videos of long-horizon robotic manipulation. Inspired by cognitive science, MIND-V bridges high-level reasoning with pixel-level synthesis through three core components: a Semantic Reasoning Hub (SRH) that leverages a pre-trained vision-language model for task planning; a Behavioral Semantic Bridge (BSB) that translates abstract instructions into domain-invariant representations; and a Motor Video Generator (MVG) for conditional video rendering. MIND-V employs Staged Visual Future Rollouts, a…
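
The three components named in the abstract suggest a plan, translate, render dataflow. The sketch below is only an illustration of that dataflow under stated assumptions: every name in it (`SubTask`, `Representation`, `VideoClip`, `run_pipeline`, and the `toy_*` stubs) is hypothetical and not taken from the paper, and the stubs stand in for the vision-language planner, the domain-invariant bridge, and the video generator, none of which are implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubTask:
    """One step produced by the planner (hypothetical type)."""
    instruction: str


@dataclass
class Representation:
    """Stand-in for a domain-invariant representation (hypothetical type)."""
    subtask: SubTask
    features: List[float]


@dataclass
class VideoClip:
    """Stand-in for a rendered clip; frames are plain strings here."""
    frames: List[str]


def run_pipeline(
    task: str,
    plan: Callable[[str], List[SubTask]],           # role of the SRH
    bridge: Callable[[SubTask], Representation],    # role of the BSB
    render: Callable[[Representation], VideoClip],  # role of the MVG
) -> List[VideoClip]:
    """Chain the three stages: plan the task, translate each step, render it."""
    clips = []
    for subtask in plan(task):
        representation = bridge(subtask)
        clips.append(render(representation))
    return clips


# Toy stubs, for illustration only. They are NOT the paper's models.
def toy_plan(task: str) -> List[SubTask]:
    # Split a comma-separated instruction into sub-tasks.
    return [SubTask(part.strip()) for part in task.split(",") if part.strip()]


def toy_bridge(subtask: SubTask) -> Representation:
    # Use the instruction length as a placeholder feature.
    return Representation(subtask, [float(len(subtask.instruction))])


def toy_render(representation: Representation) -> VideoClip:
    # Emit a single placeholder "frame" per sub-task.
    return VideoClip([f"frame for: {representation.subtask.instruction}"])


clips = run_pipeline(
    "pick up the cup, place it on the shelf",
    toy_plan,
    toy_bridge,
    toy_render,
)
```

Passing the stages in as callables keeps the sketch independent of any particular model; a real system would replace each stub with the corresponding learned component.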

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Reinforcement Learning in Robotics