Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
Minghao Jin, Mozheng Liao, Mingfei Han, Zhihui Li, Xiaojun Chang

TL;DR
This paper introduces StructVLA, a structured planner for robotic manipulation that predicts sparse, physically meaningful frames derived from kinematic cues, with the aim of improving long-horizon planning and control relative to dense future prediction.
Contribution
The paper proposes a novel structured world model that predicts sparse, physically meaningful frames, bridging visual planning and low-level motion control in robotic manipulation.
Findings
Achieved 75.0% success on SimplerEnv-WidowX
Achieved 94.8% success on LIBERO
Demonstrated robust real-world task generalization
Abstract
Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture…
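To make the idea of kinematic cues concrete, here is a minimal sketch of how sparse candidate frames might be picked from a recorded trajectory. It is not the authors' implementation: the function name `select_structured_frames`, the cosine threshold, and the input format (end-effector positions plus a per-step gripper open/closed flag) are all assumptions chosen for illustration. A frame is kept when the gripper state changes, or when the direction of motion bends sharply.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def select_structured_frames(
    positions: Sequence[Vec3],
    gripper_open: Sequence[bool],
    turn_cos_threshold: float = 0.5,  # hypothetical value, not from the paper
    eps: float = 1e-9,
) -> List[int]:
    """Return sorted indices of candidate sparse frames (illustrative only)."""
    n = len(positions)
    if n != len(gripper_open):
        raise ValueError("positions and gripper_open must have equal length")

    keep = set()

    # Cue 1: gripper transitions (open -> closed or closed -> open).
    for t in range(1, n):
        if gripper_open[t] != gripper_open[t - 1]:
            keep.add(t)

    # Cue 2: turning points, where the incoming and outgoing displacement
    # vectors have a cosine similarity below the threshold.
    for t in range(1, n - 1):
        v_in = [b - a for a, b in zip(positions[t - 1], positions[t])]
        v_out = [b - a for a, b in zip(positions[t], positions[t + 1])]
        norm_in = math.sqrt(sum(c * c for c in v_in))
        norm_out = math.sqrt(sum(c * c for c in v_out))
        if norm_in < eps or norm_out < eps:
            continue  # skip stationary steps, direction is undefined
        cos = sum(a * b for a, b in zip(v_in, v_out)) / (norm_in * norm_out)
        if cos < turn_cos_threshold:
            keep.add(t)

    return sorted(keep)


# Toy trajectory: move along x, then turn upward along z and close the gripper.
positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, 1), (2, 0, 2)]
gripper_open = [True, True, True, False, False]
print(select_structured_frames(positions, gripper_open))  # [2, 3]
```

In this toy trajectory, index 2 is kept because the motion turns from the x direction to the z direction, and index 3 is kept because the gripper closes there. How StructVLA actually detects and uses such cues is not specified in the truncated abstract above.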
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Robotic Path Planning Algorithms
