Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation

Minghao Jin; Mozheng Liao; Mingfei Han; Zhihui Li; Xiaojun Chang

arXiv:2603.12553 · cs.RO · March 16, 2026

TL;DR

This paper introduces StructVLA, a structured planner for robotic manipulation that predicts sparse, physically meaningful frames derived from kinematic cues rather than dense future rollouts, with the aim of reducing long-horizon plan drift and tightening the link between planning and low-level control.

Contribution

The paper proposes a novel structured world model that predicts sparse, physically meaningful frames, bridging visual planning and low-level motion control in robotic manipulation.

Findings

01. Achieved 75.0% success on SimplerEnv-WidowX

02. Achieved 94.8% success on LIBERO

03. Demonstrated robust real-world task generalization

Abstract

Recent world-model-based Vision-Language-Action (VLA) architectures have improved robotic manipulation through predictive visual foresight. However, dense future prediction introduces visual redundancy and accumulates errors, causing long-horizon plan drift. Meanwhile, recent sparse methods typically represent visual foresight using high-level semantic subtasks or implicit latent states. These representations often lack explicit kinematic grounding, weakening the alignment between planning and low-level execution. To address this, we propose StructVLA, which reformulates a generative world model into an explicit structured planner for reliable control. Instead of dense rollouts or semantic goals, StructVLA predicts sparse, physically meaningful structured frames. Derived from intrinsic kinematic cues (e.g., gripper transitions and kinematic turning points), these frames capture…
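To make the idea of kinematically derived sparse frames concrete, here is a minimal sketch of how candidate frames might be picked from a recorded trajectory. It is not the authors' procedure: the function name select_structured_frames, the binary gripper signal, the use of end-effector positions, and the 45-degree turning threshold are all assumptions made for illustration. It simply keeps timesteps where the gripper state changes or where the motion direction turns sharply.

```python
import math


def select_structured_frames(positions, gripper_open, angle_deg=45.0):
    """Return sorted timestep indices of candidate sparse frames.

    Hypothetical helper, not taken from the paper.
    positions:    list of (x, y, z) end-effector positions, one per timestep.
    gripper_open: list of 0/1 gripper states, same length as positions.
    angle_deg:    assumed threshold; a direction change larger than this
                  counts as a kinematic turning point.
    """
    if len(positions) != len(gripper_open):
        raise ValueError("positions and gripper_open must have equal length")

    keep = set()
    n = len(positions)

    # Gripper transitions: the state at t differs from the state at t - 1.
    for t in range(1, n):
        if gripper_open[t] != gripper_open[t - 1]:
            keep.add(t)

    # Turning points: the angle between incoming and outgoing displacement
    # vectors exceeds the threshold (equivalently, cosine below cos(threshold)).
    cos_threshold = math.cos(math.radians(angle_deg))
    for t in range(1, n - 1):
        v_in = [b - a for a, b in zip(positions[t - 1], positions[t])]
        v_out = [b - a for a, b in zip(positions[t], positions[t + 1])]
        norm_in = math.hypot(*v_in)
        norm_out = math.hypot(*v_out)
        if norm_in < 1e-9 or norm_out < 1e-9:
            continue  # stationary step, direction is undefined
        cosine = sum(a * b for a, b in zip(v_in, v_out)) / (norm_in * norm_out)
        if cosine < cos_threshold:
            keep.add(t)

    return sorted(keep)


# Toy trajectory: move along x, turn 90 degrees at t=2, close gripper at t=3.
demo_positions = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0)]
demo_gripper = [1, 1, 1, 0, 0]
demo_frames = select_structured_frames(demo_positions, demo_gripper)
print(demo_frames)  # [2, 3]
```

In this toy trajectory only two of five timesteps survive, which illustrates the contrast with dense prediction: a planner trained on such frames would target a handful of kinematically significant states instead of every intermediate image.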

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Robotic Path Planning Algorithms