STARRY: Spatial-Temporal Action-Centric World Modeling for Robotic Manipulation

Yuxuan Tian; Yurun Jin; Bin Yu; Yukun Shi; Hao Wu; Chi Harold Liu; Kai Chen; Cong Huang

arXiv:2604.26848 · cs.RO · May 4, 2026

TL;DR

STARRY is a world-model-enhanced manipulation policy that jointly denoises future spatial-temporal latents and actions in a single diffusion process, raising success rates on tasks that need precise spatial-temporal coordination.

Contribution

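The paper presents STARRY, a world-model-enhanced action-generation policy, together with Geometry-Aware Selective Attention Modulation (GASAM), which converts predicted depth and end-effector geometry into token-aligned weights that selectively modulate action attention.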

Findings

01 Achieves 93.82% / 93.30% average success on RoboTwin 2.0 under the Clean and Randomized settings.

02 Improves real-world success rate from 42.5% to 70.8%.

03 Demonstrates effective spatial-temporal reasoning in manipulation.

Abstract

Robotic manipulation requires reasoning about future spatial-temporal interactions and geometric constraints, yet existing Vision-Language-Action (VLA) policies often leave predictive representation weakly coupled with action execution, causing failures in tasks requiring precise spatial-temporal coordination. We propose STARRY, a world-model-enhanced action-generation policy that aligns spatial-temporal prediction and action generation by jointly denoising future spatial-temporal latents and actions through a unified diffusion process. To bridge 2D visual tokens and 3D metric control, STARRY introduces Geometry-Aware Selective Attention Modulation (GASAM), which converts predicted depth and end-effector geometry into token-aligned weights for selective action-attention modulation. On RoboTwin 2.0, STARRY achieves 93.82% / 93.30% average success under Clean and Randomized settings…
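To make the GASAM idea in the abstract more concrete, here is a minimal NumPy sketch of one way predicted depth and an end-effector position could be turned into token-aligned weights that bias action attention. This is not the authors' implementation: the Gaussian distance weighting, the log-weight bias on the attention logits, and all names (`patch_geometry_weights`, `modulated_attention`, `patch`, `sigma`) are assumptions chosen for illustration only.

```python
import numpy as np

def patch_geometry_weights(depth, K, ee_xyz, patch=4, sigma=0.1):
    """Hypothetical helper: one weight per visual token (image patch).

    depth  : (H, W) predicted depth map, in metres
    K      : (3, 3) pinhole camera intrinsics
    ee_xyz : (3,) end-effector position in the camera frame
    Assumption: weights decay as a Gaussian of the 3D distance
    between each back-projected patch centre and the end-effector.
    """
    H, W = depth.shape
    gh, gw = H // patch, W // patch
    # Average depth inside each patch so weights align with the token grid.
    d = depth[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    # Pixel coordinates of patch centres (v = row, u = column).
    v, u = np.meshgrid((np.arange(gh) + 0.5) * patch,
                       (np.arange(gw) + 0.5) * patch, indexing="ij")
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Back-project patch centres to 3D camera-frame points.
    pts = np.stack([(u - cx) / fx * d, (v - cy) / fy * d, d], axis=-1).reshape(-1, 3)
    dist = np.linalg.norm(pts - np.asarray(ee_xyz), axis=1)
    return np.exp(-0.5 * (dist / sigma) ** 2)  # shape (gh * gw,), values in (0, 1]

def modulated_attention(q, k, v, w, eps=1e-6):
    """Scaled dot-product attention from action queries to visual tokens,
    with log(w) added to the logits (an assumed form of modulation)."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + np.log(w + eps)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v, attn

# Toy usage: 8x8 depth map, 2x2 token grid, end-effector over the bottom-right patch.
depth = np.ones((8, 8))
K = np.array([[8.0, 0.0, 4.0], [0.0, 8.0, 4.0], [0.0, 0.0, 1.0]])
w = patch_geometry_weights(depth, K, ee_xyz=[0.25, 0.25, 1.0])
q = np.zeros((1, 16))          # one action query
k = np.zeros((4, 16))          # four visual tokens
vals = np.eye(4)
out, attn = modulated_attention(q, k, vals, w)
```

In this toy setup the content logits are all equal, so the attention distribution is driven entirely by the geometric weights and concentrates on the token nearest the end-effector. Adding the bias in log space keeps the softmax normalized while letting geometry suppress distant tokens; other modulation forms (for example, multiplicative gating after the softmax) would be equally plausible readings of the abstract.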
