ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

Chuanhao Ma; Hanyu Zhou; Shihan Peng; Yan Li; Tao Gu; Luxin Yan

arXiv:2604.17880·cs.RO·April 21, 2026

TL;DR

ST-$\pi$ is a structured spatiotemporal vision-language-action model that explicitly encodes and reasons about sub-tasks and their spatial and temporal boundaries to improve robotic manipulation.

Contribution

The paper presents a novel structured framework combining spatiotemporal encoding and dual-generator guidance for explicit reasoning in robotic manipulation tasks.

Findings

01. Effective spatiotemporal reasoning improves manipulation accuracy.

02. Explicit planning of sub-tasks enhances multi-step task performance.

03. The model outperforms existing methods on the proposed dataset.

Abstract

Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal…
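
One way to picture the chunk-level action prompts described in the abstract is as a small record per chunk, checked for causal ordering. The Python sketch below is illustrative only and is not taken from the paper or its repository: the class `ChunkPrompt`, its field names, the 3D-point form of spatial grounding, and the step-index form of temporal boundaries are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ChunkPrompt:
    """Hypothetical chunk-level action prompt; all field names are illustrative."""
    sub_task: str                            # natural-language sub-task description
    target_xyz: Tuple[float, float, float]   # assumed form of spatial grounding
    start_step: int                          # assumed temporal boundary (inclusive)
    end_step: int                            # assumed temporal boundary (exclusive)


def is_causally_ordered(prompts: List[ChunkPrompt]) -> bool:
    """Return True if every chunk spans at least one step and
    no chunk starts before the previous chunk has ended."""
    prev_end = None
    for p in prompts:
        if p.start_step >= p.end_step:
            return False  # empty or inverted time span
        if prev_end is not None and p.start_step < prev_end:
            return False  # overlaps or precedes the previous chunk
        prev_end = p.end_step
    return True


# A made-up three-chunk plan for a pick-and-place instruction.
plan = [
    ChunkPrompt("reach for the cup", (0.40, 0.10, 0.05), 0, 20),
    ChunkPrompt("grasp the cup", (0.40, 0.10, 0.05), 20, 30),
    ChunkPrompt("place the cup on the shelf", (0.10, 0.55, 0.30), 30, 60),
]
```

Under these assumptions, `is_causally_ordered(plan)` accepts the plan as written and rejects the same chunks in reversed order, which is the kind of explicit boundary check that implicit step-level prediction does not expose.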

Code & Models

Repositories

chuanhaoma/ST-pi
github
