TL;DR
ST-$\pi$ is a structured spatiotemporal vision-language-action model that explicitly encodes and reasons about sub-tasks and their spatial and temporal boundaries to improve robotic manipulation.
Contribution
The paper presents a novel structured framework combining spatiotemporal encoding and dual-generator guidance for explicit reasoning in robotic manipulation tasks.
Findings
Effective spatiotemporal reasoning improves manipulation accuracy.
Explicit planning of sub-tasks enhances multi-step task performance.
The model outperforms existing methods on the proposed dataset.
Abstract
Vision-language-action (VLA) models have achieved great success on general robotic tasks, but still face challenges in fine-grained spatiotemporal manipulation. Typically, existing methods mainly embed spatiotemporal knowledge into visual and action representations, and directly perform a cross-modal mapping for step-level action prediction. However, such spatiotemporal reasoning remains largely implicit, making it difficult to handle multiple sequential behaviors with explicit spatiotemporal boundaries. In this work, we propose ST-$\pi$, a structured spatiotemporal VLA model for robotic manipulation. Our model is guided by two key designs: 1) Spatiotemporal VLM. We encode 4D observations and task instructions into latent spaces, and feed them into the LLM to generate a sequence of causally ordered chunk-level action prompts consisting of sub-tasks, spatial grounding and temporal…
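To make the idea of causally ordered chunk-level action prompts more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `ChunkPrompt` class, its field names, the choice of a 3D target point for spatial grounding, and the step-index interval for temporal boundaries are all assumptions made for illustration, as are the example sub-tasks and numbers.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChunkPrompt:
    """Hypothetical chunk-level action prompt; all field names are illustrative."""
    sub_task: str                            # natural-language sub-task description
    target_xyz: Tuple[float, float, float]   # assumed form of spatial grounding
    start_step: int                          # assumed temporal boundary (inclusive)
    end_step: int                            # assumed temporal boundary (exclusive)


def is_causally_ordered(plan: List[ChunkPrompt]) -> bool:
    """Return True if each chunk is non-empty and starts no earlier than the previous one ends."""
    previous_end = 0
    for chunk in plan:
        if chunk.start_step < previous_end or chunk.end_step <= chunk.start_step:
            return False
        previous_end = chunk.end_step
    return True


def active_chunk(plan: List[ChunkPrompt], step: int) -> Optional[ChunkPrompt]:
    """Return the chunk whose temporal interval contains the given step, if any."""
    for chunk in plan:
        if chunk.start_step <= step < chunk.end_step:
            return chunk
    return None


# Made-up example plan for a three-stage manipulation task.
plan = [
    ChunkPrompt("reach for the cup", (0.40, 0.10, 0.05), 0, 20),
    ChunkPrompt("grasp the cup", (0.40, 0.10, 0.02), 20, 30),
    ChunkPrompt("place the cup on the shelf", (0.10, 0.50, 0.30), 30, 60),
]
```

In this sketch, a step-level controller would call `active_chunk(plan, step)` to find which sub-task and spatial target currently apply, which is one simple way to read "explicit spatiotemporal boundaries" in contrast to an implicit step-level mapping.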
