LaST$_{0}$: Latent Spatio-Temporal Chain-of-Thought for Robotic Vision-Language-Action Model

Zhuoyang Liu; Jiaming Liu; Hao Chen; Jiale Yu; Ziyu Guo; Chengkai Hou; Chenyang Gu; Xiangju Mi; Renrui Zhang; Kun Wu; Zhengping Che; Jian Tang; Pheng-Ann Heng; Shanghang Zhang

arXiv:2601.05248·cs.RO·March 31, 2026

TL;DR

LaST$_{0}$ introduces a latent spatio-temporal reasoning framework for robotic vision-language-action tasks, improving efficiency and physical attribute modeling over prior methods.

Contribution

It proposes a token-efficient latent CoT space and a dual-system transformer architecture for implicit reasoning and high-frequency action generation in robotics.
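The dual-system idea can be pictured as a two-rate control loop: a slow component refreshes a latent "thought" occasionally, while a fast component emits an action at every step conditioned on the most recent latent. The sketch below is only an illustration of that scheduling pattern, not the paper's implementation. The names `run_dual_rate_loop`, `slow_reasoner`, `fast_actor`, and `reason_every` are hypothetical, the two components are stand-in callables rather than transformers, and the fixed refresh interval is an assumption made for simplicity (the paper describes adaptive switching, which this sketch does not model).

```python
from typing import Callable, List, Sequence, Tuple


def run_dual_rate_loop(
    observations: Sequence[float],
    slow_reasoner: Callable[[float], float],
    fast_actor: Callable[[float, float], float],
    reason_every: int,
) -> Tuple[List[float], List[int]]:
    """Act at every step; refresh the latent only every `reason_every` steps.

    All names here are hypothetical stand-ins for illustration.
    Returns the actions and the step indices at which reasoning ran.
    """
    if reason_every < 1:
        raise ValueError("reason_every must be at least 1")

    latent = 0.0  # overwritten at step 0, since 0 % reason_every == 0
    actions: List[float] = []
    reasoning_steps: List[int] = []

    for step, obs in enumerate(observations):
        # Slow path (assumed fixed schedule): recompute the latent.
        if step % reason_every == 0:
            latent = slow_reasoner(obs)
            reasoning_steps.append(step)
        # Fast path: every step reuses the most recent latent.
        actions.append(fast_actor(obs, latent))

    return actions, reasoning_steps


# Toy stand-ins: the "reasoner" scales the observation, the "actor" adds it back.
actions, reasoning_steps = run_dual_rate_loop(
    observations=[0.0, 1.0, 2.0, 3.0, 4.0],
    slow_reasoner=lambda obs: obs * 10.0,
    fast_actor=lambda obs, latent: latent + obs,
    reason_every=2,
)
print(actions)          # [0.0, 1.0, 22.0, 23.0, 44.0]
print(reasoning_steps)  # [0, 2, 4]
```

In this toy run the slow path executes on three of five steps while an action is produced on all five, which is the sense in which reasoning cost is amortized across high-frequency action generation.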

Findings

01. LaST$_{0}$ improves success rates by approximately 13-14% across various manipulation tasks.

02. The framework captures fine-grained physical and robotic dynamics effectively.

03. It enables adaptive switching between reasoning and acting during deployment.

Abstract

Vision-Language-Action (VLA) models have recently shown strong generalization, with some approaches seeking to explicitly generate linguistic reasoning traces or predict future observations prior to execution. However, explicit reasoning typically incurs non-negligible inference latency, which constrains the temporal resolution required for robotic manipulation. Moreover, such reasoning is confined to the linguistic space, imposing a representational bottleneck that struggles to faithfully capture ineffable physical attributes. To mitigate these limitations, we propose LaST$_{0}$, a framework that enables efficient reasoning before acting through a Latent Spatio-Temporal Chain-of-Thought (CoT), capturing fine-grained physical and robotic dynamics that are often difficult to verbalize. Specifically, we introduce a token-efficient latent CoT space that models future visual dynamics, 3D…
