LaST-VLA: Thinking in Latent Spatio-Temporal Space for Vision-Language-Action in Autonomous Driving

Yuechen Luo; Fang Li; Shaoqing Xu; Yang Ji; Zehan Zhang; Bing Wang; Yuannan Shen; Jianwei Cui; Long Chen; Guang Chen; Hangjun Ye; Zhi-Xin Yang; Fuxi Wen

arXiv:2603.01928·cs.CV·March 13, 2026

TL;DR

LaST-VLA introduces a physically grounded latent reasoning framework for autonomous driving, improving spatial-temporal understanding and safety by integrating geometric constraints and dynamic foresight into a unified model.

Contribution

It proposes a novel Latent Spatio-Temporal Chain-of-Thought approach that incorporates geometric and dynamic constraints into latent reasoning for autonomous driving.

Findings

01. Achieved new state-of-the-art results on the NAVSIM v1 and v2 benchmarks.

02. Excelled at spatio-temporal reasoning on SURDS and NuDynamics.

03. Improved safety and rule compliance through reinforcement learning.

Abstract

While Vision-Language-Action (VLA) models have revolutionized autonomous driving by unifying perception and planning, their reliance on explicit textual Chain-of-Thought (CoT) leads to semantic-perceptual decoupling and perceptual-symbolic conflicts. Recent shifts toward latent reasoning attempt to bypass these bottlenecks by thinking in continuous hidden space. However, without explicit intermediate constraints, standard latent CoT often operates as a physics-agnostic representation. To address this, we propose the Latent Spatio-Temporal VLA (LaST-VLA), a framework shifting the reasoning paradigm from discrete symbolic processing into a physically grounded Latent Spatio-Temporal CoT. By implementing a dual-feature alignment mechanism, we distill geometric constraints from 3D foundation models and dynamic foresight from world models directly into the latent space. Coupled with a…
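
To make the idea of a dual-feature alignment objective more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss, so the cosine-distance form, the two separate latent groups, and every name below (`alignment_loss`, `dual_alignment_loss`, `w_geo`, `w_dyn`) are assumptions chosen only for illustration. The sketch pulls one group of latent tokens toward feature vectors from a frozen geometric teacher and another group toward feature vectors from a frozen dynamics teacher, then sums the two terms with weights.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(latents, targets):
    """Mean cosine distance (1 - cosine similarity) over paired token vectors.

    Hypothetical choice of distance; the source does not name one.
    """
    if len(latents) != len(targets) or not latents:
        raise ValueError("latents and targets must be non-empty and equal in length")
    total = sum(1.0 - cosine_similarity(z, t) for z, t in zip(latents, targets))
    return total / len(latents)


def dual_alignment_loss(spatial_latents, geometric_targets,
                        temporal_latents, dynamic_targets,
                        w_geo=1.0, w_dyn=1.0):
    """Weighted sum of a geometric and a dynamic alignment term.

    spatial_latents / temporal_latents: latent reasoning tokens (assumed split).
    geometric_targets: features from a frozen 3D model (assumed teacher).
    dynamic_targets: features from a frozen world model (assumed teacher).
    """
    geo = alignment_loss(spatial_latents, geometric_targets)
    dyn = alignment_loss(temporal_latents, dynamic_targets)
    return w_geo * geo + w_dyn * dyn


# Toy usage with 2-D vectors: the spatial latent matches its target exactly,
# the temporal latent is orthogonal to its target.
loss = dual_alignment_loss(
    spatial_latents=[[1.0, 0.0]], geometric_targets=[[1.0, 0.0]],
    temporal_latents=[[1.0, 0.0]], dynamic_targets=[[0.0, 1.0]],
    w_geo=1.0, w_dyn=0.5,
)
```

In a real training setup this term would presumably be added to the planning loss and computed on tensors with automatic differentiation; plain Python lists are used here only to keep the sketch self-contained.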

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics