FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Shuang Zeng; Xinyuan Chang; Mengwei Xie; Xinran Liu; Yifan Bai; Zheng Pan; Mu Xu; Xing Wei; Ning Guo

arXiv:2505.17685·cs.CV·November 12, 2025

TL;DR

FSDrive introduces a visual spatio-temporal Chain-of-Thought framework for autonomous driving, generating future scene representations to enhance planning, perception, and safety in end-to-end vision-language-action models.

Contribution

The paper presents a novel visual CoT approach that predicts future scene states, bridging perception and planning in autonomous driving models.

Findings

1. Improves trajectory accuracy and reduces collisions on the nuScenes and NAVSIM datasets.
2. Achieves competitive video generation quality with a lightweight autoregressive model.
3. Enhances scene understanding on DriveLM.

Abstract

Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on…
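The two-stage flow described in the abstract (imagine a future frame first, then plan) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `FutureFrame` container, the `think_then_plan` function, the toy stand-ins, and the choice to condition the planner on both the current observation and the imagined frame are all assumptions made for this sketch. In FSDrive a single VLA plays both roles; two callables are used here only to keep the stages visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (x, y) waypoint in the ego frame; the units and frame are assumed for this sketch.
Waypoint = Tuple[float, float]


@dataclass
class FutureFrame:
    """Hypothetical container for the imagined future scene (the visual CoT)."""
    background: object                                   # predicted future image (placeholder type)
    lane_dividers: List[object] = field(default_factory=list)  # predicted future lane dividers
    boxes_3d: List[object] = field(default_factory=list)       # predicted future 3D object boxes


def think_then_plan(
    observation: object,
    world_model: Callable[[object], FutureFrame],
    planner: Callable[[object, FutureFrame], List[Waypoint]],
) -> Tuple[FutureFrame, List[Waypoint]]:
    # Stage 1: act as a world model and generate one unified future frame.
    cot = world_model(observation)
    # Stage 2: act as an inverse-dynamics model and plan a trajectory.
    # Passing (observation, cot) is an assumption of this sketch.
    trajectory = planner(observation, cot)
    return cot, trajectory


# Toy stand-ins so the sketch runs end to end; they carry no real model logic.
def toy_world_model(obs: object) -> FutureFrame:
    return FutureFrame(background=obs)


def toy_planner(obs: object, cot: FutureFrame) -> List[Waypoint]:
    # Placeholder plan: three waypoints straight ahead, 1 unit apart.
    return [(float(i), 0.0) for i in range(1, 4)]


cot, traj = think_then_plan("front_camera_image", toy_world_model, toy_planner)
print(traj)
```

Keeping the imagined frame as an explicit intermediate value mirrors the "think visually" idea: the plan is derived from a predicted scene rather than from a textual summary of it.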

Code & Models

Hugging Face model: Paipai-zxa/EvoDriveVLA
