TL;DR
FSDrive introduces a visual spatio-temporal Chain-of-Thought framework for autonomous driving, generating future scene representations to enhance planning, perception, and safety in end-to-end vision-language-action models.
Contribution
The paper presents a novel visual CoT approach that predicts future scene states, bridging perception and planning in autonomous driving models.
Findings
Improves trajectory accuracy and reduces collisions on the nuScenes and NAVSIM datasets.
Achieves competitive video generation quality with a lightweight autoregressive model.
Enhances scene understanding on DriveLM.
Abstract
Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on…
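The two-role idea in the abstract (one model first imagines a future scene, then plans from it) can be illustrated with a deliberately tiny sketch. Everything here is hypothetical and not taken from the paper: `Scene`, `ToyDriver`, `imagine_future`, and `plan` are made-up names, scenes are reduced to 2D points instead of images with lane dividers and 3D boxes, the "world model" is plain constant-velocity extrapolation, and the "inverse dynamics" step is linear interpolation. It only shows the control flow, not FSDrive's actual method.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class Scene:
    """Toy stand-in for a camera frame: ego position plus other agents (assumption)."""
    ego: Point
    agents: List[Point]


def _extrapolate(prev: Point, curr: Point) -> Point:
    # Constant-velocity guess: next = curr + (curr - prev).
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


class ToyDriver:
    """Hypothetical single object playing both roles, echoing the 'same VLA' idea."""

    def imagine_future(self, history: List[Scene]) -> Scene:
        # World-model role: predict one future scene from the last two observations.
        prev, curr = history[-2], history[-1]
        return Scene(
            ego=_extrapolate(prev.ego, curr.ego),
            agents=[_extrapolate(a, b) for a, b in zip(prev.agents, curr.agents)],
        )

    def plan(self, current: Scene, imagined: Scene, n_waypoints: int = 3) -> List[Point]:
        # Inverse-dynamics role: recover waypoints connecting the current ego
        # position to the ego position in the imagined future scene.
        (x0, y0), (x1, y1) = current.ego, imagined.ego
        return [
            (x0 + (x1 - x0) * k / n_waypoints, y0 + (y1 - y0) * k / n_waypoints)
            for k in range(1, n_waypoints + 1)
        ]


if __name__ == "__main__":
    history = [Scene(ego=(0, 0), agents=[(5, 0)]), Scene(ego=(1, 0), agents=[(5, 1)])]
    driver = ToyDriver()
    imagined = driver.imagine_future(history)       # step 1: imagine the future scene
    waypoints = driver.plan(history[-1], imagined)  # step 2: plan conditioned on it
    print(imagined, waypoints)
```

The point of the sketch is the ordering: the planner never sees a textual description of the future, only the imagined scene itself, which is the role the abstract assigns to the visual spatio-temporal CoT.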
