LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model
Xiaodong Mei, Diankun Zhang, Hongwei Xie, Guang Chen, Hangjun Ye, Dan Xu

TL;DR
LVDrive adds a latent-space future scene prediction task to vision-language-action (VLA) models, improving autonomous driving performance through stronger scene understanding and reasoning.
Contribution
The paper proposes a latent-space future scene prediction framework for autonomous driving, built on a unified embedding and a two-stage trajectory decoding strategy.
Findings
LVDrive outperforms existing methods on the Bench2Drive benchmark.
The model achieves significant improvements in closed-loop driving performance.
Joint modeling of future scene and motion prediction enhances reasoning capabilities.
Abstract
Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a…
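To make the idea of latent-space supervision concrete, here is a minimal sketch of a feature-level loss that compares predicted future scene tokens against target features from a frozen pretrained backbone. The abstract does not state which loss LVDrive uses; the cosine-distance objective, the function names `cosine_distance` and `latent_prediction_loss`, and the list-of-vectors token layout are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector, eps: float = 1e-8) -> float:
    """Return 1 - cosine similarity between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return 1.0 - dot / (norm_a * norm_b + eps)


def latent_prediction_loss(predicted: Sequence[Vector],
                           target: Sequence[Vector]) -> float:
    """Mean cosine distance over tokens (hypothetical auxiliary loss).

    predicted: future scene tokens produced by the driving model.
    target: features of the real future frame from a frozen, pretrained
            vision backbone (assumed to be precomputed elsewhere).
    """
    if len(predicted) != len(target):
        raise ValueError("token counts must match")
    if not predicted:
        raise ValueError("at least one token is required")
    total = sum(cosine_distance(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)


if __name__ == "__main__":
    pred = [[1.0, 0.0], [0.0, 2.0]]
    tgt = [[1.0, 0.0], [2.0, 0.0]]
    print(latent_prediction_loss(pred, tgt))
```

Because the comparison happens between feature vectors rather than pixels, a loss of this kind penalizes semantic mismatch without asking the model to reconstruct image detail, which is the contrast the abstract draws with pixel-level world modeling.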
