LVDrive: Latent Visual Representation Enhanced Vision-Language-Action Autonomous Driving Model

Xiaodong Mei; Diankun Zhang; Hongwei Xie; Guang Chen; Hangjun Ye; Dan Xu

arXiv:2605.22089·cs.CV·May 22, 2026

TL;DR

LVDrive adds a latent-space future scene prediction task to vision-language-action models, which the authors report improves autonomous driving performance by strengthening scene understanding and reasoning.

Contribution

The paper proposes a latent-space future scene prediction framework with a unified embedding and a two-stage trajectory decoding strategy for autonomous driving.

Findings

01

LVDrive outperforms existing methods on the Bench2Drive benchmark.

02

The model achieves significant improvements in closed-loop driving performance.

03

Joint modeling of future scene and motion prediction enhances reasoning capabilities.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising framework for end-to-end autonomous driving. However, existing VLAs typically rely on sparse action supervision, which underutilizes their powerful scene understanding and reasoning capabilities. Recent attempts to incorporate dense visual supervision via world modeling often overemphasize pixel-level image reconstruction, neglecting semantically meaningful scene representation learning. In this work, we propose LVDrive, a Latent Visual representation enhanced VLA framework for autonomous driving. LVDrive introduces a future scene prediction task into the VLA paradigm, where future representations are learned entirely in a high-level latent space under auxiliary supervision from a pretrained vision backbone. Departing from inefficient autoregressive generation, we jointly model future scene and motion prediction within a…
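The idea of supervising future representations in a latent space, rather than reconstructing pixels, can be illustrated with a small sketch. This is not the authors' implementation: the excerpt does not state the loss, the backbone, or the weighting. The cosine-distance objective, the function names (`latent_alignment_loss`, `total_loss`), and the `aux_weight` parameter are all assumptions made for illustration. Feature vectors stand in for predicted future-scene tokens and for features a frozen pretrained vision backbone might produce on the real future frame.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def latent_alignment_loss(predicted, target):
    """Mean (1 - cosine similarity) over paired latent tokens.

    predicted: latent tokens the driving model predicts for a future frame.
    target:    features of that future frame from a frozen backbone
               (hypothetical stand-in; assumed, not taken from the paper).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("predicted and target must be non-empty and the same length")
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


def total_loss(action_loss, predicted, target, aux_weight=0.5):
    """Sparse action loss plus a weighted dense latent loss (weight is an assumption)."""
    return action_loss + aux_weight * latent_alignment_loss(predicted, target)


if __name__ == "__main__":
    predicted_tokens = [[1.0, 0.0], [0.0, 1.0]]
    backbone_tokens = [[1.0, 0.0], [1.0, 0.0]]
    print(total_loss(2.0, predicted_tokens, backbone_tokens))
```

The design point the sketch tries to convey is that the auxiliary term compares compact feature vectors, so the model is never asked to reproduce pixel-level detail; only the action term drives the trajectory output directly.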
