Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

Hao Jiang; Chuan Hu; Yukang Shi; Yuan He; Ke Wang; Xi Zhang; Zhipeng Zhang

arXiv:2506.05442·cs.CV·May 19, 2026

TL;DR

This paper introduces a structured dataset and a compact vision-language model for autonomous driving, significantly improving decision accuracy and inference speed by leveraging structured scene descriptions.

Contribution

The paper presents NuScenes-S, a structured dataset, and FastDrive, a lightweight VLM, enhancing efficiency and decision-making in autonomous driving applications.

Findings

01. FastDrive achieves 20% higher decision accuracy.

02. Over 10x inference speedup compared to larger models.

03. Scene annotations significantly impact decision-making performance.

Abstract

Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remain between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, the high computational cost and massive scale of VLMs hinder inference speed and real-world deployment. To bridge these gaps, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing (e.g., LLaVA-1.5), FastDrive…
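To make the contrast between loosely formatted descriptions and machine-friendly structured labels concrete, here is a minimal Python sketch. The `SceneLabel` class, its field names, and the example values are hypothetical illustrations and are not taken from the NuScenes-S annotation format, which the abstract does not spell out. The sketch only shows the general idea: a fixed set of typed fields can be serialized compactly and parsed back without ambiguity, which free-form text does not guarantee.

```python
from dataclasses import dataclass, asdict
import json


# Hypothetical schema: field names and values are illustrative assumptions,
# NOT the actual NuScenes-S annotation format.
@dataclass(frozen=True)
class SceneLabel:
    weather: str
    time_of_day: str
    road_type: str
    traffic_light: str
    lead_vehicle: bool


def to_structured(label: SceneLabel) -> str:
    """Serialize the label as compact JSON, keeping the field order."""
    return json.dumps(asdict(label), separators=(",", ":"))


def from_structured(text: str) -> SceneLabel:
    """Parse the compact JSON back into a typed label."""
    return SceneLabel(**json.loads(text))


# A loosely formatted description of the same (made-up) scene, for comparison.
free_form = (
    "It is a rainy night and the ego vehicle is driving on an urban road. "
    "There is a red traffic light ahead and a vehicle in front."
)

label = SceneLabel("rainy", "night", "urban", "red", True)
structured = to_structured(label)

# The structured form round-trips exactly; the free-form sentence would
# need a language model or hand-written rules to recover the same fields.
assert from_structured(structured) == label
print(structured)
```

Under these assumptions, every scene shares the same keys, so a model's output can be checked field by field rather than compared as free text.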

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus