Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving
Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang

TL;DR
This paper introduces a structured dataset and a compact vision-language model for autonomous driving, significantly improving decision accuracy and inference speed by leveraging structured scene descriptions.
Contribution
The paper presents NuScenes-S, a structured dataset, and FastDrive, a lightweight VLM, enhancing efficiency and decision-making in autonomous driving applications.
Findings
FastDrive achieves 20% higher decision accuracy.
Over 10x inference speedup compared to larger models.
Scene annotations significantly impact decision-making performance.
Abstract
Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remains between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, high computational cost and massive scale of VLMs hinder the inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing(e.g., LLaVA-1.5), FastDrive…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Neural Network Applications · Autonomous Vehicle Technology and Safety
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Focus
