AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

Yuxuan Han; Kunyuan Wu; Qianyi Shao; Renxiang Xiao; Zilu Wang; Cansen Jiang; Yi Xiao; Liang Hu; Yunjiang Lou

arXiv:2602.04256 · cs.RO · February 5, 2026

TL;DR

AppleVLM is a vision-language model for end-to-end autonomous driving that adds dedicated perception and planning modules to address weaknesses in perception and decision-making. It reports state-of-the-art results on CARLA benchmarks and has been deployed on a real AGV platform for outdoor driving.

Contribution

The paper introduces AppleVLM, a novel end-to-end autonomous driving model with a deformable transformer-based vision encoder and a planning modality, improving robustness and generalization over prior VLM approaches.

Findings

1. Achieved state-of-the-art performance on CARLA benchmarks.
2. Successfully deployed on a real AGV platform for outdoor driving.
3. Enhanced perception robustness through multi-view, multi-timestep fusion.

Abstract

End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using…

Taxonomy

Topics: Autonomous Vehicle Technology and Safety · Advanced Neural Network Applications · Multimodal Machine Learning Applications