From Representational Complementarity to Dual Systems: Synergizing VLM and Vision-Only Backbones for End-to-End Driving

Sining Ang; Yuguang Yang; Chenxu Dang; Canyu Chen; Cheng Chi; Haiyan Liu; Xuanyao Mao; Jason Bao; Xuliang; Bingchuan Sun; Yan Wang

arXiv:2602.10719 · cs.RO · February 12, 2026

TL;DR

This paper investigates how vision-language models (VLMs) and vision-only backbones can be combined for end-to-end driving. It shows that the two behave in complementary ways and proposes a hybrid system that improves decision accuracy and efficiency.

Contribution

The paper presents a systematic analysis of VLM and vision-only backbones in driving, and proposes two systems, HybridDriveVLA and DualDriveVLA, that exploit their complementarity for better performance.

Findings

01. VLM introduces additional subspaces in the feature space.

02. VLM tends to be more aggressive in long-tail scenarios.

03. HybridDriveVLA achieves 92.10 PDMS, and DualDriveVLA improves throughput by 3.2x.

Abstract

Vision-Language-Action (VLA) driving augments end-to-end (E2E) planning with language-enabled backbones, yet it remains unclear what changes beyond the usual accuracy-cost trade-off. We revisit this question with a three-RQ analysis in RecogDrive by instantiating the system with a full VLM and with vision-only backbones, all under an identical diffusion Transformer planner. RQ1: At the backbone level, the VLM can introduce additional subspaces beyond those of the vision-only backbones. RQ2: These unique subspaces lead to different behavior in some long-tail scenarios: the VLM tends to be more aggressive whereas the ViT is more conservative, and each decisively wins on about 2-3% of test scenarios. With an oracle that selects, per scenario, the better trajectory between the VLM and ViT branches, we obtain an upper bound of 93.58 PDMS. RQ3: To fully harness this observation, we propose HybridDriveVLA, which…
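The oracle upper bound mentioned under RQ2 can be illustrated with a minimal sketch. It assumes each branch has already been scored once per scenario and that the aggregate metric is the plain mean of per-scenario scores. The function names, the `margin` threshold for a "decisive" win, and the toy scores are all hypothetical and do not come from the paper.

```python
def oracle_mean(vlm_scores, vit_scores):
    """Mean score when an oracle keeps the better branch in each scenario."""
    if len(vlm_scores) != len(vit_scores):
        raise ValueError("both branches must be scored on the same scenarios")
    if not vlm_scores:
        raise ValueError("at least one scenario is required")
    # Per scenario, the oracle picks whichever branch scored higher.
    best = [max(a, b) for a, b in zip(vlm_scores, vit_scores)]
    return sum(best) / len(best)


def decisive_win_fractions(vlm_scores, vit_scores, margin):
    """Fraction of scenarios each branch wins by more than `margin`.

    `margin` is a hypothetical threshold; the source does not say how a
    decisive win is defined.
    """
    n = len(vlm_scores)
    if n == 0 or n != len(vit_scores):
        raise ValueError("need equally sized, non-empty score lists")
    pairs = list(zip(vlm_scores, vit_scores))
    vlm_wins = sum(1 for a, b in pairs if a - b > margin)
    vit_wins = sum(1 for a, b in pairs if b - a > margin)
    return vlm_wins / n, vit_wins / n


# Made-up per-scenario scores in [0, 1], for illustration only.
toy_vlm = [1.0, 0.25, 0.75, 1.0]
toy_vit = [1.0, 0.75, 0.75, 0.5]

print("VLM mean:   ", sum(toy_vlm) / len(toy_vlm))
print("ViT mean:   ", sum(toy_vit) / len(toy_vit))
print("Oracle mean:", oracle_mean(toy_vlm, toy_vit))
print("Decisive win fractions (VLM, ViT):",
      decisive_win_fractions(toy_vlm, toy_vit, margin=0.25))
```

Because the oracle takes the per-scenario maximum, its mean can never fall below either branch's mean. That is why it serves as an upper bound for any selector choosing between the two trajectories.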

Taxonomy

Topics: Multimodal Machine Learning Applications · Autonomous Vehicle Technology and Safety · Reinforcement Learning in Robotics