DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Zhenjie Yang; Yilin Chai; Xiaosong Jia; Qifeng Li; Yuqian Shao; Xuekai Zhu; Haisheng Su; Junchi Yan

arXiv:2505.16278·cs.CV·May 19, 2026

TL;DR

DriveMoE is a Mixture-of-Experts framework for end-to-end autonomous driving. It dynamically selects the relevant camera views and driving behaviors for each scenario and achieves state-of-the-art results on Bench2Drive.

Contribution

This work pioneers the integration of Scene-Specialized Vision MoE and Skill-Specialized Action MoE into an end-to-end autonomous driving model, inspired by human cognition.

Findings

01. DriveMoE achieves state-of-the-art performance on the Bench2Drive evaluation.

02. Dynamic routing improves handling of diverse and rare driving scenarios.

03. Explicit behavioral specialization prevents mode-averaging issues.

Abstract

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that parameter specialization enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon Drive-$\pi_{0}$, our baseline derived from the $\pi_{0}$ Vision-Language-Action (VLA) model (originally from the embodied AI field). Specifically, we add the Vision MoE to Drive-$\pi_{0}$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues…
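The camera-routing idea in the abstract can be illustrated with a generic top-k softmax gate. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the router's form, so the linear scoring, the value of `k`, the gate renormalization, and the names `route_cameras` and `router_weights` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_cameras(context, router_weights, k=2):
    """Pick the k highest-scoring cameras for one driving context.

    context:        list of floats, a context embedding (hypothetical).
    router_weights: one weight vector per camera; a linear router is an
                    assumption made for illustration only.
    Returns the selected camera indices (ascending) and their gate
    weights, renormalized to sum to 1 over the selected cameras.
    """
    # One scalar score per camera: dot product of its weights with the context.
    logits = [sum(w * x for w, x in zip(row, context)) for row in router_weights]
    probs = softmax(logits)

    # Rank cameras by probability and keep the top k.
    ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    selected = sorted(ranked[:k])

    # Renormalize so the gates of the selected cameras sum to 1.
    mass = sum(probs[c] for c in selected)
    gates = {c: probs[c] / mass for c in selected}
    return selected, gates


# Toy example: three cameras, identity router weights, so each camera's
# score equals the matching entry of the context vector.
context = [0.1, 2.0, 1.0]
router_weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
selected, gates = route_cameras(context, router_weights, k=2)
print(selected)  # [1, 2]
```

In a full model, only the visual tokens of the selected cameras would be passed on, which is where the compute saving and the specialization would come from; how DriveMoE trains its router is not covered by the truncated abstract.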

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis

Methods: Mixture of Experts