TL;DR
DriveMoE introduces a Mixture-of-Experts framework for end-to-end autonomous driving that dynamically selects relevant camera views and driving behaviors, improving the handling of diverse scenarios and achieving state-of-the-art results on Bench2Drive.
Contribution
Inspired by human driving cognition, this work pioneers the integration of a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE into an end-to-end autonomous driving model.
Findings
DriveMoE achieves state-of-the-art performance on Bench2Drive evaluation.
Dynamic routing improves handling of diverse and rare driving scenarios.
Explicit behavioral specialization avoids mode averaging, where distinct driving behaviors are blended into a single compromised output.
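The mode-averaging finding can be illustrated with a small arithmetic sketch. This is not the paper's model: the steering values, the skill names, and the `act` helper are all hypothetical, chosen only to show why a single regression head blends two valid maneuvers into one invalid one while separate experts keep them apart.

```python
# Hypothetical demonstration data: steering targets (radians) recorded for the
# same observation, half of them a left turn and half a right turn.
left_turn, right_turn = -0.5, 0.5
demos = [left_turn] * 50 + [right_turn] * 50

# A single head trained with squared error converges to the mean target,
# which here means driving straight: a behavior no demonstration contains.
single_head = sum(demos) / len(demos)

# Two skill-specialized heads (assumed names), each fit only to its own mode.
left_demos = [d for d in demos if d < 0]
right_demos = [d for d in demos if d > 0]
experts = {
    "turn_left": sum(left_demos) / len(left_demos),
    "turn_right": sum(right_demos) / len(right_demos),
}

def act(skill):
    """Return the steering command of the expert chosen by a (hypothetical) router."""
    return experts[skill]

print(single_head, act("turn_left"), act("turn_right"))
```

The single head outputs the average of the two modes, whereas routing to one expert returns a command that matches an actual demonstrated maneuver.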
Abstract
End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our π0 Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π0. Specifically, we add Vision MoE to Drive-π0 by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues…
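The camera-selection idea in the abstract can be sketched as a top-k router over per-camera scores. This is a minimal illustration under stated assumptions, not the authors' implementation: the camera names, the `select_cameras` function, the value of `k`, and the example logits are hypothetical, and in a real system the logits would come from a learned network conditioned on the driving context.

```python
import math

# Hypothetical camera layout; the abstract does not list the views used.
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_cameras(router_logits, k=2):
    """Return the k cameras with the highest router probability, plus all probabilities."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(CAMERAS)), key=lambda i: probs[i], reverse=True)
    return [CAMERAS[i] for i in ranked[:k]], probs

# Assumed router output for a right-turn context: front and front-right score highest.
logits = [2.0, 0.1, 1.5, -1.0, 0.0, -0.5]
chosen, probs = select_cameras(logits, k=2)
print(chosen)
```

Only the selected views would then be passed to the downstream model, which is the sense in which the router mimics a driver attending to a few crucial visual cues.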
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis
Methods: Mixture of Experts
