AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving

Wenhui Huang; Songyan Zhang; Qihang Huang; Zhidong Wang; Zhiqi Mao; Collister Chua; Zhan Chen; Long Chen; Chen Lv

arXiv:2603.14851·cs.CV·May 15, 2026

TL;DR

AutoMoT introduces a unified vision-language-action model with a mixture-of-transformers architecture for efficient, end-to-end autonomous driving, leveraging pre-trained VLMs and asynchronous inference to improve scene understanding and decision-making.

Contribution

The paper proposes an end-to-end autonomous driving (AD) framework that unifies reasoning and action generation in a single model, using a mixture-of-transformers with asynchronous execution to improve inference efficiency while retaining reasoning capability.
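The asynchronous execution mentioned above can be illustrated with a toy schedule in which an expensive reasoning step runs only occasionally and a cheap action step runs every tick, reusing the most recent reasoning output. This is a minimal sketch under stated assumptions, not the authors' implementation: `slow_reasoner`, `fast_policy`, `run_fast_slow_loop`, and the `slow_every` refresh interval are hypothetical names, and the loop simulates the refresh schedule sequentially rather than running the two parts concurrently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReasoningContext:
    """Cached output of the slow reasoning step (hypothetical structure)."""
    computed_at_step: int
    summary: str


def slow_reasoner(step: int, observation: int) -> ReasoningContext:
    # Stand-in for an expensive reasoning pass; a real system would call a model here.
    return ReasoningContext(computed_at_step=step, summary=f"scene summary for obs {observation}")


def fast_policy(observation: int, context: ReasoningContext) -> Dict[str, int]:
    # Stand-in for a cheap action step that reuses the cached reasoning context.
    return {"observation": observation, "context_step": context.computed_at_step}


def run_fast_slow_loop(observations: List[int], slow_every: int = 3) -> List[Dict[str, int]]:
    """Run the fast step every tick and refresh the slow context every `slow_every` ticks."""
    context: Optional[ReasoningContext] = None
    actions: List[Dict[str, int]] = []
    for step, obs in enumerate(observations):
        # Refresh the cached context on the first tick and then on a fixed interval.
        if context is None or step % slow_every == 0:
            context = slow_reasoner(step, obs)
        actions.append(fast_policy(obs, context))
    return actions


if __name__ == "__main__":
    for action in run_fast_slow_loop(list(range(7)), slow_every=3):
        print(action)
```

In this toy schedule each action records which step its reasoning context came from, which makes the staleness of the cached context visible. How the actual model shares attention between its reasoning and action components is not specified in this listing and is not modeled here.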

Findings

1. AutoMoT achieves competitive performance on multiple benchmarks.

2. Pre-trained VLMs can handle scene understanding with semantic prompting without fine-tuning.

3. Fine-tuning remains necessary for action-level tasks like decision-making.

Abstract

Integrating vision-language models (VLMs) into end-to-end (E2E) autonomous driving (AD) systems has shown promise in improving scene understanding. However, existing integration strategies suffer from several limitations: they either struggle to resolve distribution misalignment between reasoning and action spaces, underexploit the general reasoning capabilities of pre-trained VLMs, or incur substantial inference latency during action policy generation, which degrades driving performance. To address these challenges, we propose AutoMoT, an end-to-end AD framework that unifies reasoning and action generation within a single vision-language-action (VLA) model. Our approach leverages a mixture-of-transformers (MoT) architecture with joint attention sharing, which preserves the general reasoning capabilities of pre-trained VLMs while enabling efficient fast-slow inference through…

Code & Models

Repositories

https://automot-website.github.io
github

Models

Oscar-Huang/AutoMoT (Hugging Face model, 64 downloads)

Datasets

Oscar-Huang/nuSync (dataset, 46 downloads)
