MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
Yuzhou Huang, Benjin Zhu, Hengtong Lu, Victor Shea-Jay Huang, Haiming Zhang, Wei Chen, Jifeng Dai, Yan Xie, Hongsheng Li

TL;DR
MindVLA-U1 is a unified streaming VLA architecture for autonomous driving that integrates semantic reasoning, temporal context, and continuous control in a single, efficient system. It is reported to surpass experienced human drivers on WOD-E2E and to improve on prior state-of-the-art VA/VLA models in planning ADE.
Contribution
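The paper presents what the authors describe as the first unified streaming VLA architecture for autonomous driving: a single VLM backbone emits optional language tokens and continuous action trajectories in one forward pass, improving planning accuracy while keeping latency on par with VA models.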
Findings
Surpasses experienced human drivers on the WOD-E2E benchmark, with an RFS of 8.20.
Achieves state-of-the-art planning ADEs, ahead of prior VA/VLA models.
Matches VA latency at 16 FPS while retaining a natural-language interface.
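For readers unfamiliar with the planning metric named above, ADE (average displacement error) is commonly computed as the mean Euclidean distance between predicted and ground-truth waypoints. The sketch below is a minimal illustration under that common definition; the function name, the 2D waypoint format, and the sample values are assumptions made here, and the benchmark's exact evaluation protocol (horizons, sampling rate) is not given in this summary.

```python
import math


def average_displacement_error(pred, gt):
    """Mean Euclidean distance between matched (x, y) waypoints.

    Hypothetical helper for illustration; not the benchmark's official code.
    """
    if len(pred) != len(gt) or not pred:
        raise ValueError("trajectories must be non-empty and the same length")
    total = 0.0
    for (px, py), (gx, gy) in zip(pred, gt):
        total += math.hypot(px - gx, py - gy)
    return total / len(pred)


# Made-up waypoints: the prediction drifts 0, 1, and 2 units from the reference.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade = average_displacement_error(pred, gt)
```

Lower values mean the planned trajectory stays closer to the reference trajectory on average.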
Abstract
Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of…
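The flow-matching action output mentioned in the abstract can be illustrated with a toy sampler. In flow matching generally, a trajectory is produced by starting from noise and integrating a learned velocity field from t = 0 to t = 1. The sketch below is not the authors' implementation: `toy_velocity` is a hypothetical stand-in for a learned network (it uses the closed-form velocity of a straight-line path toward a known target), and the Euler integrator, step count, and flat waypoint layout are all assumptions chosen for illustration.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the velocity
    that carries x_t to the target x_1 is (x_1 - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_trajectory(noise, target, steps=10):
    """Euler-integrate the velocity field from t = 0 to t = 1."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Made-up target: three (x, y) waypoints flattened into one vector.
target = [1.0, 0.0, 2.0, 0.1, 3.0, 0.3]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
trajectory = sample_trajectory(noise, target)
```

In a real model the velocity would be predicted by a network conditioned on the backbone's representation rather than computed from a known target, so the samples would only approximate plausible trajectories.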
