MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

Yuzhou Huang; Benjin Zhu; Hengtong Lu; Victor Shea-Jay Huang; Haiming Zhang; Wei Chen; Jifeng Dai; Yan Xie; Hongsheng Li

arXiv:2605.12624·cs.RO·May 15, 2026

TL;DR

MindVLA-U1 is a unified streaming VLA architecture for autonomous driving that combines semantic reasoning, temporal context, and continuous control in a single system. It is reported to surpass experienced human drivers on WOD-E2E (8.20 RFS) and to beat prior VA/VLA models on planning ADE while matching VA latency at 16 FPS.

Contribution

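This paper presents MindVLA-U1, described by the authors as the first unified streaming VLA architecture for autonomous driving. It processes driving video and language commands with a single VLM backbone that produces optional autoregressive language tokens and flow-matching action trajectories in one forward pass, improving planning accuracy while keeping latency on par with VA models.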

Findings

01

Surpasses experienced human drivers on the WOD-E2E benchmark, scoring 8.20 RFS.

02

Achieves state-of-the-art planning ADE, ahead of prior VA/VLA models.

03

Matches VA latency at 16 FPS while retaining a natural-language interface.
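
For readers unfamiliar with the ADE metric cited in the findings, here is a minimal sketch of average displacement error in its common form: the mean Euclidean distance between predicted and ground-truth waypoints at matching time steps. The function name, the 2D `(x, y)` waypoint format, and the absence of a time horizon are assumptions made for illustration; the exact evaluation protocol used in the paper is not stated in this excerpt.

```python
import math
from typing import Sequence, Tuple

Waypoint = Tuple[float, float]  # assumed (x, y) position in metres


def average_displacement_error(
    predicted: Sequence[Waypoint], ground_truth: Sequence[Waypoint]
) -> float:
    """Mean L2 distance between waypoints at matching time steps."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    )
    return total / len(predicted)


# Hypothetical two-waypoint example: errors of 0 m and 5 m average to 2.5 m.
example_ade = average_displacement_error(
    [(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)]
)
```

Lower ADE means the planned trajectory stays closer to the reference trajectory on average.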

Abstract

Autonomous driving has progressed from modular pipelines toward end-to-end unification, and Vision-Language-Action (VLA) models are a natural extension of this journey beyond Vision-to-Action (VA). In practice, driving VLAs have often trailed VA on planning quality, suggesting that the difficulty is not simply model scale but the interface through which semantic reasoning, temporal context, and continuous control are combined. We argue that this gap reflects how VLA has been built -- as isolated subtask improvements that fail to compose coherent driving capabilities -- rather than what VLA is. We present MindVLA-U1, the first unified streaming VLA architecture for autonomous driving. A unified VLM backbone produces AR language tokens (optional) and flow-matching continuous action trajectories in a single forward pass over one shared representation, preserving the natural output form of…
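
The abstract mentions flow-matching continuous action trajectories. As a rough illustration of how flow-matching sampling works in general, the sketch below integrates a velocity field from Gaussian noise at t = 0 to a trajectory at t = 1 with fixed Euler steps. This is not the MindVLA-U1 implementation: `sample_trajectory`, `toy_velocity`, the step count, and the 2D waypoint format are all hypothetical, and the learned, VLM-conditioned velocity network is replaced by a closed-form toy field that points straight at a known target.

```python
import random
from typing import Callable, List, Tuple

Waypoint = Tuple[float, float]
Trajectory = List[Waypoint]
VelocityField = Callable[[Trajectory, float], Trajectory]


def sample_trajectory(
    velocity: VelocityField, num_waypoints: int, num_steps: int = 10, seed: int = 0
) -> Trajectory:
    """Euler-integrate dx/dt = velocity(x, t) from noise (t=0) to t=1."""
    rng = random.Random(seed)
    # Start from Gaussian noise, one 2D sample per waypoint.
    traj = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(num_waypoints)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = velocity(traj, t)
        traj = [(x + dt * vx, y + dt * vy) for (x, y), (vx, vy) in zip(traj, v)]
    return traj


def toy_velocity(target: Trajectory) -> VelocityField:
    """Stand-in for a learned network: the straight-line field toward `target`."""

    def field(traj: Trajectory, t: float) -> Trajectory:
        scale = 1.0 / (1.0 - t)  # t < 1 at every Euler step, so this is finite
        return [
            ((tx - x) * scale, (ty - y) * scale)
            for (x, y), (tx, ty) in zip(traj, target)
        ]

    return field


# Hypothetical target path; the toy field carries the noise onto it.
target_path = [(1.0, 0.0), (2.0, 0.5), (3.0, 1.5)]
sampled_path = sample_trajectory(toy_velocity(target_path), num_waypoints=3)
```

In a real system the velocity field would be a trained network conditioned on the scene representation, and the number of integration steps trades sampling cost against accuracy.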
