SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

Zewei Zhou, Ruining Yang, Xuewei (Tony) Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, and Jiaqi Ma

arXiv:2604.19710·cs.CV·April 22, 2026


TL;DR

SpanVLA is a vision-language-action framework for autonomous driving that cuts inference latency by planning trajectories with a flow-matching action expert, and improves robustness by learning from negative-recovery samples during GRPO-based post-training.

Contribution

SpanVLA introduces an end-to-end autonomous driving model that pairs autoregressive reasoning with flow-matching planning, a GRPO-based post-training method for robustness, and a new reasoning dataset.

Findings

01

Significantly reduces inference time in action planning.

02

Improves robustness by learning from negative and recovery behaviors.

03

Demonstrates competitive performance on NAVSIM datasets.

Abstract

Vision-Language-Action (VLA) models offer a promising paradigm for autonomous driving by leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high latency because they generate actions autoregressively, and they exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework that integrates autoregressive reasoning with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that uses the vision and reasoning guidance of the VLM to plan future trajectories with a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method to enable…
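To make the idea of a flow-matching policy with historical trajectory initialization more concrete, here is a minimal sketch of how such a sampler could work at inference time. It is not the authors' implementation. The function names (`extrapolate_history`, `euler_flow_sample`, `make_toy_velocity`), the constant-velocity initializer, and the Euler integrator are all assumptions made for illustration. In a real model the velocity field would be a learned network conditioned on VLM features; here it is replaced by a closed-form toy field that flows along a straight line toward a fixed target.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]
VelocityFn = Callable[[Trajectory, float], Trajectory]


def extrapolate_history(history: Trajectory, horizon: int) -> Trajectory:
    """Hypothetical initializer: constant-velocity extrapolation of the
    last two observed waypoints, used as the flow's starting sample."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * (k + 1), y1 + vy * (k + 1)) for k in range(horizon)]


def euler_flow_sample(init: Trajectory, velocity_fn: VelocityFn,
                      num_steps: int) -> Trajectory:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    traj = list(init)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        vel = velocity_fn(traj, t)
        traj = [(x + dt * vx, y + dt * vy)
                for (x, y), (vx, vy) in zip(traj, vel)]
    return traj


def make_toy_velocity(target: Trajectory) -> VelocityFn:
    """Toy stand-in for a learned network: the velocity of a straight-line
    path toward `target`, v = (target - x) / (1 - t)."""
    def velocity(traj: Trajectory, t: float) -> Trajectory:
        scale = 1.0 / (1.0 - t)  # t < 1 for every Euler step taken above
        return [((tx - x) * scale, (ty - y) * scale)
                for (x, y), (tx, ty) in zip(traj, target)]
    return velocity


history = [(0.0, 0.0), (1.0, 0.0)]              # past ego waypoints
init = extrapolate_history(history, horizon=3)  # starting sample for the flow
target = [(2.0, 0.5), (3.0, 1.0), (4.0, 1.5)]   # toy "ground-truth" plan
plan = euler_flow_sample(init, make_toy_velocity(target), num_steps=4)
```

The latency argument is visible in the loop structure: the whole trajectory is refined in a small, fixed number of integration steps rather than decoded one token at a time. Starting from a history-based initialization instead of pure noise is one plausible reason few steps might suffice, though the sketch itself does not demonstrate that.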
