StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Yiran Shi; Dongqi Guo; Tianchen Zhao; Feng Gao; Liangzhi Shi; Chao Yu; ZhiJian Mo; Qihua Xiao; XiaoShuai Peng; Qingmin Liao; Yu Wang

arXiv:2603.28565·cs.RO·March 31, 2026

TL;DR

StreamingVLA is a streaming vision-language-action model that runs observation, action generation, and execution asynchronously and replaces action chunking with action flow matching, reducing latency and improving execution fluency on resource-constrained platforms.

Contribution

It proposes a novel streaming framework with action flow matching and adaptive observation, enabling faster, more fluent VLA model execution without performance loss.

Findings

01 Achieves 2.4× latency speedup

02 Reduces execution halting by 6.5×

03 Overlaps latency of action generation, execution, and observation

Abstract

Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, because the different stages of a VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs with the ability to asynchronously parallelize across VLA stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising…
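To illustrate why overlapping the three stages matters, here is a minimal timing sketch in Python. It is not the paper's method: the function names and stage durations are hypothetical, and the pipelined model makes the idealized assumption that each stage has its own worker and that an observation does not have to wait for the previous execution to finish. It only compares a strictly sequential loop with a fully overlapped one.

```python
def sequential_time(durations, steps):
    """Total time when each stage waits for the previous stage to finish."""
    return steps * sum(durations)


def pipelined_time(durations, steps):
    """Total time for an idealized pipeline with one worker per stage.

    finish[k] is the time at which stage k finished its most recent item.
    A stage starts an item once the item has left the previous stage and
    the stage itself is free.
    """
    finish = [0.0] * len(durations)
    for _ in range(steps):
        ready = 0.0  # time at which the current item leaves the previous stage
        for k, duration in enumerate(durations):
            start = max(ready, finish[k])
            finish[k] = start + duration
            ready = finish[k]
    return finish[-1]


# Hypothetical per-stage durations in milliseconds (NOT taken from the paper):
# observation, action generation, execution.
stage_ms = [30.0, 100.0, 50.0]

if __name__ == "__main__":
    steps = 10
    print("sequential:", sequential_time(stage_ms, steps), "ms")
    print("pipelined: ", pipelined_time(stage_ms, steps), "ms")
```

Under these assumptions the pipelined total is governed by the slowest stage rather than by the sum of all stages. In a real closed-loop system the next observation depends on what the robot has just executed, so this idealized bound is not reachable by scheduling alone.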
