Running VLAs at Real-time Speed

Yunchao Ma; Yizhuang Zhou; Yunhuan Yang; Tiancai Wang; Haoqiang Fan

arXiv:2510.26742·cs.RO·October 31, 2025

TL;DR

This paper shows how to run a large (pi0-level) multi-view vision-language-action (VLA) model at real-time speed on a single consumer GPU, enabling dynamic tasks, such as grasping a falling pen, that were previously believed to be out of reach for large VLA models.

Contribution

The paper introduces a set of strategies that eliminate overheads in model inference, allowing a pi0-level VLA model to run in real time on a single consumer GPU, and proposes a full streaming inference framework for real-time robot control.

Findings

01

Achieved a 30Hz frame rate and a trajectory frequency of at most 480Hz on a single consumer GPU.

02

Attained a 100% success rate in a real-world task of grasping a falling pen, using the pi0 policy with the proposed strategies.

03

Proposed a full streaming inference framework for real-time robot control with VLA models.

Abstract

In this paper, we show how to run pi0-level multi-view VLA at 30Hz frame rate and at most 480Hz trajectory frequency using a single consumer GPU. This enables dynamic and real-time tasks that were previously believed to be unattainable by large VLA models. To achieve it, we introduce a bag of strategies to eliminate the overheads in model inference. The real-world experiment shows that the pi0 policy with our strategy achieves a 100% success rate in grasping a falling pen task. Based on the results, we further propose a full streaming inference framework for real-time robot control of VLA. Code is available at https://github.com/Dexmal/realtime-vla.
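To make the headline numbers concrete, the short Python sketch below works out what they imply arithmetically. It is an illustration only, not code from the paper or its repository. The function names are hypothetical, and it assumes that "trajectory frequency" means the rate at which action waypoints are emitted and that each processed frame contributes an equal share of them.

```python
# Back-of-the-envelope arithmetic for the rates quoted in the abstract.
# Assumption (not stated in the source): every processed frame yields an
# equal number of trajectory waypoints.

def frame_budget_ms(frame_rate_hz: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / frame_rate_hz


def waypoints_per_frame(trajectory_hz: float, frame_rate_hz: float) -> float:
    """Trajectory waypoints that must be produced for each processed frame."""
    return trajectory_hz / frame_rate_hz


budget = frame_budget_ms(30.0)             # 30Hz frame rate from the abstract
points = waypoints_per_frame(480.0, 30.0)  # 480Hz upper bound from the abstract

print(f"Per-frame budget: {budget:.1f} ms")
print(f"Waypoints per frame at 480Hz: {points:.0f}")
```

Under these assumptions, a full inference pass has to fit in roughly 33 ms, and reaching the 480Hz upper bound corresponds to 16 waypoints per frame.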
