Running VLAs at Real-time Speed
Yunchao Ma, Yizhuang Zhou, Yunhuan Yang, Tiancai Wang, Haoqiang Fan

TL;DR
This paper shows how to run large multi-view vision-language-action (VLA) models in real time on a single consumer GPU, enabling dynamic tasks such as grasping a falling pen that were previously believed to be out of reach for models of this size.
Contribution
It introduces a set of strategies that eliminate overheads in model inference, bringing pi0-level VLA models to real-time performance on a single consumer GPU, and proposes a full streaming inference framework for real-time robot control.
Findings
Ran a pi0-level multi-view VLA at a 30Hz frame rate and up to 480Hz trajectory frequency on a single consumer GPU.
Achieved a 100% success rate in a real-world task of grasping a falling pen.
Proposed a full streaming inference framework for real-time robot control with VLA models.
Abstract
In this paper, we show how to run pi0-level multi-view VLA at 30Hz frame rate and at most 480Hz trajectory frequency using a single consumer GPU. This enables dynamic and real-time tasks that were previously believed to be unattainable by large VLA models. To achieve it, we introduce a bag of strategies to eliminate the overheads in model inference. The real-world experiment shows that the pi0 policy with our strategy achieves a 100% success rate in grasping a falling pen task. Based on the results, we further propose a full streaming inference framework for real-time robot control of VLA. Code is available at https://github.com/Dexmal/realtime-vla.
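To put the headline numbers in perspective, the short Python sketch below works out the arithmetic implied by the abstract's figures. It is not taken from the paper or its repository: the function names are hypothetical, and the waypoints-per-frame ratio is only an arithmetic reading of the two quoted rates, not a statement about how the authors generate trajectories.

```python
# Back-of-the-envelope arithmetic for the rates quoted in the abstract.
# Function names are illustrative and do not come from the paper's code.

def frame_budget_ms(frame_rate_hz: float) -> float:
    """Time available to process one camera frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def waypoints_per_frame(trajectory_hz: float, frame_rate_hz: float) -> float:
    """Trajectory points per processed frame, assuming (hypothetically)
    that the trajectory rate is a simple multiple of the frame rate."""
    return trajectory_hz / frame_rate_hz


budget = frame_budget_ms(30.0)             # 30Hz frame rate from the abstract
ratio = waypoints_per_frame(480.0, 30.0)   # 480Hz is the stated upper bound

print(f"Per-frame latency budget: {budget:.1f} ms")
print(f"Trajectory points per frame at the upper bound: {ratio:.0f}")
```

At 30Hz, every step of the pipeline for one frame has to fit in roughly 33 ms, which is why eliminating inference overheads matters for a task as fast as catching a falling pen.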
