TL;DR
This report presents practical techniques for deploying Vision-Language-Action (VLA) models on real robots, reaching execution speeds on par with casual human operation in tasks that require both accuracy and dexterity.
Contribution
It describes a set of practical techniques spanning calibration, planning and control, and a learning-based method for identifying the optimal execution speed, which together enable real-time VLA deployment on physical robots.
Findings
Robots execute tasks at speeds comparable to casual human operation.
In the tasks shown, execution speed approaches the hardware limit of the lightweight arm while meeting the accuracy and dexterity the tasks require.
Unaccelerated videos and inference traces are provided as evidence of real-world performance.
Abstract
When deploying VLA models on real-world robotic tasks, execution speed matters. In previous work (arXiv:2510.26742) we analyzed how to make the neural computation of VLAs fast on GPUs. However, we left open the question of how to actually deploy the VLA system on real robots. In this report we describe a set of practical techniques for achieving the end-to-end result of running a VLA-driven robot at high speed in real-world tasks that require both accuracy and dexterity. The technology stack spans calibration, planning & control, and a learning-based method for identifying the optimal execution speed. In the tasks we show, the robot executes at a speed on par with casual human operation, approaching the hardware limit of our lightweight arm. Unaccelerated videos and inference traces are provided at https://dexmal.github.io/realtime-vla-v2/.
