TL;DR
This paper systematically compares four asynchronous inference methods for vision-language-action models, evaluating their effectiveness and robustness across benchmarks with controlled experiments.
Contribution
It introduces two unified codebases for fair comparison and benchmarks the four methods, finding that A2C2's residual correction is the most effective and that TT-RTC is the most robust across delays.
Findings
A2C2 maintains a solve rate above 90% up to a delay of 8 on Kinetix.
IT-RTC performs well at low delays but degrades with longer chunks.
TT-RTC is the most robust and generalizes beyond training delays.
Abstract
Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA,…
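The observation staleness described in the abstract can be illustrated with a small bookkeeping sketch. It is not an implementation of any of the four methods. It assumes the delay is measured in whole control steps, that each action chunk is computed from a single observation, and that a new inference is launched every `exec_horizon` steps so that chunks arrive back to back. The function name and all parameters are hypothetical and chosen only for illustration.

```python
def observation_age(num_steps, delay, exec_horizon, chunk_len):
    """For each control step, return how many steps old the observation
    behind the executed action is (None while no chunk has arrived yet).

    Assumed setup (illustrative, not taken from the paper):
    - inference k starts at step k * exec_horizon and takes `delay` steps;
    - its chunk is then executed for `exec_horizon` steps, starting at
      chunk index `delay`, because the first `delay` actions are already
      in the past when the chunk arrives.
    """
    if delay + exec_horizon > chunk_len:
        raise ValueError("chunk too short to cover delay plus execution horizon")

    ages = []
    for t in range(num_steps):
        if t < delay:
            # The first chunk is still being computed.
            ages.append(None)
            continue
        k = (t - delay) // exec_horizon  # index of the chunk in use at step t
        t_obs = k * exec_horizon         # step at which its observation was taken
        ages.append(t - t_obs)
    return ages


if __name__ == "__main__":
    # With a delay of 2 steps, every executed action is at least 2 steps stale.
    print(observation_age(num_steps=8, delay=2, exec_horizon=3, chunk_len=5))
```

Under these assumptions the age of the observation never drops below `delay`, which is the gap that the compared methods try to compensate for in different ways.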
