Understanding Asynchronous Inference Methods for Vision-Language-Action Models

Ayoub Agouzoul

arXiv:2605.08168·cs.RO·May 12, 2026

TL;DR

This paper systematically compares four asynchronous inference methods for vision-language-action models under controlled conditions, evaluating their effectiveness and robustness on the Kinetix and LIBERO benchmarks.

Contribution

It introduces two unified codebases for fair comparison and benchmarks the four methods, finding A2C2's residual correction the most effective and TT-RTC the most robust across delays.

Findings

01

A2C2 achieves over 90% solve rate up to delay 8 on Kinetix.

02

IT-RTC performs well at low delays but degrades with longer chunks.

03

TT-RTC is the most robust and generalizes beyond training delays.

Abstract

Vision-Language-Action (VLA) models offer a promising path to generalist robot control, but their inference latency causes observation staleness when generated actions are executed asynchronously. Several methods have been proposed concurrently to mitigate this problem: inference-time inpainting (IT-RTC), training-time delay simulation (TT-RTC), future-state-aware conditioning (VLASH), and lightweight residual correction (A2C2). Each takes a fundamentally different approach, but they have so far been evaluated independently with different codebases, base policies, and protocols. We present a systematic comparison of these four methods under controlled conditions. We develop two unified codebases that integrate all methods with harmonized library and dataset versions, and we benchmark them on the Kinetix suite with MLPMixer policies and on the LIBERO manipulation benchmark with SmolVLA,…
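
To make "observation staleness" concrete, here is a minimal toy sketch of asynchronous chunked execution. It is not taken from the paper or its repository. It assumes delay is measured in control steps, that a new inference starts every `stride` steps, and that the robot always executes from the newest chunk that has already arrived. The function name `staleness_per_step` and all parameter names are hypothetical.

```python
from typing import List, Optional


def staleness_per_step(delay: int, stride: int, horizon: int,
                       num_steps: int) -> List[Optional[int]]:
    """Age (in control steps) of the observation behind each executed action.

    Toy model (assumption, not the paper's protocol):
    - inference starts at steps 0, stride, 2*stride, ...
    - a chunk started at step t0 holds `horizon` actions for steps
      t0 .. t0 + horizon - 1 and becomes available at step t0 + delay
    - at step t the robot executes from the newest chunk that has arrived
    """
    if delay < 0 or stride <= 0:
        raise ValueError("delay must be >= 0 and stride must be > 0")
    if delay + stride > horizon:
        # Otherwise the current chunk runs out before the next one arrives.
        raise ValueError("need delay + stride <= horizon")

    ages: List[Optional[int]] = []
    for t in range(num_steps):
        if t < delay:
            ages.append(None)  # the first chunk has not arrived yet
            continue
        # Latest inference start whose chunk is available by step t.
        t0 = ((t - delay) // stride) * stride
        ages.append(t - t0)  # the action was computed from an observation this old
    return ages


if __name__ == "__main__":
    print(staleness_per_step(delay=2, stride=4, horizon=8, num_steps=10))
```

In this toy model, every executed action is based on an observation that is between `delay` and `delay + stride - 1` steps old, so a larger delay shifts the whole range upward. The four methods in the abstract can be read as different ways of compensating for that gap.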

Code & Models

Repositories

TheAyos/async-vla-inference
github
