VLA Knows Its Limits
Haoxuan Wang, Gengyu Zhang, Yan Yan, Ramana Rao Kompella, Gaowen Liu

TL;DR
This paper studies how the execution horizon affects flow-based Vision-Language-Action (VLA) models. It shows that performance first improves and then declines as the horizon increases, and it proposes AutoHorizon, a method that estimates the horizon dynamically at test time to adapt to environmental changes.
Contribution
It introduces AutoHorizon, described as the first method to estimate the execution horizon dynamically at test time, improving the adaptability and performance of flow-based VLA models.
Findings
Performance varies substantially with the execution horizon: it first improves and then declines as the horizon increases.
AutoHorizon adapts the execution horizon at test time, improving performance on robotic manipulation tasks.
AutoHorizon generalizes across tasks and models with minimal overhead.
Abstract
Action chunking has recently emerged as a standard practice in flow-based Vision-Language-Action (VLA) models. However, the effect and choice of the execution horizon (the number of actions to be executed from each predicted chunk) remain underexplored. In this work, we first show that varying the execution horizon leads to substantial performance deviations, with performance initially improving and then declining as the horizon increases. To uncover the reasons, we analyze the cross- and self-attention weights in flow-based VLAs and reveal two key phenomena: (i) intra-chunk actions attend invariantly to vision-language tokens, limiting adaptability to environmental changes; and (ii) the initial and terminal action tokens serve as stable anchors, forming latent centers around which intermediate actions are organized. Motivated by these insights, we interpret action self-attention…
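To make the abstract's definition of the execution horizon concrete, here is a minimal Python sketch of chunked execution with a fixed horizon. It is not AutoHorizon and not the paper's code: `run_with_chunking`, `predict_chunk`, `step`, and the toy one-dimensional environment are hypothetical names introduced only for illustration. The policy predicts a chunk of actions, only the first `execution_horizon` of them are executed, and then the policy is queried again with a fresh observation.

```python
from typing import Callable, Sequence, Tuple


def run_with_chunking(
    predict_chunk: Callable[[float], Sequence[float]],
    step: Callable[[float, float], float],
    initial_obs: float,
    execution_horizon: int,
    total_steps: int,
) -> Tuple[float, int]:
    """Execute `total_steps` actions, re-predicting a chunk every
    `execution_horizon` executed actions.

    Returns the final observation and the number of policy queries.
    """
    if execution_horizon < 1:
        raise ValueError("execution_horizon must be at least 1")

    obs = initial_obs
    executed = 0
    num_predictions = 0
    while executed < total_steps:
        chunk = predict_chunk(obs)  # one policy query per chunk
        num_predictions += 1
        # Execute only the first `execution_horizon` actions of the chunk;
        # the remaining predicted actions are discarded.
        n = min(execution_horizon, len(chunk), total_steps - executed)
        if n == 0:
            raise ValueError("predict_chunk returned an empty chunk")
        for action in chunk[:n]:
            obs = step(obs, action)
            executed += 1
    return obs, num_predictions


# Hypothetical toy setup: every chunk holds 8 unit actions, and the
# "environment" simply adds the action to a scalar state.
def toy_predict_chunk(obs: float) -> Sequence[float]:
    return [1.0] * 8


def toy_step(obs: float, action: float) -> float:
    return obs + action


if __name__ == "__main__":
    for horizon in (1, 4, 8):
        final_obs, queries = run_with_chunking(
            toy_predict_chunk, toy_step, 0.0, horizon, total_steps=10
        )
        print(f"horizon={horizon}: final_obs={final_obs}, queries={queries}")
```

In this toy setting a short horizon queries the policy often (more chances to react to a new observation), while a long horizon commits to more of each chunk between queries. The sketch only illustrates what the horizon controls; it says nothing about which value performs best.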
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
