VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching
Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu

TL;DR
VLA-Cache is a novel method that accelerates vision-language-action models for robotic control by adaptively caching visual tokens, reducing computation while maintaining accuracy, and enabling faster decision-making in real-time scenarios.
Contribution
The paper introduces VLA-Cache, a training-free inference acceleration technique that adaptively reuses visual tokens based on temporal continuity and attention, significantly improving efficiency in robotic manipulation tasks.
Findings
Achieves up to 1.7x speedup in CUDA latency.
Increases control frequency by 15%.
Maintains task success rate with negligible loss.
Abstract
Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity in robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby circumventing redundant computations. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, ensuring…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Robotics and Automated Systems
