Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models
Xudong Tan, Yaoxin Yang, Peng Ye, Jialin Zheng, Bizhe Bai, Xinyi Wang, Jia Hao, Tao Chen

TL;DR
This paper introduces FlashVLA, a training-free framework that significantly accelerates vision-language-action model inference by reusing actions and pruning visual tokens, achieving over 55% reduction in FLOPs with minimal accuracy loss.
Contribution
The paper presents FlashVLA, a novel plug-and-play method for reducing inference costs in VLA models through token-aware action reuse and visual token pruning, without retraining.
Findings
Reduces FLOPs by 55.7% on the LIBERO benchmark.
Decreases latency by 36.0%.
Maintains 99.3% of task success rate.
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. However, their high inference cost, stemming from large-scale token computation and autoregressive decoding, poses significant challenges for real-time deployment and edge applications. While prior work has primarily focused on architectural optimization, we take a different perspective by identifying a dual form of redundancy in VLA models: (i) high similarity across consecutive action steps, and (ii) substantial redundancy in visual tokens. Motivated by these observations, we propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models. FlashVLA improves inference efficiency through a token-aware action reuse mechanism that avoids redundant decoding across stable action steps, and an…
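The two ideas named in the abstract, reusing an action across stable steps and keeping only a subset of visual tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine and Jaccard measures, and the values of `action_threshold` and `token_threshold` are all assumptions chosen for illustration, and the importance scores are made-up numbers standing in for whatever signal a real model would provide.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two action vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_tokens(scores, k):
    """Indices of the k highest-scoring visual tokens (a simple pruning rule)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap between two token-index sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def should_reuse(prev_action, last_action, prev_tokens, curr_tokens,
                 action_threshold=0.95, token_threshold=0.8):
    """Hypothetical gate: reuse the last action only when recent actions
    are similar AND the retained token set has barely changed."""
    actions_stable = cosine_similarity(prev_action, last_action) >= action_threshold
    tokens_stable = jaccard(prev_tokens, curr_tokens) >= token_threshold
    return actions_stable and tokens_stable


# Illustrative inputs only: two recent actions and per-token importance scores.
prev_action = [0.10, 0.20, 0.30]
last_action = [0.11, 0.20, 0.29]
kept_before = top_k_tokens([0.9, 0.1, 0.8, 0.2, 0.7], k=3)
kept_static = top_k_tokens([0.85, 0.15, 0.75, 0.1, 0.8], k=3)
kept_changed = top_k_tokens([0.1, 0.9, 0.2, 0.8, 0.7], k=3)

reuse_static = should_reuse(prev_action, last_action, kept_before, kept_static)
reuse_changed = should_reuse(prev_action, last_action, kept_before, kept_changed)
```

In this sketch the gate requires both conditions, so a scene change that shifts the retained tokens forces a fresh decode even when recent actions look alike. Whether FlashVLA combines its signals this way is not stated in the text above.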
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- FlashVLA can be plugged directly into existing VLA models (e.g., OpenVLA, UniVLA) without retraining, which makes it both practical and reproducible.
- The extensive ablation studies on the number of tokens, different modules, and diverse VLA architectures make the results convincing.
- Similarity across consecutive action steps and redundancy across visual tokens have both been well studied. The claim to be “the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models” also needs more justification.
- The method's hyperparameters appear to require careful, per-setting tuning, which may limit reproducibility. Specifically, the parameter $\delta$, which controls the token set stability threshold $\epsilon_2$, is set to different values for each token budg
1. Substantial reduction in FLOPs and inference latency without additional fine-tuning.
2. The method is straightforward and can be integrated with models that use Flash Attention for inference.
All experiments are conducted on simulated manipulation benchmarks (LIBERO, VLABench). The paper lacks validation on tasks that involve highly dynamic actions (requiring frequent and rapid changes in actuators) and rapidly changing visual scenes (with significant perturbations in objects and background).
- Proposes a simple yet effective framework to reduce redundant computations in VLA inference.
- Training-free and compatible with FlashAttention, enabling easy integration into existing models.
- Demonstrates strong empirical results on multiple benchmarks.
- Includes detailed ablation and sensitivity analyses.
- The approach primarily combines known concepts (token pruning, reuse heuristics) without strong theoretical advancement.
- The method is evaluated mostly in simulated environments; real-robot deployment results are missing.
- Performance may depend on manually tuned thresholds; no adaptive mechanism is proposed.
- Works such as VLA-Cache [1], TinyVLA [2], EfficientVLA [3] should be discussed and experimentally compared.
- Some related token pruning methods should also be discussed and compa
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Semantic Web and Ontologies
