Token Expand-Merge: Training-Free Token Compression for Vision-Language-Action Models
Yifan Ye, Jiaqi Ma, Jun Cen, Zhihe Lu

TL;DR
TEAM-VLA is a training-free token compression method that accelerates vision-language-action models by dynamically expanding and merging tokens, improving inference speed without retraining.
Contribution
It introduces a novel training-free token expansion and merging framework for VLA models, enhancing efficiency while preserving performance.
Findings
Significantly speeds up inference on the LIBERO benchmark
Maintains or improves task success rates compared to full models
Operates without retraining or parameter updates
Abstract
Vision-Language-Action (VLA) models pretrained on large-scale multimodal datasets have emerged as powerful foundations for robotic perception and control. However, their massive scale, often billions of parameters, poses significant challenges for real-time deployment, as inference becomes computationally expensive and latency-sensitive in dynamic environments. To address this, we propose Token Expand-and-Merge-VLA (TEAM-VLA), a training-free token compression framework that accelerates VLA inference while preserving task performance. TEAM-VLA introduces a dynamic token expansion mechanism that identifies and samples additional informative tokens in the spatial vicinity of attention-highlighted regions, enhancing contextual completeness. These expanded tokens are then selectively merged in deeper layers under action-aware guidance, effectively reducing redundancy while maintaining…
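The two steps the abstract describes, expansion around attention-highlighted regions and later merging under guidance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`expand_tokens`, `merge_tokens`), the square neighbourhood with a `radius` parameter, the cosine-similarity criterion, and the single `guide` vector standing in for "action-aware guidance" are all assumptions made for illustration, since the abstract does not give these details.

```python
import math

def expand_tokens(attn, grid_h, grid_w, top_k, radius=1):
    """Pick the top_k highest-attention patches on a grid_h x grid_w grid,
    then also keep every patch within `radius` cells of each pick.
    (Hypothetical expansion rule; the paper's sampling may differ.)"""
    assert len(attn) == grid_h * grid_w
    seeds = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:top_k]
    keep = set()
    for idx in seeds:
        r, c = divmod(idx, grid_w)
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < grid_h and 0 <= cc < grid_w:
                    keep.add(rr * grid_w + cc)
    return sorted(keep)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, guide, n_keep):
    """Keep the n_keep tokens most similar to `guide` as anchors and average
    every remaining token into its most similar anchor.
    (`guide` is an assumed stand-in for action-aware guidance.)"""
    order = sorted(range(len(tokens)),
                   key=lambda i: cosine(tokens[i], guide), reverse=True)
    anchors = sorted(order[:n_keep])
    groups = {a: [tokens[a]] for a in anchors}
    for i in order[n_keep:]:
        best = max(anchors, key=lambda a: cosine(tokens[i], tokens[a]))
        groups[best].append(tokens[i])
    dim = len(guide)
    return [[sum(v[d] for v in groups[a]) / len(groups[a]) for d in range(dim)]
            for a in anchors]

# Usage: a 4x4 patch grid whose attention peaks at patch 5 (row 1, column 1).
attn = [0.0] * 16
attn[5] = 1.0
kept = expand_tokens(attn, grid_h=4, grid_w=4, top_k=1, radius=1)

# Three toy 2-D token features reduced to two merged tokens.
merged = merge_tokens([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
                      guide=[1.0, 0.0], n_keep=2)
```

In this toy setup the expansion step keeps the peak patch plus its eight grid neighbours, and the merge step folds the least guide-aligned token into whichever anchor it most resembles, so the sequence shrinks without discarding that token's content outright.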
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
