Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation
Juntao Gao, Feiyang Ye, Jing Zhang, Wenjing Qian

TL;DR
Compressor-VLA introduces an instruction-guided visual token compression framework that significantly reduces computational costs in embodied AI models while maintaining task performance, enabling efficient real-time robotic manipulation.
Contribution
It presents a novel hybrid, instruction-conditioned token compression method combining holistic and spatial detail preservation for VLA models.
Findings
Achieves a 59% reduction in FLOPs and uses over 3x fewer visual tokens.
Maintains competitive success rates on the LIBERO benchmark.
Demonstrates effective sim-to-real transfer on dual-arm robots.
Abstract
Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. Standard token pruning techniques can alleviate this overhead, but because they are task-agnostic, they struggle to preserve task-critical visual information. To address the challenge of simultaneously preserving both holistic context and the fine-grained details needed for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that…
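The general idea of instruction-conditioned token compression can be illustrated with a minimal sketch. This is not the paper's STC or SRC module: it assumes a generic cross-attention pooling in which a small set of query vectors, shifted by an instruction embedding, attend over the visual tokens. The function name `compress_tokens`, the additive conditioning, the random stand-in embeddings, and all sizes are hypothetical choices made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compress_tokens(visual_tokens, instruction, queries):
    """Pool N visual tokens into len(queries) tokens via cross-attention.

    Assumption: each query is conditioned on the instruction by simple
    addition; the paper's actual conditioning mechanism may differ.
    """
    dim = len(instruction)
    scale = math.sqrt(dim)
    compressed = []
    for query in queries:
        # Shift the query by the instruction embedding (hypothetical choice).
        conditioned = [q + i for q, i in zip(query, instruction)]
        # Scaled dot-product score of the conditioned query against each token.
        scores = [
            sum(c * t for c, t in zip(conditioned, token)) / scale
            for token in visual_tokens
        ]
        weights = softmax(scores)
        # Each output token is an attention-weighted average of the inputs.
        pooled = [
            sum(w * token[j] for w, token in zip(weights, visual_tokens))
            for j in range(dim)
        ]
        compressed.append(pooled)
    return compressed


random.seed(0)
DIM, NUM_VISUAL, NUM_QUERIES = 8, 48, 16  # illustrative sizes, not the paper's
visual_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_VISUAL)]
instruction = [random.gauss(0, 1) for _ in range(DIM)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

compressed = compress_tokens(visual_tokens, instruction, queries)
print(len(visual_tokens), "->", len(compressed), "tokens")
```

In this sketch the number of output tokens is fixed by the number of queries, so the downstream model sees the same reduced sequence length regardless of the instruction. The instruction only changes which visual tokens each query weights most heavily.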
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning
