Compressor-VLA: Instruction-Guided Visual Token Compression for Efficient Robotic Manipulation

Juntao Gao; Feiyang Ye; Jing Zhang; Wenjing Qian

arXiv:2511.18950·cs.RO·November 25, 2025

Open Access

TL;DR

Compressor-VLA is an instruction-guided visual token compression framework for Vision-Language-Action (VLA) models. It cuts FLOPs by 59% and visual tokens by more than 3x while keeping task success rates competitive, which makes real-time robotic manipulation more practical.

Contribution

The paper presents a hybrid, instruction-conditioned token compression method for VLA models that preserves both holistic context and fine-grained spatial detail.

Findings

1. Achieves a 59% reduction in FLOPs and over 3x fewer visual tokens.

2. Maintains competitive success rates on the LIBERO benchmark.

3. Demonstrates effective sim-to-real transfer on dual-arm robots.

Abstract

Vision-Language-Action (VLA) models have emerged as a powerful paradigm in Embodied AI. However, the significant computational overhead of processing redundant visual tokens remains a critical bottleneck for real-time robotic deployment. Standard token pruning techniques can alleviate this, but these task-agnostic methods struggle to preserve task-critical visual information. To address the challenge of simultaneously preserving both holistic context and fine-grained details for precise action, we propose Compressor-VLA, a novel hybrid instruction-conditioned token compression framework designed for efficient, task-oriented compression of visual information in VLA models. The framework consists of two token compression modules: a Semantic Task Compressor (STC) that distills holistic, task-relevant context, and a Spatial Refinement Compressor (SRC) that…
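
As a rough illustration of what instruction-conditioned token compression can look like, the sketch below pools a set of visual tokens into a smaller set using queries that are shifted by an instruction embedding. It is not the paper's STC or SRC implementation, whose details are not given in this listing. The function name `compress_tokens`, the additive conditioning, the random inputs, and all sizes (256 tokens compressed to 64) are assumptions chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compress_tokens(visual_tokens, queries, instruction):
    """Cross-attention pooling (hypothetical helper, not from the paper).

    Each query is shifted by the instruction embedding, then produces one
    output token as an attention-weighted average of the visual tokens.
    """
    dim = len(instruction)
    scale = 1.0 / math.sqrt(dim)
    compressed = []
    for query in queries:
        # Assumed conditioning scheme: add the instruction embedding to the query.
        cond_query = [q + i for q, i in zip(query, instruction)]
        weights = softmax([scale * dot(cond_query, tok) for tok in visual_tokens])
        pooled = [
            sum(w * tok[d] for w, tok in zip(weights, visual_tokens))
            for d in range(dim)
        ]
        compressed.append(pooled)
    return compressed


# Illustrative sizes only; random vectors stand in for real embeddings.
rng = random.Random(0)
dim, num_visual, num_queries = 8, 256, 64
visual_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_visual)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
instruction = [rng.gauss(0, 1) for _ in range(dim)]

compressed = compress_tokens(visual_tokens, queries, instruction)
print(len(visual_tokens), "->", len(compressed))
```

Because the number of output tokens equals the number of queries, the compression ratio in a scheme like this is set by how many queries are used, and the instruction changes which visual tokens each query attends to.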

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning