ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation
Zhenyang Liu, Yongchong Gu, Yikai Wang, Xiangyang Xue, Yanwei Fu

TL;DR
ActiveVLA introduces an active perception framework for robotic manipulation that dynamically selects viewpoints and zooms into critical regions, significantly improving precision and adaptability in complex tasks.
Contribution
This work presents ActiveVLA, a novel framework integrating active perception with vision-language-action models for enhanced 3D manipulation accuracy.
Findings
Outperforms state-of-the-art baselines in simulation benchmarks
Enables high-precision manipulation in complex environments
Successfully transfers to real-world robotic tasks
Abstract
Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsRobot Manipulation and Learning · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
