ActiveVLA: Injecting Active Perception into Vision-Language-Action Models for Precise 3D Robotic Manipulation

Zhenyang Liu; Yongchong Gu; Yikai Wang; Xiangyang Xue; Yanwei Fu

arXiv:2601.08325·cs.RO·January 14, 2026

TL;DR

ActiveVLA introduces an active perception framework for robotic manipulation that dynamically selects viewpoints and zooms into critical regions, significantly improving precision and adaptability in complex tasks.

Contribution

This work presents ActiveVLA, a novel framework integrating active perception with vision-language-action models for enhanced 3D manipulation accuracy.

Findings

01. Outperforms state-of-the-art baselines in simulation benchmarks
02. Enables high-precision manipulation in complex environments
03. Successfully transfers to real-world robotic tasks

Abstract

Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios. To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine…

Code & Models

Models

🤗 ZhenyangLiu/ActiveVLA · model · ♡ 2

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization