SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang

arXiv:2603.12193 · cs.RO · March 13, 2026

TL;DR

SaPaVe introduces a unified, data-efficient framework for active perception and manipulation in robotics, leveraging decoupled actions, a large-scale dataset, and a geometry-aware module to improve robustness and success rates.

Contribution

The paper presents SaPaVe, an end-to-end framework that decouples camera (perception) actions from manipulation actions, introduces the ActiveViewPose-200K dataset and the ActiveManip-Bench benchmark, and reports superior performance in real-world robotic tasks.

Findings

01. SaPaVe achieves up to 31.25% higher success rates in real-world tasks.

02. Decoupled action learning enhances robustness and generalization.

03. The framework outperforms recent vision-language-action models.

Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for…
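
The decoupled action space and the two-stage, bottom-up training schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the action dimensions, and the additive form of the joint loss are all hypothetical choices made for illustration, since the listing does not specify them.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical dimensions, chosen only for illustration (not taken from the paper).
CAMERA_DIM = 2   # e.g. pan and tilt deltas of an active camera
MANIP_DIM = 7    # e.g. a 6-DoF end-effector delta plus a gripper command


@dataclass
class DecoupledAction:
    """Camera and manipulation actions kept in separate fields
    rather than concatenated into one shared action vector."""
    camera: List[float]
    manipulation: List[float]


def toy_policy(features: List[float]) -> DecoupledAction:
    """Placeholder policy: two independent 'heads' read the same features.
    Real heads would be learned networks; these only fix the output shapes."""
    s = sum(features)
    return DecoupledAction(camera=[s] * CAMERA_DIM,
                           manipulation=[s] * MANIP_DIM)


def stage_loss(stage: int, camera_loss: float, manip_loss: float,
               manip_weight: float = 1.0) -> float:
    """Bottom-up schedule as described in the abstract.

    Stage 1: only the camera-control loss is optimized.
    Stage 2: camera and manipulation losses are optimized jointly.
    The weighted sum in stage 2 is an assumption of this sketch.
    """
    if stage == 1:
        return camera_loss
    if stage == 2:
        return camera_loss + manip_weight * manip_loss
    raise ValueError("stage must be 1 or 2")


action = toy_policy([0.1, 0.2])
print(len(action.camera), len(action.manipulation))
print(stage_loss(1, 0.5, 2.0), stage_loss(2, 0.5, 2.0))
```

Keeping the two action types in separate fields means the stage-1 objective can ignore manipulation entirely, which is the property the schedule relies on.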

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Soft Robotics and Applications