PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation

Qingyu Fan; Zhaoxiang Li; Yi Lu; Wang Chen; Qiu Shen; Xiao-xiao Long; Yinghao Cai; Tao Lu; Shuo Wang; Xun Cao

arXiv:2601.17885·cs.CV·January 27, 2026

TL;DR

PEAfowl is a perception-enhanced multi-view vision-language-action policy for bimanual manipulation. It combines 3D spatial reasoning, iterative instruction grounding, and depth distillation, and reports a 23.0 percentage point success-rate improvement over the baseline.

Contribution

The paper presents a novel multi-view VLA policy with 3D spatial reasoning, iterative instruction grounding, and depth distillation, enhancing generalization and robustness in bimanual manipulation tasks.

Findings

01  23.0 percentage points success rate improvement over baseline
02  Effective sim-to-real transfer demonstrated on real robots
03  Depth distillation enhances perception accuracy without inference overhead

Abstract

Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusion, viewpoint, and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features,…
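
The spatial-reasoning step described in the abstract (per-token depth distributions, 3D lifting, cross-view neighbor aggregation) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `lift_tokens` and `aggregate_neighbors`, the fixed depth bins, the pinhole camera model, and the equal-weight feature blend are all assumptions made for illustration. NumPy shows only the forward pass; taking the expectation over depth bins is what would keep the lifting differentiable in an autodiff framework.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lift_tokens(depth_logits, depth_bins, pixels, K, cam_to_world):
    """Lift image tokens to 3D world points (hypothetical helper).

    depth_logits: (N, D) per-token scores over D depth bins
    depth_bins:   (D,)   depth value of each bin (assumed fixed)
    pixels:       (N, 2) token centers in pixel coordinates (u, v)
    K:            (3, 3) pinhole intrinsics (assumed camera model)
    cam_to_world: (4, 4) camera pose
    """
    probs = softmax(depth_logits, axis=-1)
    depth = probs @ depth_bins                      # expected depth, (N,)
    ones = np.ones((pixels.shape[0], 1))
    rays = np.concatenate([pixels, ones], axis=1) @ np.linalg.inv(K).T
    pts_cam = rays * depth[:, None]                 # points in camera frame
    pts_h = np.concatenate([pts_cam, ones], axis=1)
    return (pts_h @ cam_to_world.T)[:, :3]          # points in world frame

def aggregate_neighbors(points, feats, view_ids, k=2):
    """Blend each token with its k nearest 3D neighbors from other views.

    Assumes every token has at least k tokens in other views; the 50/50
    blend is an illustrative choice, not taken from the paper.
    """
    dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
    dist[view_ids[:, None] == view_ids[None]] = np.inf  # skip same-view tokens
    idx = np.argsort(dist, axis=1)[:, :k]
    return 0.5 * feats + 0.5 * feats[idx].mean(axis=1)
```

In this sketch, tokens from different cameras that land near each other in world space share features, which is one simple way to obtain the cross-view consistency the abstract contrasts with view-agnostic token concatenation.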

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition