PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
Shizhe Chen, Paul Pacaud, Cordelia Schmid

TL;DR
PointACT introduces a 3D-aware vision-language-action model that integrates hierarchical point cloud data with pretrained visual representations, significantly improving robotic manipulation success rates.
Contribution
The paper proposes a dual-system, 3D-aware VLA policy with multi-scale point-action interaction, which enhances spatial reasoning for robotic manipulation.
Findings
Achieves 10% higher success rates on the RLBench-10Tasks benchmark.
Outperforms state-of-the-art pretrained VLAs across benchmarks.
Tightly coupling 3D geometry with 2D semantic information is crucial for robustness.
Abstract
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation by leveraging large pretrained vision-language backbones. However, most existing VLAs rely primarily on 2D visual representations, which limit their ability to reason about fine-grained geometry and spatial grounding - capabilities that are essential for precise and robust manipulation in 3D environments. In this paper, we propose PointACT, a dual-system 3D-aware VLA policy that integrates hierarchical 3D point cloud representations directly into the action decoding process. PointACT employs a multi-scale point-action interaction mechanism with efficient bottleneck window self-attention, enabling evolving action tokens to densely attend to both local geometric detail and global scene structure. We evaluate PointACT on the LIBERO and RLBench benchmarks and systematically compare it…
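To make the idea of action tokens attending to a point cloud at several scales more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the voxel-average pooling, the coarse-to-fine ordering, the cell sizes, the absence of learned projections, and every function name (`voxel_pool`, `cross_attend`, `multi_scale_interaction`) are assumptions made for illustration. The bottleneck window self-attention mentioned in the abstract is not modeled here.

```python
import numpy as np


def voxel_pool(points, feats, cell):
    """Average the points and features that fall into the same cubic cell.

    Assumption: a plain voxel-grid average stands in for whatever
    hierarchical point encoder the paper actually uses.
    """
    keys = np.floor(points / cell).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # flat cell index per point
    n_cells = int(inverse.max()) + 1
    counts = np.bincount(inverse, minlength=n_cells)[:, None]
    pooled_pts = np.zeros((n_cells, points.shape[1]))
    pooled_feats = np.zeros((n_cells, feats.shape[1]))
    np.add.at(pooled_pts, inverse, points)
    np.add.at(pooled_feats, inverse, feats)
    return pooled_pts / counts, pooled_feats / counts


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(action_tokens, point_feats):
    """Single-head cross-attention: action tokens query point features.

    Assumption: no learned query/key/value projections, only scaled
    dot-product attention with a residual update.
    """
    d = action_tokens.shape[1]
    weights = softmax(action_tokens @ point_feats.T / np.sqrt(d))
    return action_tokens + weights @ point_feats


def multi_scale_interaction(action_tokens, points, feats, cells=(0.4, 0.2, 0.1)):
    """Update action tokens against progressively finer point-cloud scales.

    Assumption: the coarse-to-fine order and the cell sizes are illustrative.
    """
    for cell in cells:
        _, pooled_feats = voxel_pool(points, feats, cell)
        action_tokens = cross_attend(action_tokens, pooled_feats)
    return action_tokens


# Toy usage: 256 random points with 16-dim features, 8 action tokens.
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(256, 3))
feats = rng.normal(size=(256, 16))
tokens = rng.normal(size=(8, 16))
updated = multi_scale_interaction(tokens, points, feats)
```

In this sketch the coarse scale gives each action token a summary of global scene structure, and the finer scales then refine it with local geometric detail. The action-token shape is unchanged, so `updated` is again an 8 x 16 array.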
