TL;DR
3D-CAVLA is a finetuning framework that improves vision-language-action (VLA) models for robotic manipulation by integrating depth perception, structured reasoning, and focused region detection, leading to better generalization and faster training.
Contribution
The paper introduces 3D-CAVLA, a finetuning framework that enhances task generalization of VLA policies through depth-aware perception, chain-of-thought reasoning, and region detection, validated in both simulation and real-world experiments.
Findings
Achieves 98.1% success rate on diverse in-domain tasks.
Improves success rate by 8.8% on unseen tasks.
Converges over 3x faster in training and achieves a 25% gain on real-world tasks.
Abstract
Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent advances in Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task space velocities, typically trained on large datasets of teleoperated demonstrations. However, these models often struggle with generalization beyond their training distributions. In this work, we introduce 3D-CAVLA, a novel finetuning framework that enhances task generalization of VLA policies by incorporating three key components: (i) chain-of-thought reasoning for structured decision-making, (ii) depth-aware perception for 3D spatial…
