3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

Vineet Bhat; Yu-Hsiang Lan; Prashanth Krishnamurthy; Ramesh Karri; Farshad Khorrami

arXiv:2505.05800·cs.RO·March 31, 2026

TL;DR

3D-CAVLA is a finetuning framework for vision-language-action (VLA) models in robotic manipulation. It integrates depth perception, chain-of-thought reasoning, and focused region detection to improve generalization and training efficiency.

Contribution

The paper introduces 3D-CAVLA, a finetuning framework that improves the task generalization of VLA policies through depth-aware perception, chain-of-thought reasoning, and region detection, validated extensively in simulation and on real-world tasks.

Findings

01 Achieves a 98.1% success rate on diverse in-domain tasks.

02 Improves success rate by 8.8% on unseen tasks.

03 Over 3x faster training convergence and a 25% gain on real-world tasks.

Abstract

Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent advances in Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task space velocities, typically trained on large datasets of teleoperated demonstrations. However, these models often struggle with generalization beyond their training distributions. In this work, we introduce 3D-CAVLA, a novel finetuning framework that enhances task generalization of VLA policies by incorporating three key components: (i) chain-of-thought reasoning for structured decision-making, (ii) depth-aware perception for 3D spatial…

Code & Models

Repositories

https://3d-cavla.github.io
