Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
Tao Lin, Yuxin Du, Jiting Liu, Nuobei Zhu, Yunhe Li, Yuqian Fu, Yinxinyu Chen, Hongyi Cai, Zewei Ye, Bing Cheng, Kai Ye, Yiran Mao, Yilei Zhong, MingKang Dong, Junchi Yan, Gen Li, Bo Zhao

TL;DR
Evo-Depth is a lightweight depth-enhanced vision-language-action (VLA) model that improves robotic manipulation by integrating implicit depth features derived from RGB images, requiring no extra sensors and keeping resource usage low.
Contribution
It proposes an efficient depth-encoding and spatial-enhancement framework that strengthens VLA models without additional hardware or large geometry foundation models.
Findings
Evo-Depth outperforms existing models on four simulation benchmarks.
It achieves the highest success rate in real-world experiments.
The model is compact, with only 0.9B parameters, and uses less GPU memory.
Abstract
Vision-Language-Action models have emerged as a promising paradigm for robotic manipulation by unifying perception, language grounding, and action generation. However, they often struggle in scenarios requiring precise spatial understanding, as current VLA models primarily rely on 2D visual representations that lack depth information and detailed spatial relationships. While recent approaches incorporate explicit 3D inputs such as depth maps or point clouds to address this issue, they often increase system complexity, require additional sensors, and remain vulnerable to sensing noise and reconstruction errors. Another line of work explores implicit 3D-aware spatial modeling directly from RGB observations without extra sensors, but it often relies on large geometry foundation models, resulting in higher training and deployment costs. To address these challenges, we propose Evo-Depth, a…
