GeoVLA: Empowering 3D Representations in Vision-Language-Action Models
Lin Sun, Bin Xie, Yingfei Liu, Hao Shi, Tiancai Wang, Jiale Cao

TL;DR
GeoVLA introduces a 3D-aware vision-language-action framework for robotics, significantly improving spatial understanding and manipulation capabilities by integrating 3D geometric data with language and visual inputs.
Contribution
GeoVLA is presented as the first VLA model to effectively incorporate 3D geometric information, enhancing robotic manipulation and spatial awareness.
Findings
Achieves state-of-the-art results on LIBERO and ManiSkill2 benchmarks.
Demonstrates robustness in real-world tasks with height and scale variations.
Outperforms existing models in simulation and real-world environments.
Abstract
Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability. In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called Point Embedding Network, to generate 3D geometric embeddings independently. These produced embeddings are then concatenated and processed by our proposed spatial-aware action expert, called…
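The depth-to-point-cloud step mentioned in the abstract can be illustrated with a standard pinhole back-projection. This is a minimal sketch, not the authors' implementation: the function name `depth_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all assumptions chosen for illustration, and the abstract does not specify how GeoVLA handles invalid depth values or camera calibration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), with depth in metric units.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row (image y)
        for u, z in enumerate(row):      # u: pixel column (image x)
            if z <= 0:                   # assumption: non-positive depth is invalid
                continue
            x = (u - cx) * z / fx        # pinhole model: x = (u - cx) * z / fx
            y = (v - cy) * z / fy        # pinhole model: y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (illustrative values only).
depth = [[1.0, 2.0], [0.0, 4.0]]
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

In this sketch the resulting list of `(x, y, z)` tuples stands in for the point cloud that a point encoder would consume. A practical pipeline would likely vectorize the computation and subsample the points before encoding.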
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
