GeoVLA: Empowering 3D Representations in Vision-Language-Action Models

Lin Sun; Bin Xie; Yingfei Liu; Hao Shi; Tiancai Wang; Jiale Cao

arXiv:2508.09071 · cs.RO · August 14, 2025

TL;DR

GeoVLA introduces a 3D-aware vision-language-action framework for robotics, significantly improving spatial understanding and manipulation capabilities by integrating 3D geometric data with language and visual inputs.

Contribution

It is the first to effectively incorporate 3D geometric information into VLA models, enhancing robotic manipulation and spatial awareness.

Findings

01

Achieves state-of-the-art results on LIBERO and ManiSkill2 benchmarks.

02

Demonstrates robustness in real-world tasks with height and scale variations.

03

Outperforms existing models in simulation and real-world environments.

Abstract

Vision-Language-Action (VLA) models have emerged as a promising approach for enabling robots to follow language instructions and predict corresponding actions. However, current VLA models mainly rely on 2D visual inputs, neglecting the rich geometric information in the 3D physical world, which limits their spatial awareness and adaptability. In this paper, we present GeoVLA, a novel VLA framework that effectively integrates 3D information to advance robotic manipulation. It uses a vision-language model (VLM) to process images and language instructions, extracting fused vision-language embeddings. In parallel, it converts depth maps into point clouds and employs a customized point encoder, called Point Embedding Network, to generate 3D geometric embeddings independently. The resulting embeddings are then concatenated and processed by our proposed spatial-aware action expert, called…
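
The abstract mentions converting depth maps into point clouds but does not say how. A common way to do this is pinhole back-projection, sketched below. This is only an illustration under stated assumptions, not the authors' implementation: the pinhole camera model, the function name `depth_to_point_cloud`, and the intrinsics `fx`, `fy`, `cx`, `cy` are all hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def depth_to_point_cloud(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Back-project a depth map into camera-frame 3D points.

    Assumes a pinhole camera (an assumption, not stated in the abstract):
    pixel (u, v) with depth z maps to
        x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = z.
    `depth` is indexed as depth[v][u]; non-positive values are treated
    as missing measurements and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # invalid or missing depth reading
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, float(z)))
    return points


# Toy 2x2 depth map with one missing pixel (hypothetical values).
depth = [[1.0, 1.0],
         [0.0, 2.0]]
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```

The explicit loops keep the sketch dependency-free; a practical pipeline would typically vectorize the same arithmetic with an array library before passing the points to an encoder.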

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI