BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Peiyan Li; Yixiang Chen; Hongtao Wu; Xiao Ma; Xiangnan Wu; Yan Huang; Liang Wang; Tao Kong; Tieniu Tan

arXiv:2506.07961 · cs.RO · October 15, 2025

TL;DR

BridgeVLA is a 3D robot manipulation model that projects 3D inputs into multiple 2D images for a vision-language model backbone and predicts actions as 2D heatmaps, keeping inputs and outputs in the same 2D image space and improving sample efficiency.

Contribution

The paper presents a 3D VLA model that aligns 3D data with VLMs through 2D projections on the input side and 2D heatmaps on the output side, along with a scalable pre-training method that teaches the VLM backbone to predict 2D heatmaps before downstream policy learning.

Findings

01. Outperforms state-of-the-art methods in simulation benchmarks.

02. Achieves high success rates with minimal training data.

03. Demonstrates robust generalization in real-robot experiments.

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show the proposed method is able to learn 3D…
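
The two alignment steps named in the abstract can be illustrated with a toy sketch. The excerpt does not specify how BridgeVLA renders its views or decodes its heatmaps, so everything below is an assumption for illustration: the views are axis-aligned orthographic occupancy images, the position is decoded by taking the argmax pixel of each view and averaging the coordinates that two views share, and the function names `orthographic_views` and `heatmaps_to_position` are hypothetical.

```python
import numpy as np

# Which two world axes each (assumed) orthographic view keeps.
VIEW_AXES = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}


def orthographic_views(points, resolution=32, bounds=(-1.0, 1.0)):
    """Project an (N, 3) point cloud to three binary occupancy images.

    Assumption: simple axis-aligned orthographic projection inside a cubic
    workspace; this is not taken from the paper.
    """
    lo, hi = bounds
    # Map metric coordinates to integer pixel indices in [0, resolution - 1].
    scaled = (points - lo) / (hi - lo) * resolution
    idx = np.clip(np.floor(scaled).astype(int), 0, resolution - 1)
    views = {}
    for name, (a, b) in VIEW_AXES.items():
        img = np.zeros((resolution, resolution), dtype=np.float32)
        img[idx[:, a], idx[:, b]] = 1.0  # mark occupied pixels
        views[name] = img
    return views


def heatmaps_to_position(heatmaps, resolution=32, bounds=(-1.0, 1.0)):
    """Decode a 3D position from one 2D heatmap per view.

    Assumption: take the argmax pixel in each view and average the pixel
    coordinate of every world axis over the two views that observe it.
    """
    lo, hi = bounds
    sums = np.zeros(3)
    counts = np.zeros(3)
    for name, (a, b) in VIEW_AXES.items():
        u, v = np.unravel_index(np.argmax(heatmaps[name]), heatmaps[name].shape)
        sums[a] += u
        counts[a] += 1
        sums[b] += v
        counts[b] += 1
    pixel = sums / counts
    # Convert pixel centers back to metric coordinates.
    return lo + (pixel + 0.5) / resolution * (hi - lo)


if __name__ == "__main__":
    target = np.array([[0.3, -0.2, 0.5]])
    # A single-point cloud yields one-hot images, which double as toy heatmaps.
    toy_heatmaps = orthographic_views(target)
    print(heatmaps_to_position(toy_heatmaps))
```

In this sketch both the model input (projected images) and the model output (heatmaps) live on the same pixel grid, which is the input-output alignment the abstract describes; the recovered position is accurate only up to the pixel size of the chosen resolution.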

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Neural Network Applications