StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

Shengliang Deng; Mi Yan; Yixin Zheng; Jiayi Su; Wenhao Zhang; Xiaoguang Zhao; Heming Cui; Zhizheng Zhang; He Wang

arXiv:2512.21970·cs.RO·December 29, 2025

Open Access

TL;DR

StereoVLA is a vision-language-action model that uses stereo vision for robotic manipulation. It fuses geometric features drawn from stereo-view differences with semantic features from a monocular view, improving spatial perception and robustness.

Contribution

The paper proposes a Geometric-Semantic Feature Extraction module and an auxiliary Interaction-Region Depth Estimation task for stereo-vision-based VLA models.

Findings

01. Outperforms baselines by a large margin across diverse tasks.

02. Demonstrates robustness to camera pose variations.

03. Accelerates convergence through the auxiliary depth estimation task.

Abstract

Stereo cameras closely mimic human binocular vision, providing rich spatial cues critical for precise robotic manipulation. Despite their advantage, the adoption of stereo vision in vision-language-action models (VLAs) remains underexplored. In this work, we present StereoVLA, a VLA model that leverages rich geometric cues from stereo vision. We propose a novel Geometric-Semantic Feature Extraction module that utilizes vision foundation models to extract and fuse two key features: 1) geometric features from subtle stereo-view differences for spatial perception; 2) semantic-rich features from the monocular view for instruction following. Additionally, we propose an auxiliary Interaction-Region Depth Estimation task to further enhance spatial perception and accelerate model convergence. Extensive experiments show that our approach outperforms baselines by a large margin in diverse tasks…
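As background for why "subtle stereo-view differences" carry geometric information, the sketch below applies the standard rectified-stereo relation Z = f · B / d (depth equals focal length times baseline divided by disparity). This is textbook stereo geometry, not the paper's Geometric-Semantic Feature Extraction module, which relies on learned vision foundation models. The function name and the camera parameters (600 px focal length, 6 cm baseline) are hypothetical values chosen for illustration.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters of a point seen by a rectified stereo pair.

    Uses the standard pinhole relation Z = f * B / d, where d is the
    horizontal pixel shift of the point between the left and right views.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point at finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters (assumptions, not taken from the paper).
FOCAL_PX = 600.0    # focal length expressed in pixels
BASELINE_M = 0.06   # distance between the two camera centers, in meters

# A larger disparity means the point is closer to the cameras.
near_depth = depth_from_disparity(72.0, FOCAL_PX, BASELINE_M)
far_depth = depth_from_disparity(36.0, FOCAL_PX, BASELINE_M)

# Depth sensitivity: how much the depth estimate changes for a one-pixel
# disparity error. The error grows as the point moves farther away.
near_error = abs(depth_from_disparity(71.0, FOCAL_PX, BASELINE_M) - near_depth)
far_error = abs(depth_from_disparity(35.0, FOCAL_PX, BASELINE_M) - far_depth)

print(f"near: {near_depth:.3f} m, far: {far_depth:.3f} m")
print(f"1 px error -> near: {near_error:.4f} m, far: {far_error:.4f} m")
```

Because depth varies inversely with disparity, small pixel-level differences between the two views translate into metric distance, and the same one-pixel error costs more accuracy at longer range. That sensitivity is one reason precise manipulation benefits from features that preserve fine stereo detail.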

Taxonomy

Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Advanced Vision and Imaging