StereoVGGT: A Training-Free Visual Geometry Transformer for Stereo Vision
Ziyang Chen, Yansong Qu, You Shen, Xuan Cheng, Liujuan Cao

TL;DR
StereoVGGT is a training-free, geometry-aware backbone for stereo vision that builds on a pretrained visual geometry transformer (VGGT) and improves stereo matching performance, ranking first on the KITTI benchmark.
Contribution
The paper introduces StereoVGGT, a training-free feature adjustment pipeline that enhances a pretrained visual geometry transformer for stereo vision tasks.
Findings
StereoVGGT ranked first on the KITTI benchmark.
The method effectively mitigates geometric degradation during feature extraction.
It demonstrates the benefit of using pretrained 3D priors for stereo vision.
Abstract
Driven by the advancement of 3D devices, stereo vision tasks including stereo matching and stereo conversion have emerged as a critical research frontier. Contemporary stereo vision backbones typically rely on either monocular depth estimation (MDE) models or visual foundation models (VFMs). Crucially, these models are predominantly pretrained without explicit supervision of camera poses. Given that such geometric knowledge is indispensable for stereo vision, the absence of explicit spatial constraints constitutes a significant performance bottleneck for existing architectures. Recognizing that the Visual Geometry Grounded Transformer (VGGT) operates as a foundation model pretrained on extensive 3D priors, including camera poses, we investigate its potential as a robust backbone for stereo vision tasks. Nevertheless, empirical results indicate that its direct application to stereo…
