MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

Haoyu Zhang; Jingyi Zhou; Peng Ye; Jiakang Yuan; Lin Zhang; Feng Xu; Tao Chen

arXiv:2604.20393·cs.CV·April 23, 2026

TL;DR

MLG-Stereo is a ViT-based stereo matching method that improves fine-detail prediction and robustness to input resolution by combining multi-stage local-global feature modeling with iterative disparity optimization.

Contribution

The paper proposes a pipeline-level design that extends global modeling beyond the encoder stage. Its components include a Multi-Granularity Feature Network, which balances global context with local geometric information, and a Local-Global Cost Volume.

Findings

01. Achieves state-of-the-art results on the Middlebury and KITTI-2015 datasets.
02. Demonstrates robustness to arbitrary-resolution images.
03. Outperforms existing ViT-based stereo matching methods.

Abstract

With the development of deep learning, ViT-based stereo matching methods have made significant progress due to their remarkable robustness and zero-shot ability. However, due to the limitations of ViTs in handling resolution sensitivity and their relative neglect of local information, the ability of ViT-based methods to predict details and handle arbitrary-resolution images is still weaker than that of CNN-based methods. To address these shortcomings, we propose MLG-Stereo, a systematic pipeline-level design that extends global modeling beyond the encoder stage. First, we propose a Multi-Granularity Feature Network to effectively balance global context and local geometric information, enabling comprehensive feature extraction from images of arbitrary resolution and bridging the gap between training and inference scales. Then, a Local-Global Cost Volume is constructed to capture both…
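
For readers unfamiliar with the term, a cost volume stores a matching score for every pixel and every candidate disparity. The sketch below is a generic dot-product correlation volume with a winner-take-all readout, written in plain Python for illustration. It is not the paper's Local-Global Cost Volume, whose description is cut off in the abstract above. The function names, the `[H][W][C]` nested-list layout, and the toy one-hot features are all assumptions made for this example.

```python
def correlation_cost_volume(left, right, max_disp):
    """Build volume[d][y][x] = dot(left[y][x], right[y][x - d]).

    left, right: feature maps as nested lists of shape [H][W][C]
    (hypothetical layout chosen for this sketch).
    Positions where x - d falls outside the image get a score of 0.0.
    """
    height, width = len(left), len(left[0])
    volume = []
    for d in range(max_disp + 1):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                if x - d < 0:
                    row.append(0.0)
                else:
                    f_l, f_r = left[y][x], right[y][x - d]
                    row.append(sum(a * b for a, b in zip(f_l, f_r)))
            plane.append(row)
        volume.append(plane)
    return volume


def winner_take_all(volume):
    """Pick, per pixel, the disparity with the highest correlation score.

    Ties resolve to the smallest disparity, because max() returns the
    first maximal candidate.
    """
    height, width = len(volume[0]), len(volume[0][0])
    return [
        [max(range(len(volume)), key=lambda d: volume[d][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Toy example: a single image row with one-hot features per pixel.
# The left view is the right view shifted by one pixel (true disparity 1);
# the leftmost pixel of the left view has no match and carries a zero feature.
right = [[[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 0.0],
          [0.0, 0.0, 0.0, 1.0]]]
left = [[[0.0, 0.0, 0.0, 0.0],
         [1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]]

volume = correlation_cost_volume(left, right, max_disp=2)
disparity = winner_take_all(volume)
print(disparity)  # [[0, 1, 1, 1]]
```

Learned stereo methods typically build such a volume from network features with a tensor library and refine the disparity iteratively rather than taking a hard argmax; the loops here only make the indexing explicit.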
