DEFOM-Stereo: Depth Foundation Model Based Stereo Matching
Hualie Jiang, Zhiqiang Lou, Laiyan Ding, Rui Xu, Minglang Tan, Wenjie Jiang, Rui Huang

TL;DR
DEFOM-Stereo introduces a novel stereo matching framework that leverages monocular depth cues from foundation models, significantly improving zero-shot generalization and achieving top benchmark performance.
Contribution
The paper presents DEFOM-Stereo, a new stereo matching framework integrating a monocular depth foundation model to enhance robustness and generalization in disparity estimation.
Findings
Outperforms state-of-the-art methods in zero-shot generalization
Achieves top results on KITTI, Middlebury, and ETH3D benchmarks
Outperforms previous models in the Robust Vision Challenge
Abstract
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges such as occlusion and textureless regions hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo matching framework, building a new framework for depth foundation model-based stereo matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity…
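The initialization and scale-update ideas described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes the monocular model outputs a relative inverse-depth map (larger values are closer), that a min-max normalization to a chosen maximum disparity is an acceptable starting guess, and that a single fixed scalar stands in for the paper's learned, recurrent scale update module. The function names `init_disparity_from_relative_depth` and `apply_scale_update` and the parameter `max_disp` are hypothetical.

```python
def init_disparity_from_relative_depth(rel_inv_depth, max_disp):
    """Map a relative inverse-depth map (list of rows) to an initial disparity.

    Assumption: min-max normalization to [0, max_disp]. The real method may
    use a different normalization; this only shows the idea of seeding the
    recurrent disparity from a monocular prediction.
    """
    flat = [v for row in rel_inv_depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # Constant map carries no ordering information: use the midpoint.
        return [[0.5 * max_disp for _ in row] for row in rel_inv_depth]
    return [[(v - lo) / span * max_disp for v in row] for row in rel_inv_depth]


def apply_scale_update(disparity, scale):
    """Rescale a disparity map by a positive factor.

    Assumption: a fixed scalar replaces the learned scale update, which in
    the paper is predicted by the network rather than supplied by hand.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    return [[d * scale for d in row] for row in disparity]


# Toy 2x2 relative inverse-depth map (hypothetical values).
rel = [[0.0, 1.0], [2.0, 4.0]]
init = init_disparity_from_relative_depth(rel, max_disp=64.0)
refined = apply_scale_update(init, scale=0.5)
print(init)
print(refined)
```

The point of the sketch is the division of labor: the monocular prediction supplies the relative ordering of the scene, while the scale correction compensates for the fact that relative depth carries no metric scale. Stereo matching cues are what would drive that correction in the actual framework.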
Taxonomy
Topics: Satellite Image Processing and Photogrammetry · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques
