MaDis-Stereo: Enhanced Stereo Matching via Distilled Masked Image Modeling
Jihye Ahn, Hyesong Choi, Soomin Kim, Dongbo Min

TL;DR
MaDis-Stereo introduces a training approach for Transformer-based stereo matching that combines Masked Image Modeling with knowledge distillation, and reports state-of-the-art results on ETH3D and KITTI 2015.
Contribution
The paper proposes MaDis-Stereo, a method that strengthens Transformer-based stereo models by combining Masked Image Modeling (MIM) with knowledge distillation based on an exponential moving average (EMA), in order to address data scarcity and improve accuracy.
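The EMA-based distillation mentioned above can be illustrated with a minimal sketch. It assumes the common teacher-student convention in which the teacher's parameters are an exponential moving average of the student's; the function name `ema_update`, the parameter layout, and the momentum value are hypothetical and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Return new teacher parameters as an EMA of the student parameters.

    Both arguments map a parameter name to a flat list of floats.
    The momentum value is illustrative, not the paper's setting.
    """
    if teacher.keys() != student.keys():
        raise ValueError("teacher and student must share parameter names")
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy example: one parameter tensor with two entries.
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and so provides a smoother target than the student's own rapidly changing weights, which is the usual motivation for this kind of update.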
Findings
Achieves state-of-the-art results on the ETH3D and KITTI 2015 datasets.
Strengthens locality inductive bias through MIM, as examined via attention distance measurement.
Demonstrates improved training stability with a dual-network distillation approach.
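As a rough illustration of what an attention distance measurement computes, the sketch below uses the definition common in Vision Transformer analyses: for each query token, the spatial distance to every key token is averaged using the attention weights, and the result is averaged over queries. The exact metric used in the paper is not given here, so the function name, the patch-unit distance, and the toy inputs are all assumptions.

```python
import math


def mean_attention_distance(attn, grid_w):
    """Attention-weighted mean distance (in patch units) over all queries.

    attn   : square list of lists; attn[q][k] is the weight query q
             assigns to key k, with each row summing to 1.
    grid_w : number of patches per row of the token grid.
    """
    n = len(attn)
    total = 0.0
    for q, row in enumerate(attn):
        qy, qx = divmod(q, grid_w)
        for k, w in enumerate(row):
            ky, kx = divmod(k, grid_w)
            total += w * math.hypot(qx - kx, qy - ky)
    return total / n


# Toy 2x2 token grid (4 tokens).
identity = [[1.0 if q == k else 0.0 for k in range(4)] for q in range(4)]
uniform = [[0.25] * 4 for _ in range(4)]

local_distance = mean_attention_distance(identity, grid_w=2)
global_distance = mean_attention_distance(uniform, grid_w=2)
```

A smaller value means attention is concentrated on nearby patches, which is how a stronger locality bias would show up under this definition.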
Abstract
In stereo matching, CNNs have traditionally served as the predominant architectures. Although Transformer-based stereo models have been studied recently, their performance still lags behind CNN-based stereo models due to the inherent data scarcity issue in the stereo matching task. In this paper, we propose a Masked Image Modeling Distilled Stereo matching model, termed MaDis-Stereo, that enhances locality inductive bias by leveraging Masked Image Modeling (MIM) in training a Transformer-based stereo model. Given randomly masked stereo images as inputs, our method attempts to conduct both image reconstruction and depth prediction tasks. While this strategy is beneficial to resolving the data scarcity issue, the dual task of reconstructing masked tokens and subsequently performing stereo matching poses significant challenges, particularly in terms of training stability. To address this,…
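The random masking of stereo inputs described in the abstract can be sketched as follows. This is only an illustration: the patch count, the mask ratio, and the choice to draw an independent mask for each view are assumptions, and `random_patch_mask` is a hypothetical helper rather than the authors' code.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a list of booleans; True marks a patch that is masked out."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Assumed setup: a 4x4 patch grid per image and half of the patches masked.
rng = random.Random(0)
left_mask = random_patch_mask(16, 0.5, rng)
right_mask = random_patch_mask(16, 0.5, rng)

# Indices of the patches that remain visible to the encoder in each view.
left_visible = [i for i, m in enumerate(left_mask) if not m]
right_visible = [i for i, m in enumerate(right_mask) if not m]
```

In a setup like this, the masked patches would be the reconstruction targets, while the depth prediction has to be made from the visible patches of both views.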
Taxonomy
Methods: Softmax · Attention Is All You Need
