UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching

Soomin Kim; Hyesong Choi; Jihye Ahn; Dongbo Min

arXiv:2409.02545 · cs.CV · September 5, 2024

TL;DR

UniTT-Stereo unifies self-supervised and supervised training for Transformer-based stereo matching, using feature reconstruction and adaptive masking to improve results on the ETH3D, KITTI 2012, and KITTI 2015 benchmarks.

Contribution

It is the first to unify self-supervised pre-training with supervised stereo matching training for Transformer architectures, enhancing data efficiency and accuracy.

Findings

01. Achieves state-of-the-art results on ETH3D, KITTI 2012, and KITTI 2015 datasets.

02. Demonstrates the effectiveness of feature reconstruction and adaptive masking in limited data scenarios.

03. Provides insights into the locality inductive bias through frequency and attention map analysis.

Abstract

Unlike other vision tasks where Transformer-based approaches are becoming increasingly common, stereo depth estimation is still dominated by convolution-based approaches. This is mainly due to the limited availability of real-world ground truth for stereo matching, which is a limiting factor in improving the performance of Transformer-based stereo approaches. In this paper, we propose UniTT-Stereo, a method to maximize the potential of Transformer-based stereo architectures by unifying the self-supervised learning used for pre-training with a stereo matching framework based on supervised learning. Specifically, we explore the effectiveness of reconstructing features of masked portions in an input image while at the same time predicting corresponding points in another image, from the perspective of locality inductive bias, which is crucial in training models with limited training data.…
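
The idea of reconstructing features at masked positions can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the uniform random patch masking below stands in for the adaptive masking strategy, whose details are not given here, and the names `random_patch_mask` and `masked_feature_loss`, the patch size, and the mean-squared-error loss are all assumptions made for illustration.

```python
import numpy as np


def random_patch_mask(height, width, patch, ratio, rng):
    """Return a boolean (height, width) mask; True marks masked pixels.

    Hypothetical helper: masks a fixed fraction of non-overlapping
    patches uniformly at random (not the paper's adaptive strategy).
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    grid_h, grid_w = height // patch, width // patch
    n_patches = grid_h * grid_w
    n_masked = int(round(ratio * n_patches))

    # Pick which patches to mask, without replacement.
    flat = np.zeros(n_patches, dtype=bool)
    flat[rng.choice(n_patches, size=n_masked, replace=False)] = True
    grid = flat.reshape(grid_h, grid_w)

    # Expand the patch grid back to pixel resolution.
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)


def masked_feature_loss(pred, target, mask):
    """Mean squared error between (H, W, C) feature maps, masked positions only."""
    if not mask.any():
        return 0.0
    sq_err = (pred - target) ** 2
    return float(sq_err[mask].mean())


rng = np.random.default_rng(0)
mask = random_patch_mask(height=32, width=64, patch=8, ratio=0.5, rng=rng)

# Stand-in feature maps; in practice these would come from a network.
target = np.zeros((32, 64, 4))
pred = np.ones((32, 64, 4))
loss = masked_feature_loss(pred, target, mask)
```

In a unified training setup of the kind the abstract describes, a loss like this would be combined with a supervised disparity loss; how the two terms are weighted is not specified in the text above.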

Taxonomy

Topics: Image and Signal Denoising Methods · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Softmax · Attention Is All You Need