2T-UNET: A Two-Tower UNet with Depth Clues for Robust Stereo Depth Estimation
Rohit Choudhary, Mansi Sharma, Rithvik Anil

TL;DR
The paper introduces 2T-UNet, a two-tower CNN that estimates depth from a stereo pair without an explicit stereo matching step. One tower takes the right image; the other takes the left image together with its monocular depth clue. The authors report that it outperforms existing methods.
Contribution
It proposes a new two-tower network architecture that replaces cost volume construction with twin convolutional towers and incorporates monocular depth clues for enhanced stereo depth estimation.
Findings
Outperforms state-of-the-art methods on the Scene Flow dataset
Effective on complex natural scenes
Suitable for real-time applications
Abstract
Stereo correspondence matching is an essential part of the multi-step stereo depth estimation process. This paper revisits the depth estimation problem, avoiding the explicit stereo matching step by using a simple two-tower convolutional neural network. The proposed algorithm is called 2T-UNet. The idea behind 2T-UNet is to replace cost volume construction with twin convolution towers, which are allowed to have different weights. Additionally, the inputs to the twin encoders in 2T-UNet differ from those of existing stereo methods. Generally, a stereo network takes a left and right image pair as input to determine the scene geometry. In the 2T-UNet model, however, the right stereo image is taken as one input, and the left stereo image, along with its monocular depth clue information, is taken as the other. Depth clues provide complementary suggestions…
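The input arrangement described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation: pointwise (1x1) convolutions stand in for the UNet encoders and decoders, and the channel sizes, the fusion by channel-wise concatenation, and all function names (`conv1x1`, `make_tower`, `two_tower_depth`) are assumptions made purely for illustration. It shows only the data flow: two towers with independent weights, one fed the right image and the other fed the left image stacked with its monocular depth clue, with no cost volume in between.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w, b):
    """Pointwise (1x1) convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

def relu(x):
    return np.maximum(x, 0.0)

def make_tower(c_in, c_hidden):
    """Create independent weights for one tower (hypothetical sizes)."""
    return {"w": rng.normal(0.0, 0.1, size=(c_hidden, c_in)),
            "b": np.zeros(c_hidden)}

def tower_forward(tower, x):
    return relu(conv1x1(x, tower["w"], tower["b"]))

def two_tower_depth(left_rgb, right_rgb, left_depth_clue, params):
    # Tower A sees only the right image (3 channels).
    feat_right = tower_forward(params["right_tower"], right_rgb)
    # Tower B sees the left image stacked with its depth clue (3 + 1 channels).
    left_input = np.concatenate([left_rgb, left_depth_clue[None]], axis=0)
    feat_left = tower_forward(params["left_tower"], left_input)
    # Assumed fusion: concatenate tower features, then a 1x1 head
    # that maps them to a single depth channel.
    fused = np.concatenate([feat_right, feat_left], axis=0)
    depth = conv1x1(fused, params["head"]["w"], params["head"]["b"])
    return depth[0]

H, W, C = 8, 16, 4  # toy image size and hidden width (assumptions)
params = {
    "right_tower": make_tower(3, C),   # weights are not shared ...
    "left_tower": make_tower(4, C),    # ... between the two towers
    "head": {"w": rng.normal(0.0, 0.1, size=(1, 2 * C)), "b": np.zeros(1)},
}
left = rng.random((3, H, W))
right = rng.random((3, H, W))
clue = rng.random((H, W))  # placeholder for a monocular depth clue

depth = two_tower_depth(left, right, clue, params)
print(depth.shape)  # (8, 16)
```

Because the towers receive inputs with different channel counts, their first-layer weights cannot be shared, which is consistent with the abstract's statement that the towers may have different weights. A real model would use spatial convolutions, downsampling, and skip connections, all of which this sketch omits.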
Taxonomy
Topics: Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques · Advanced Image Processing Techniques
Methods: Convolution
