SwinDepth: Unsupervised Depth Estimation using Monocular Sequences via Swin Transformer and Densely Cascaded Network
Dongseok Shim, H. Jin Kim

TL;DR
This paper introduces SwinDepth, an unsupervised monocular depth estimation method that uses a Swin Transformer for feature extraction and a densely cascaded network for multi-scale depth prediction, and reports better results than prior unsupervised methods on KITTI and Make3D.
Contribution
It proposes a novel architecture combining Swin Transformer and densely cascaded connections for improved unsupervised depth estimation from monocular sequences.
Findings
Outperforms state-of-the-art unsupervised methods on KITTI and Make3D datasets.
Utilizes a convolution-free Swin Transformer for better feature representation.
Densely cascaded network enhances multi-scale depth prediction quality.
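As a rough illustration of what dense cascading across scales can look like, the sketch below fuses a coarse-to-fine feature pyramid so that every level receives the already-fused output of all coarser levels. This is not the paper's network: the names `upsample_nearest` and `dense_cascade` are hypothetical, the factor-of-two spacing between levels is an assumption, and plain averaging stands in for the learned fusion layers a real model would use.

```python
from typing import List

Map = List[List[float]]  # a single-channel 2D feature map


def upsample_nearest(feature: Map, factor: int) -> Map:
    """Nearest-neighbour upsampling: repeat every cell `factor` times per axis."""
    out: Map = []
    for row in feature:
        wide = [value for value in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def dense_cascade(features: List[Map]) -> List[Map]:
    """Fuse a pyramid ordered coarse to fine (assumed: each level doubles resolution).

    Each level averages its own features with the upsampled fused outputs of
    ALL coarser levels. Averaging is a placeholder for learned fusion.
    """
    outputs: List[Map] = []
    for level, feature in enumerate(features):
        sources = [feature]
        for prev_level, prev in enumerate(outputs):
            sources.append(upsample_nearest(prev, 2 ** (level - prev_level)))
        height, width = len(feature), len(feature[0])
        fused = [
            [sum(src[i][j] for src in sources) / len(sources) for j in range(width)]
            for i in range(height)
        ]
        outputs.append(fused)
    return outputs


# Toy pyramid: 1x1, 2x2, and 4x4 maps with constant values.
pyramid = [
    [[4.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0] * 4 for _ in range(4)],
]
fused = dense_cascade(pyramid)
```

In this toy setup the finest level depends on both coarser levels directly rather than only on its immediate neighbour, which is the property the word "densely" is taken to mean here.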
Abstract
Monocular depth estimation plays a critical role in various computer vision and robotics applications such as localization, mapping, and 3D object detection. Recently, learning-based algorithms have achieved huge success in depth estimation by training models on large amounts of data in a supervised manner. However, it is challenging to acquire dense ground truth depth labels for supervised training, and unsupervised depth estimation using monocular sequences has emerged as a promising alternative. Unfortunately, most studies on unsupervised depth estimation explore loss functions or occlusion masks, and model architecture has changed little: the ConvNet-based encoder-decoder structure has become the de facto standard for depth estimation. In this paper, we employ a convolution-free Swin Transformer as an image feature extractor so that the network can capture both local geometric…
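The Swin Transformer mentioned above computes self-attention inside non-overlapping local windows and alternates between a regular and a shifted window layout so that information can cross window borders. The sketch below shows only that partitioning step on a toy grid of token ids. It is a simplified illustration, not the authors' code: the helper names `window_partition` and `cyclic_shift` are hypothetical, and the attention computation and the masking applied to wrapped-around windows are omitted.

```python
from typing import List

Grid = List[List[int]]  # an H x W grid of token ids


def window_partition(grid: Grid, window: int) -> List[Grid]:
    """Split the grid into non-overlapping window x window tiles, row-major."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("height and width must be multiples of the window size")
    tiles: List[Grid] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in grid[top:top + window]])
    return tiles


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the grid up and left by `shift` cells, wrapping around the edges."""
    rolled_rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rolled_rows]


# 4 x 4 grid of token ids 0..15, partitioned with a window size of 2.
tokens = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(tokens, 2)
shifted = window_partition(cyclic_shift(tokens, 1), 2)
```

With the regular layout the first window holds tokens 0, 1, 4, and 5. After shifting by one cell the first window holds 5, 6, 9, and 10, so tokens that sat in four different windows now attend to each other.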
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Dropout · Adam · Stochastic Depth · Byte Pair Encoding · Residual Connection · Label Smoothing · Dense Connections
