Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN
Jiawei Yao, Tong Wu, Xiaofeng Zhang

TL;DR
This paper compares Transformers and CNNs for monocular depth estimation, identifies the strengths and weaknesses of each, and introduces a Depth Gradient Refinement (DGR) module and an optimal-transport-based loss function to improve Transformer performance.
Contribution
The paper presents a Depth Gradient Refinement (DGR) module and an optimal-transport-based loss function that improve Transformer-based depth estimation without added complexity.
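To give a feel for what an optimal transport distance measures, here is a minimal sketch and not the authors' loss. It assumes the simplest possible setting: two depth maps are reduced to histograms of depth values over the same equally spaced bins, and the 1D Wasserstein-1 distance is computed from the difference of their cumulative distributions. The function names `depth_histogram` and `wasserstein_1d`, the binning scheme, and the `max_depth` cap are all hypothetical choices made for illustration.

```python
from itertools import accumulate


def depth_histogram(depths, bins, max_depth):
    """Count depth values into `bins` equal-width bins over [0, max_depth)."""
    counts = [0] * bins
    for d in depths:
        if 0.0 <= d < max_depth:
            # min() guards against floating-point rounding up to `bins`.
            index = min(int(d / max_depth * bins), bins - 1)
            counts[index] += 1
    return counts


def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance between two histograms on the same bins.

    In one dimension the optimal transport cost equals the area between
    the two cumulative distribution functions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Normalize to probability distributions, then accumulate to CDFs.
    cdf_p = accumulate(x / total_p for x in p)
    cdf_q = accumulate(x / total_q for x in q)
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


if __name__ == "__main__":
    predicted = depth_histogram([0.5, 1.5, 1.7, 9.9], bins=10, max_depth=10.0)
    target = depth_histogram([0.4, 1.6, 2.5, 9.8], bins=10, max_depth=10.0)
    print(wasserstein_1d(predicted, target, bin_width=1.0))
```

Unlike a per-pixel error, this distance grows with how far mass has to move between bins, so a prediction that is slightly off in depth is penalized less than one that is far off. A training loss would need a differentiable formulation on tensors, which this sketch does not attempt.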
Findings
Transformers excel in global context and texture handling.
CNNs better preserve depth gradient continuity.
The proposed methods improve accuracy on the KITTI and NYU-Depth-v2 datasets.
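One way to make "depth gradient continuity" concrete is to look at finite differences of a depth map. The sketch below is an illustration only, not the paper's DGR module or its sparse pixel analysis. It assumes a depth map stored as a list of rows with at least three rows and three columns, and the helper names (`horizontal_diff`, `vertical_diff`, `second_order_roughness`) are hypothetical. A smoothly sloping surface has constant first-order differences and zero second-order differences, while an abrupt jump produces large second-order values.

```python
def horizontal_diff(grid):
    """First-order finite differences along each row."""
    return [[row[j + 1] - row[j] for j in range(len(row) - 1)] for row in grid]


def vertical_diff(grid):
    """First-order finite differences down each column."""
    return [
        [grid[i + 1][j] - grid[i][j] for j in range(len(grid[0]))]
        for i in range(len(grid) - 1)
    ]


def mean_abs(grid):
    """Mean absolute value over all entries of a 2D list."""
    values = [abs(v) for row in grid for v in row]
    return sum(values) / len(values)


def second_order_roughness(depth):
    """Sum of mean absolute second-order differences in x and y.

    Applying a first-order difference twice gives the second-order
    difference, which is zero wherever the depth gradient is constant.
    """
    dxx = horizontal_diff(horizontal_diff(depth))
    dyy = vertical_diff(vertical_diff(depth))
    return mean_abs(dxx) + mean_abs(dyy)


if __name__ == "__main__":
    # A planar ramp: depth increases by 1 per column.
    ramp = [[float(j) for j in range(5)] for _ in range(5)]
    # A step edge: depth jumps from 0 to 5 between columns 2 and 3.
    step = [[0.0, 0.0, 0.0, 5.0, 5.0] for _ in range(5)]
    print(second_order_roughness(ramp))
    print(second_order_roughness(step))
```

Real depth maps contain genuine discontinuities at object boundaries, so a raw roughness score like this cannot by itself separate true edges from prediction artifacts. It only shows which quantity higher-order differentiation exposes.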
Abstract
Monocular depth estimation is an ongoing challenge in computer vision. Recent progress with Transformer models has demonstrated notable advantages over conventional CNNs in this area. However, there's still a gap in understanding how these models prioritize different regions in 2D images and how these regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we employ a sparse pixel approach to contrastively analyze the distinctions between the two. Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further enhance the performance of Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and…
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Absolute Position Encodings · Residual Connection · Dense Connections · Dropout
