Improving Depth Gradient Continuity in Transformers: A Comparative Study on Monocular Depth Estimation with CNN

Jiawei Yao; Tong Wu; Xiaofeng Zhang

arXiv:2308.08333 · cs.CV · July 25, 2024 · 45 cites

Open Access

TL;DR

This paper compares Transformers and CNNs in monocular depth estimation, identifies their strengths and weaknesses, and introduces a novel Depth Gradient Refinement module and a loss function based on optimal transport to improve Transformer performance.
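As a rough illustration of what an optimal-transport loss measures, the sketch below computes the 1-D Wasserstein-1 distance between two histograms on shared, equally spaced bins, where the transport cost equals the area between the two cumulative distributions. It is not the paper's loss: how the authors turn depth maps into distributions is not stated in this summary, and the function names and the sample histograms are hypothetical.

```python
from itertools import accumulate


def normalize(values):
    """Scale non-negative values so they sum to 1 (a discrete distribution)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


def wasserstein_1d(p, q, bin_width=1.0):
    """Wasserstein-1 distance between two histograms on the same bins.

    In 1-D the optimal transport cost equals the area between the two CDFs.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    cdf_p = accumulate(normalize(p))
    cdf_q = accumulate(normalize(q))
    return bin_width * sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


# Hypothetical depth histograms (pixel counts per depth bin), illustrative only.
predicted = [0, 4, 4, 0]
target = [4, 4, 0, 0]
loss = wasserstein_1d(predicted, target)
```

Here every unit of mass in `predicted` sits one bin further away than in `target`, so the distance comes out as one bin width. Unlike a per-bin difference, the value grows with how far the mass has to move, which is the property that makes transport distances attractive as losses.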

Contribution

The paper presents a Depth Gradient Refinement (DGR) module and an optimal-transport-based loss function that enhance Transformer-based depth estimation without added complexity.

Findings

01. Transformers excel in global context and texture handling.

02. CNNs better preserve depth gradient continuity.

03. The proposed methods improve accuracy on KITTI and NYU-Depth-v2 datasets.

Abstract

Monocular depth estimation is an ongoing challenge in computer vision. Recent progress with Transformer models has demonstrated notable advantages over conventional CNNs in this area. However, there is still a gap in understanding how these models prioritize different regions in 2D images and how these regions affect depth estimation performance. To explore the differences between Transformers and CNNs, we employ a sparse pixel approach for a contrastive analysis of the two. Our findings suggest that while Transformers excel in handling global context and intricate textures, they lag behind CNNs in preserving depth gradient continuity. To further enhance the performance of Transformer models in monocular depth estimation, we propose the Depth Gradient Refinement (DGR) module that refines depth estimation through high-order differentiation, feature fusion, and…
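One simple way to picture "depth gradient continuity" is to look at finite differences of depth along a scanline: the first difference approximates the gradient, and the second difference shows how abruptly that gradient changes. The sketch below is an illustrative measure under that assumption, not the DGR module or any metric from the paper; the function names and the depth values are hypothetical.

```python
def diff(row):
    """Forward finite difference of a 1-D list of depth values."""
    return [b - a for a, b in zip(row, row[1:])]


def gradient_roughness(row):
    """Mean absolute second-order difference along one scanline.

    A ramp with constant slope scores 0; a jagged profile scores higher.
    """
    second = diff(diff(row))
    if not second:
        raise ValueError("need at least three samples")
    return sum(abs(v) for v in second) / len(second)


# Hypothetical depth values (metres) along one image row.
smooth_ramp = [1.0, 1.5, 2.0, 2.5, 3.0]
jagged_ramp = [1.0, 1.75, 2.0, 2.25, 3.0]

smooth_score = gradient_roughness(smooth_ramp)
jagged_score = gradient_roughness(jagged_ramp)
```

Both profiles start and end at the same depths, but only the first has a constant gradient, so it scores zero while the second does not. A prediction that covers the right depth range can therefore still have a poorly behaved gradient.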

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Layer Normalization · Softmax · Absolute Position Encodings · Residual Connection · Dense Connections · Dropout