Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, Elisa Ricci

TL;DR
This paper introduces TransDepth, a novel architecture combining CNNs and transformers with a gated attention decoder, achieving state-of-the-art results in continuous pixel-wise prediction tasks like depth and surface normal estimation.
Contribution
The authors present this as the first work to apply transformers to pixel-wise prediction problems with continuous labels, and they add a gated attention decoder so the network keeps its ability to capture local detail.
Findings
Achieves state-of-the-art performance on three datasets
Effectively models long-range dependencies in pixel-wise tasks
Demonstrates the benefit of combining CNNs and transformers
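To make the long-range dependency point concrete, here is a minimal, illustrative sketch in plain Python. It is not the TransDepth implementation: the identity query/key/value projections, the `local_average` stand-in for a convolution, and the toy `tokens` values are all assumptions chosen for brevity. It only shows that a self-attention output at one position can depend on a distant position, while a small local window cannot.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head self-attention with identity Q/K/V projections
    # (an assumption; real models learn these projections).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def local_average(tokens, radius=1):
    # Hypothetical stand-in for a convolution: each position only
    # sees neighbours within `radius`.
    d = len(tokens[0])
    out = []
    for i in range(len(tokens)):
        window = tokens[max(0, i - radius): i + radius + 1]
        out.append([sum(v[j] for v in window) / len(window) for j in range(d)])
    return out

# Toy "feature map" flattened to 6 positions with 2 channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]

attended = self_attention(tokens)  # every output mixes all 6 positions
local = local_average(tokens)      # every output mixes at most 3 positions
```

Changing the feature at position 5 alters `attended[0]` but leaves `local[0]` untouched, which is the locality limitation the findings refer to. A hybrid design keeps a convolutional path for this kind of local detail and adds attention for global context.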
Abstract
While convolutional neural networks have shown a tremendous impact on various computer vision tasks, they generally demonstrate limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs attention mechanisms based on gates. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth…
Taxonomy
Topics: Advanced Neural Network Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
Methods: Convolution
