TL;DR
This paper introduces dense vision transformers that replace convolutional backbones for dense prediction tasks, achieving superior accuracy and global coherence, especially with large datasets.
Contribution
It proposes the dense vision transformer, an architecture whose transformer backbone processes representations at a constant, relatively high resolution and has a global receptive field at every stage, improving dense prediction performance over fully-convolutional networks.
Findings
Up to 28% relative improvement in monocular depth estimation over a state-of-the-art fully-convolutional network.
New state-of-the-art results on ADE20K semantic segmentation.
Effective fine-tuning on smaller datasets like NYUv2 and KITTI.
Abstract
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe…
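The reassembly step described in the abstract, turning a flat sequence of transformer tokens into an image-like representation, can be illustrated with a minimal sketch. Everything named here (`reassemble_tokens`, `upsample_nearest`, the `has_readout` flag) is hypothetical and not taken from the authors' code. The sketch assumes that any extra non-patch token is simply dropped, and it uses nearest-neighbour upsampling as a stand-in for whatever learned resampling the actual decoder applies.

```python
# Illustrative sketch only: function names and simplifications are assumptions,
# not the paper's implementation.

def reassemble_tokens(tokens, grid_h, grid_w, has_readout=True):
    """Arrange a flat token sequence as a 2D grid of feature vectors.

    tokens: list of feature vectors (lists of floats) in row-major patch
            order, preceded by one extra token if has_readout is True.
    Returns a nested list indexed as grid[row][col] -> feature vector.
    """
    # Assumption: the extra (non-patch) token is discarded.
    patch_tokens = tokens[1:] if has_readout else tokens
    if len(patch_tokens) != grid_h * grid_w:
        raise ValueError("token count does not match grid size")
    # Slice the sequence into rows of grid_w tokens each.
    return [patch_tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of the grid by an integer factor."""
    out = []
    for row in grid:
        # Repeat every column `factor` times, then repeat the row `factor` times.
        wide = [vec for vec in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out
```

For example, seven tokens (one extra token plus a 2×3 patch grid) become a 2×3 grid, which `upsample_nearest(grid, 2)` expands to 4×6. In the real architecture, grids built at several resolutions from different transformer stages would then be fused by the convolutional decoder, which this sketch does not attempt to model.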
Code & Models
- 🤗 Intel/dpt-large · model · 85k dl · ♡ 204
- 🤗 Intel/dpt-hybrid-midas · model · 496k dl · ♡ 105
- 🤗 Intel/ldm3d · model · 28 dl · ♡ 64
- 🤗 Intel/dpt-large-ade · model · 3.3k dl · ♡ 13
- 🤗 kiheh85202/yolo · model · 10 dl · ♡ 1
- 🤗 Intel/ldm3d-4c · model · 157 dl · ♡ 45
- 🤗 facebook/dpt-dinov2-small-nyu · model · 135 dl · ♡ 3
- 🤗 facebook/dpt-dinov2-small-kitti · model · 335 dl · ♡ 8
- 🤗 facebook/dpt-dinov2-base-kitti · model · 88 dl · ♡ 2
- 🤗 facebook/dpt-dinov2-base-nyu · model · 369 dl
Taxonomy
Methods: Linear Layer · Convolution · Residual Connection · Layer Normalization · Dense Prediction Transformer · Dense Connections · Softmax · Multi-Head Attention · Attention Is All You Need
