Vision Transformers for Dense Prediction

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

arXiv:2103.13413·cs.CV·March 26, 2021

TL;DR

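This paper introduces dense vision transformers, which use a vision transformer instead of a convolutional network as the backbone for dense prediction. The resulting predictions are finer-grained and more globally coherent than those of fully-convolutional networks, with the largest gains when a large amount of training data is available.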

Contribution

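The paper proposes a dense vision transformer architecture whose backbone processes representations at a constant, relatively high resolution and has a global receptive field at every stage. Tokens from several transformer stages are assembled into image-like representations and combined by a convolutional decoder, improving dense prediction performance over fully-convolutional networks.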

Findings

01 Up to 28% relative improvement in monocular depth estimation compared with a state-of-the-art fully-convolutional network.

02 New state-of-the-art results on ADE20K semantic segmentation.

03 Effective fine-tuning on smaller datasets such as NYUv2 and KITTI.

Abstract

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe…
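The token-to-image step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the assumption that the first token is a non-spatial readout token that is simply dropped, the nearest-neighbour resampling, and the plain addition used in place of a learned convolutional decoder are all simplifications chosen for illustration.

```python
import numpy as np


def tokens_to_feature_map(tokens, image_hw, patch_size, has_readout_token=True):
    """Reshape a (num_tokens, dim) sequence into a (dim, H/p, W/p) feature map."""
    height, width = image_hw
    grid_h, grid_w = height // patch_size, width // patch_size
    if has_readout_token:
        # Assumption: token 0 carries no spatial position, so it is dropped here.
        tokens = tokens[1:]
    if tokens.shape[0] != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    fmap = tokens.reshape(grid_h, grid_w, -1)  # row-major patch order
    return fmap.transpose(2, 0, 1)             # channels first


def resample_nearest(fmap, scale):
    """Nearest-neighbour resampling: repeat for scale >= 1, stride for scale < 1."""
    if scale >= 1:
        factor = int(scale)
        return fmap.repeat(factor, axis=1).repeat(factor, axis=2)
    stride = int(round(1 / scale))
    return fmap[:, ::stride, ::stride]


def reassemble_pyramid(stage_tokens, image_hw, patch_size, scales):
    """Turn tokens from several stages into maps at different resolutions."""
    return [
        resample_nearest(tokens_to_feature_map(t, image_hw, patch_size), s)
        for t, s in zip(stage_tokens, scales)
    ]


def fuse_coarse_to_fine(maps):
    """Combine maps ordered fine -> coarse, each half the size of the previous.

    Plain addition stands in for the convolutional decoder (a simplification).
    """
    fused = maps[-1]
    for finer in reversed(maps[:-1]):
        fused = resample_nearest(fused, 2) + finer
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical setup: 64x64 image, 16x16 patches, 8-dim tokens, 4 stages.
    stages = [rng.normal(size=(17, 8)) for _ in range(4)]
    pyramid = reassemble_pyramid(stages, (64, 64), 16, scales=(4, 2, 1, 0.5))
    print([m.shape for m in pyramid])
    print(fuse_coarse_to_fine(pyramid).shape)
```

Because every stage of the transformer operates on the same token grid, the different resolutions in this sketch come only from the resampling step, which is why one reshape routine can serve all stages.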

Taxonomy

Methods: Linear Layer · Convolution · Residual Connection · Layer Normalization · Dense Prediction Transformer · Dense Connections · Softmax · Multi-Head Attention · Attention Is All You Need