Global Context Vision Transformers

Ali Hatamizadeh, Hongxu Yin, Greg Heinrich, Jan Kautz, and Pavlo Molchanov

arXiv:2206.09959 · cs.CV · June 7, 2023 · 34 cites

TL;DR

The paper introduces GC ViT, a global context vision transformer that models both long- and short-range spatial interactions efficiently. It reports state-of-the-art results in image classification, object detection, and semantic segmentation without expensive operations such as computing attention masks or shifting local windows.

Contribution

The paper presents an architecture that combines global context self-attention with standard local self-attention, and uses modified fused inverted residual blocks to address the lack of inductive bias in ViTs, improving parameter and compute utilization.

Findings

01

The largest GC ViT variant (201M parameters) achieves 85.7% Top-1 accuracy on ImageNet-1K without pre-training.

02

Pre-trained GC ViT outperforms prior models in object detection and segmentation tasks.

03

GC ViT surpasses CNN-based and other ViT-based models in multiple vision benchmarks.

Abstract

We propose global context vision transformer (GC ViT), a novel architecture that enhances parameter and compute utilization for computer vision. Our method leverages global context self-attention modules, jointly with standard local self-attention, to effectively and efficiently model both long- and short-range spatial interactions, without the need for expensive operations such as computing attention masks or shifting local windows. In addition, we address the lack of inductive bias in ViTs, and propose to leverage modified fused inverted residual blocks in our architecture. Our proposed GC ViT achieves state-of-the-art results across image classification, object detection and semantic segmentation tasks. On the ImageNet-1K dataset for classification, the variants of GC ViT with 51M, 90M and 201M parameters achieve 84.3%, 85.0% and 85.7% Top-1 accuracy, respectively, at 224 image…
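
The pairing of local and global context self-attention described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how global queries are produced, so the average-pooling query generator here is a stand-in assumption. All function names and the single-head, bias-free projections (`wq`, `wk`, `wv`) are hypothetical. The sketch only aims to show how every window can attend with shared, image-wide queries without attention masks or window shifting.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, ws):
    """Split an (H, W, C) map into non-overlapping windows: (num_windows, ws*ws, C)."""
    H, W, C = x.shape
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def attention(q, k, v):
    """Scaled dot-product attention; q may broadcast over the window axis."""
    scale = q.shape[-1] ** -0.5
    weights = softmax((q @ k.transpose(0, 2, 1)) * scale)
    return weights @ v

def local_window_attention(x, ws, wq, wk, wv):
    """Standard local self-attention: queries, keys, values all come from the same window."""
    win = window_partition(x, ws)
    return attention(win @ wq, win @ wk, win @ wv)

def global_query_window_attention(x, ws, wq, wk, wv):
    """Global context attention (illustrative): one set of queries summarizing the
    whole map is shared by all windows; keys and values stay local to each window."""
    H, W, C = x.shape
    # Assumption: average-pool the full map down to ws x ws tokens to form global queries.
    pooled = x.reshape(ws, H // ws, ws, W // ws, C).mean(axis=(1, 3))
    q_global = pooled.reshape(1, ws * ws, C) @ wq  # shape (1, ws*ws, C)
    win = window_partition(x, ws)
    return attention(q_global, win @ wk, win @ wv)

rng = np.random.default_rng(0)
C, ws = 4, 4
feature_map = rng.standard_normal((8, 8, C))
wq, wk, wv = (rng.standard_normal((C, C)) for _ in range(3))

local_out = local_window_attention(feature_map, ws, wq, wk, wv)
global_out = global_query_window_attention(feature_map, ws, wq, wk, wv)
print(local_out.shape, global_out.shape)
```

Both calls return one output token per window position, so under these assumptions a block could alternate the two attention types without changing tensor shapes.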

Code & Models

Videos

Global Context Vision Transformers · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · Linear Layer · Balanced Selection · Dropout · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing