Glance-and-Gaze Vision Transformer

Qihang Yu; Yingda Xia; Yutong Bai; Yongyi Lu; Alan Yuille; Wei Shen

arXiv:2106.02277 · cs.CV · June 7, 2021 · 33 cites

TL;DR

The Glance-and-Gaze Transformer (GG-Transformer) introduces a dual-branch approach inspired by human visual behavior to efficiently model both global and local dependencies, reducing computational costs while improving performance on vision tasks.

Contribution

The paper proposes the GG-Transformer, an architecture with two parallel branches, one modeling global context and one modeling local context, to address the quadratic complexity of standard self-attention.

Findings

01 Achieves superior performance on vision benchmarks

02 Maintains linear complexity with dilated self-attention

03 Outperforms previous state-of-the-art Transformers
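To make the "dilated self-attention" finding concrete, here is a minimal sketch of how a feature map could be split into dilated partitions. It rests on an assumption not spelled out on this page: that each partition samples tokens at a fixed stride across the whole map, so a small partition still spans the full image. The function name `dilated_partitions` and the parameter `part` are hypothetical and are not taken from the official repository.

```python
def dilated_partitions(height, width, part):
    """Group the cells of a height x width grid into dilated partitions.

    Assumption (illustrative only): each partition holds part x part tokens
    sampled with stride (height // part, width // part), so every partition
    covers the whole grid while staying small.
    """
    if height % part or width % part:
        raise ValueError("grid size must be divisible by the partition size")
    stride_h, stride_w = height // part, width // part  # dilation rates
    partitions = []
    for off_h in range(stride_h):
        for off_w in range(stride_w):
            # Tokens that share the same offset modulo the stride go together.
            coords = [(off_h + i * stride_h, off_w + j * stride_w)
                      for i in range(part) for j in range(part)]
            partitions.append(coords)
    return partitions


parts = dilated_partitions(8, 8, 4)
print(len(parts), "partitions of", len(parts[0]), "tokens each")
print("first partition starts with:", parts[0][:4])
```

Self-attention would then run independently inside each partition. Because every partition has a fixed number of tokens, the total work grows with the number of partitions, that is, linearly in the number of tokens.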

Abstract

Recently, there emerges a series of vision Transformers, which show superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, the advantages of vision Transformers also come with a price: Self-attention, the core part of Transformer, has a quadratic complexity to the input sequence length. This leads to a dramatic increase of computation and memory cost with the increase of sequence length, thus introducing difficulties when applying Transformers to the vision tasks that require dense predictions based on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address the aforementioned issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing…
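The quadratic cost described in the abstract can be illustrated with a rough count of query-key pairs. This is a sketch under simplifying assumptions: it ignores the channel dimension, the number of heads, and constant factors, and the partition size of 49 tokens (7 x 7) is an illustrative choice, not a value quoted from the paper. The helper names are hypothetical.

```python
def full_attention_pairs(num_tokens):
    """Query-key pairs when every token attends to every other token."""
    return num_tokens * num_tokens


def partitioned_attention_pairs(num_tokens, tokens_per_partition):
    """Query-key pairs when attention is restricted to fixed-size partitions."""
    if num_tokens % tokens_per_partition:
        raise ValueError("token count must be divisible by the partition size")
    num_partitions = num_tokens // tokens_per_partition
    return num_partitions * tokens_per_partition ** 2


# Doubling the side of a square feature map quadruples the token count.
for side in (56, 112):
    n = side * side
    print(side, full_attention_pairs(n), partitioned_attention_pairs(n, 49))
```

Going from a 56 x 56 to a 112 x 112 feature map multiplies the full-attention count by 16 but the partitioned count only by 4, which is why dense-prediction tasks on high-resolution feature maps are the hard case for standard self-attention.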

Code & Models

Repositories

yucornetto/GG-Transformer · Official

Videos

Glance-and-Gaze Vision Transformer · slideslive

Taxonomy

Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax · Multi-Head Attention