Glance-and-Gaze Vision Transformer
Qihang Yu, Yingda Xia, Yutong Bai, Yongyi Lu, Alan Yuille, Wei Shen

TL;DR
The Glance-and-Gaze Transformer (GG-Transformer) introduces a dual-branch approach inspired by human visual behavior to efficiently model both global and local dependencies, reducing computational costs while improving performance on vision tasks.
Contribution
It proposes a novel GG-Transformer architecture with parallel branches for global and local context modeling, addressing quadratic complexity issues of traditional self-attention.
Findings
Achieves superior performance on vision benchmarks
Keeps self-attention cost linear in the input sequence length by using dilated self-attention
Outperforms previous state-of-the-art Transformers
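The linear-complexity claim can be illustrated with a small counting sketch. This is not the authors' implementation. It assumes a fixed partition size m and a dilation stride derived from the grid size, and the function names (dilated_partitions, attention_pairs_full, attention_pairs_partitioned) are hypothetical, introduced only for this example.

```python
def dilated_partitions(height, width, m):
    """Split an H x W token grid into dilated partitions of m x m tokens.

    Assumption for this sketch: tokens in one partition are spaced
    H/m rows and W/m columns apart, so each partition spans the whole grid.
    """
    if height % m or width % m:
        raise ValueError("height and width must be divisible by m")
    stride_h, stride_w = height // m, width // m
    partitions = []
    for off_r in range(stride_h):
        for off_c in range(stride_w):
            partitions.append([
                (off_r + i * stride_h, off_c + j * stride_w)
                for i in range(m)
                for j in range(m)
            ])
    return partitions


def attention_pairs_full(height, width):
    """Query-key pairs when every token attends to every token: N^2."""
    n = height * width
    return n * n


def attention_pairs_partitioned(height, width, m):
    """Query-key pairs when attention stays inside each partition.

    There are N / m^2 partitions with (m^2)^2 pairs each, so N * m^2 in total.
    """
    n = height * width
    return n * m * m


if __name__ == "__main__":
    for side in (8, 16, 32):
        print(side * side,
              attention_pairs_full(side, side),
              attention_pairs_partitioned(side, side, 4))
```

With m fixed, quadrupling the number of tokens quadruples the partitioned cost, while the full-attention cost grows sixteenfold. Because each partition samples positions from across the whole grid, its tokens still cover a global extent.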
Abstract
Recently, a series of vision Transformers has emerged, showing superior performance with a more compact model size than conventional convolutional neural networks, thanks to the strong ability of Transformers to model long-range dependencies. However, these advantages come at a price: self-attention, the core component of the Transformer, has quadratic complexity in the input sequence length. This leads to a dramatic increase in computation and memory cost as the sequence length grows, making it difficult to apply Transformers to vision tasks that require dense predictions on high-resolution feature maps. In this paper, we propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer), to address these issues. It is motivated by the Glance and Gaze behavior of human beings when recognizing…
Taxonomy
Topics: Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax · Multi-Head Attention
