What Makes for Hierarchical Vision Transformer?
Yuxin Fang, Xinggang Wang, Rui Wu, Wenyu Liu

TL;DR
This paper investigates which architectural factors drive the success of hierarchical Vision Transformers. By replacing self-attention with simple linear alternatives, it shows that the macro architecture matters more than the specific self-attention mechanism.
Contribution
The paper introduces LinMapper, a simple architecture built on linear mapping layers that rivals self-attention models, which suggests that the macro architecture matters more than the specific aggregation method.
Findings
LinMapper achieves strong ImageNet-1k performance.
LinMapper transfers well to object detection and segmentation.
Alternative content aggregation methods yield similar results.
Abstract
Recent studies indicate that hierarchical Vision Transformer with a macro architecture of interleaved non-overlapped window-based self-attention and shifted-window operation is able to achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other kinds of cross-window communication paradigms, while treating self-attention as the de-facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting…
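The idea in the abstract, aggregating information inside non-overlapping windows with a plain linear map instead of self-attention and then shifting the windows so neighbouring windows can exchange information, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`window_partition`, `linear_token_mixing`, `cyclic_shift`) are hypothetical, the exact form of the LinMapper layer is assumed to be a fixed token-mixing matrix per window, and the averaging matrix below is only a stand-in for learned weights.

```python
def window_partition(feature_map, window):
    """Split an H x W grid of channel vectors into non-overlapping
    window x window tiles; each tile is a flat list of tokens."""
    height, width = len(feature_map), len(feature_map[0])
    assert height % window == 0 and width % window == 0
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tokens = [feature_map[top + i][left + j]
                      for i in range(window) for j in range(window)]
            windows.append(tokens)
    return windows


def linear_token_mixing(tokens, weight):
    """Mix the N tokens of one window with an N x N matrix.
    Assumption: unlike an attention matrix, `weight` is a parameter
    and does not depend on the input tokens."""
    n, channels = len(tokens), len(tokens[0])
    return [[sum(weight[i][k] * tokens[k][c] for k in range(n))
             for c in range(channels)]
            for i in range(n)]


def cyclic_shift(feature_map, shift):
    """Roll the grid up and to the left by `shift` positions so that the
    next window partition straddles the previous window borders."""
    height, width = len(feature_map), len(feature_map[0])
    return [[feature_map[(r + shift) % height][(c + shift) % width]
             for c in range(width)]
            for r in range(height)]


# Toy input: a 4 x 4 grid with one channel holding the values 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]

# Stand-in for a learned matrix: every token becomes the window average.
averaging = [[0.25] * 4 for _ in range(4)]

mixed = [linear_token_mixing(w, averaging) for w in window_partition(grid, 2)]
shifted = cyclic_shift(grid, 1)
mixed_after_shift = [linear_token_mixing(w, averaging)
                     for w in window_partition(shifted, 2)]
```

In this sketch the first pass mixes tokens only within each 2 x 2 window, while the second pass, applied after the cyclic shift, mixes tokens that previously sat in different windows. Alternating the two is the macro pattern the abstract refers to.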
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Adam · Layer Normalization · Byte Pair Encoding
