What Makes for Hierarchical Vision Transformer?

Yuxin Fang; Xinggang Wang; Rui Wu; Wenyu Liu

arXiv:2107.02174 · cs.CV · September 13, 2021 · 1 citation

TL;DR

This paper investigates which architectural factors drive the success of hierarchical Vision Transformers. By replacing self-attention with simple linear alternatives, it shows that the macro architecture matters more than the specific self-attention mechanism.

Contribution

The paper introduces LinMapper, a simple architecture built on linear mapping layers that rivals self-attention models, which indicates that the macro architecture matters more than the specific aggregation method.

Findings

01. LinMapper achieves strong ImageNet-1k performance.

02. LinMapper transfers well to object detection and segmentation.

03. Alternative content aggregation methods yield similar results.

Abstract

Recent studies indicate that hierarchical Vision Transformer with a macro architecture of interleaved non-overlapped window-based self-attention and shifted-window operation is able to achieve state-of-the-art performance in various visual recognition tasks, and challenges the ubiquitous convolutional neural networks (CNNs) using densely slid kernels. Most follow-up works attempt to replace the shifted-window operation with other kinds of cross-window communication paradigms, while treating self-attention as the de-facto standard for window-based information aggregation. In this manuscript, we question whether self-attention is the only choice for hierarchical Vision Transformer to attain strong performance, and the effects of different kinds of cross-window communication. To this end, we replace self-attention layers with embarrassingly simple linear mapping layers, and the resulting…
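The idea of swapping window-based self-attention for a linear mapping can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the linear mapping mixes the tokens inside each non-overlapping window with a single input-independent weight matrix, and it assumes a Swin-style cyclic shift for cross-window communication. The function names `window_partition`, `window_merge`, and `window_linear_mapping` are hypothetical, and the masking that Swin applies to wrapped-around tokens is omitted for brevity.

```python
import numpy as np


def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns an array of shape (num_windows, w * w, C).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)


def window_merge(windows, w, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)


def window_linear_mapping(x, weight, w, shift=0):
    """Mix tokens inside each window with a fixed (w*w, w*w) matrix.

    Assumption: unlike self-attention, `weight` does not depend on the input.
    A non-zero `shift` cyclically rolls the map first, so that consecutive
    layers see different window boundaries (no boundary masking here).
    """
    H, W, _ = x.shape
    if shift:
        x = np.roll(x, (-shift, -shift), axis=(0, 1))
    windows = window_partition(x, w)
    # out[n, i, c] = sum_j weight[i, j] * windows[n, j, c]
    mixed = np.einsum("ij,njc->nic", weight, windows)
    out = window_merge(mixed, w, H, W)
    if shift:
        out = np.roll(out, (shift, shift), axis=(0, 1))
    return out


# Toy example: an 8 x 8 map with 4 channels and 4 x 4 windows.
rng = np.random.default_rng(0)
H = W = 8
C = 4
w = 4
x = rng.standard_normal((H, W, C))
weight = rng.standard_normal((w * w, w * w)) / (w * w)

y_plain = window_linear_mapping(x, weight, w)                # regular windows
y_shift = window_linear_mapping(x, weight, w, shift=w // 2)  # shifted windows
```

Interleaving the regular and shifted calls mirrors the macro architecture the abstract describes: each call aggregates information only within a window, and the shift lets information cross window boundaries from one layer to the next.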

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Stochastic Depth · Swin Transformer · Adam · Layer Normalization · Byte Pair Encoding