Vision Transformers with Patch Diversification

Chengyue Gong; Dilin Wang; Meng Li; Vikas Chandra; Qiang Liu

arXiv:2104.12753 · cs.CV · June 14, 2021 · 42 citations

TL;DR

This paper introduces a training approach for vision transformers that promotes diversity among patch representations, leading to more stable training and improved performance on downstream tasks such as transfer learning and semantic segmentation.

Contribution

It proposes new loss functions to explicitly encourage patch diversity, stabilizing training without modifying the transformer architecture.

Findings

01 Stabilizes training of wider and deeper vision transformers.

02 Enhances transfer learning performance on downstream tasks.

03 Improves state-of-the-art results in semantic segmentation.

Abstract

Vision transformer has demonstrated promising performance on challenging computer vision tasks. However, directly training the vision transformers may yield unstable and sub-optimal results. Recent works propose to improve the performance of the vision transformers by modifying the transformer structures, e.g., incorporating convolution layers. In contrast, we investigate an orthogonal approach to stabilize the vision transformer training without modifying the networks. We observe the instability of the training can be attributed to the significant similarity across the extracted patch representations. More specifically, for deep vision transformers, the self-attention blocks tend to map different patches into similar latent representations, yielding information loss and performance degradation. To alleviate this problem, in this work, we introduce novel loss functions in vision…
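
The similarity problem described in the abstract can be made concrete with a small diagnostic. The sketch below is an illustration only, not the authors' loss (the abstract is truncated here before the losses are defined): it computes the mean absolute pairwise cosine similarity among the patch vectors of one image. The function names `cosine` and `mean_patch_similarity` are hypothetical, and the toy vectors stand in for real patch representations. A value near 1 means the patches have collapsed onto nearly the same direction; a value near 0 means they are diverse.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-12)


def mean_patch_similarity(patches):
    """Average absolute cosine similarity over all distinct patch pairs.

    `patches` is a list of patch vectors (one list of floats per patch).
    Illustrative diagnostic only; not the loss defined in the paper.
    """
    n = len(patches)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(cosine(patches[i], patches[j]))
    return total / (n * (n - 1) / 2)


# Toy inputs (assumed for illustration): three patches pointing the same
# way, versus three mutually orthogonal patches.
collapsed = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.5, 1.0, 1.5]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(mean_patch_similarity(collapsed))  # close to 1: patches are redundant
print(mean_patch_similarity(diverse))    # 0.0: patches are orthogonal
```

A quantity of this kind could in principle be added to the training objective as a penalty, which is the general idea behind encouraging patch diversity through the loss rather than through architectural changes. In a PyTorch training loop it would be written with differentiable tensor operations instead of Python lists.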

Peer Reviews

No public reviews on file for this paper yet.

Code & Models

Repositories

ChengyueGongR/PatchVisionTransformer
PyTorch · Official

Videos

No videos yet.

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Softmax · Multi-Head Attention · Vision Transformer · Convolution