Vision Transformers with Patch Diversification
Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu

TL;DR
This paper introduces a training approach for vision transformers that adds loss terms promoting diversity among patch representations, which stabilizes training and improves performance on tasks such as transfer learning and semantic segmentation.
Contribution
It proposes new loss functions to explicitly encourage patch diversity, stabilizing training without modifying the transformer architecture.
Findings
Stabilizes training of wider and deeper vision transformers.
Enhances transfer learning performance on downstream tasks.
Improves state-of-the-art results in semantic segmentation.
Abstract
Vision transformers have demonstrated promising performance on challenging computer vision tasks. However, directly training vision transformers may yield unstable and sub-optimal results. Recent works propose to improve the performance of vision transformers by modifying the transformer structures, e.g., by incorporating convolution layers. In contrast, we investigate an orthogonal approach that stabilizes vision transformer training without modifying the networks. We observe that the training instability can be attributed to the significant similarity across the extracted patch representations. More specifically, in deep vision transformers, the self-attention blocks tend to map different patches into similar latent representations, yielding information loss and performance degradation. To alleviate this problem, in this work, we introduce novel loss functions in vision…
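The abstract attributes the instability to patch representations becoming too similar to one another. As a rough illustration only, the sketch below measures that similarity as the mean absolute cosine similarity between all pairs of patch vectors, and shows how such a score could be added to a task loss as a penalty. The function names, the `weight` parameter, and this particular penalty are assumptions made for illustration; the abstract is truncated before it describes the paper's loss functions, so this should not be read as the authors' formulation.

```python
import numpy as np


def mean_abs_cosine_similarity(patches: np.ndarray, eps: float = 1e-8) -> float:
    """Average |cosine similarity| over all ordered pairs of distinct patches.

    patches: array of shape (n_patches, dim) with n_patches >= 2.
    Returns a value in [0, 1]; higher means the patches are more alike.
    """
    # Normalize each patch vector to unit length (eps guards against zero vectors).
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, eps)

    # Pairwise cosine similarities, shape (n_patches, n_patches).
    sim = np.abs(unit @ unit.T)

    # Exclude the diagonal (each patch compared with itself).
    n = patches.shape[0]
    off_diagonal_sum = sim.sum() - np.trace(sim)
    return float(off_diagonal_sum / (n * (n - 1)))


def loss_with_diversity_penalty(task_loss: float,
                                patches: np.ndarray,
                                weight: float = 1.0) -> float:
    """Hypothetical combined objective: task loss plus a similarity penalty."""
    return task_loss + weight * mean_abs_cosine_similarity(patches)


if __name__ == "__main__":
    collapsed = np.ones((4, 8))   # every patch identical: maximal similarity
    diverse = np.eye(4)           # mutually orthogonal patches: no similarity
    print(mean_abs_cosine_similarity(collapsed))
    print(mean_abs_cosine_similarity(diverse))
```

In this sketch, a set of identical patch vectors scores close to 1 and a set of mutually orthogonal vectors scores 0, so minimizing the penalty pushes representations apart without touching the network architecture. In an actual training loop the same computation would be written in an autodiff framework so that gradients flow through it.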
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Softmax · Multi-Head Attention · Vision Transformer · Convolution
