Scaled ReLU Matters for Training Vision Transformers
Pichao Wang, Xue Wang, Hao Luo, Jingkai Zhou, Zhipeng Zhou, Fan Wang, Hao Li, and Rong Jin

TL;DR
This paper shows that the scaled ReLU operation in the convolutional stem (conv-stem) of vision transformers, rather than early convolutions alone, stabilizes training and improves performance.
Contribution
It identifies scaled ReLU in the conv-stem as the component that matters for stable training and improved performance of vision transformers, and supports this with theoretical and empirical analysis.
Findings
Scaled ReLU in conv-stem stabilizes training.
Enhanced diversity of patch tokens boosts performance.
Previous ViTs are undertrained, indicating potential for improvement.
Abstract
Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as learning rate, optimizer, and warmup epochs. The reasons for this difficulty are empirically analysed in [xiao2021early], whose authors conjecture that the issue lies with the patchify-stem of ViT models and propose that early convolutions help transformers see better. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help stabilize training; it is the scaled ReLU operation in the convolutional stem (conv-stem) that matters. We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stabilization, but also increases the…
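The abstract does not spell out the operation, so the following is only a minimal sketch under a stated assumption: that "scaled ReLU" means a ReLU applied after a scale and shift, max(alpha * x + beta, 0), placed after each stem convolution. The names `alpha`, `beta`, `scaled_relu`, `conv1d`, and `conv_stem_layer` are hypothetical, and a 1-D convolution over a short list stands in for the 2-D image convolutions of a real conv-stem.

```python
from typing import List


def scaled_relu(x: List[float], alpha: float = 1.0, beta: float = 0.0) -> List[float]:
    """Apply max(alpha * v + beta, 0) to every element of x.

    Assumption: alpha (scale) and beta (shift) are illustrative names. In a
    real network they would be learnable, for example the affine parameters
    of a normalization layer placed before the ReLU.
    """
    return [max(alpha * v + beta, 0.0) for v in x]


def conv1d(signal: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Valid (no padding) 1-D cross-correlation, standing in for a stem convolution."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def conv_stem_layer(
    signal: List[float],
    kernel: List[float],
    alpha: float,
    beta: float,
    stride: int = 2,
) -> List[float]:
    """One toy conv-stem layer: strided convolution followed by scaled ReLU."""
    return scaled_relu(conv1d(signal, kernel, stride), alpha, beta)


if __name__ == "__main__":
    pixels = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]
    print(conv_stem_layer(pixels, kernel=[1.0, -1.0], alpha=2.0, beta=0.5))
```

With `alpha=1.0` and `beta=0.0` the function reduces to a plain ReLU, which makes it easy to compare the two variants in the same toy stem.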
Taxonomy
Topics: Advanced Neural Network Applications · Infrared Target Detection Methodologies · CCD and CMOS Imaging Sensors
