When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
Xiangning Chen, Cho-Jui Hsieh, Boqing Gong

TL;DR
This paper shows that Vision Transformers (ViTs) and MLP-Mixers, trained from scratch without large-scale pre-training or strong data augmentation, can outperform ResNets of similar size once a sharpness-aware optimizer is used to promote a smoother loss landscape, improving both accuracy and robustness.
Contribution
It analyzes ViTs and MLP-Mixers through the lens of loss geometry and applies a recently proposed sharpness-aware optimizer to improve their data efficiency and generalization without relying on large-scale pre-training.
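To make the idea of a sharpness-aware update concrete, here is a minimal sketch in pure Python. It assumes the optimizer follows the usual two-step pattern of sharpness-aware minimization: first perturb the weights toward higher loss within a small L2 ball, then descend using the gradient taken at the perturbed point. The names `sam_step`, `toy_loss`, `toy_grad`, and the values of `lr` and `rho` are illustrative assumptions, not taken from the paper, and the quadratic loss merely stands in for a real network.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One sharpness-aware update on a flat list of weights (illustrative sketch)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: no direction to perturb along
    # Ascent step: move to the approximate worst-case point in an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Hypothetical toy loss L(w) = 0.5 * (10 * w0^2 + w1^2), standing in for a network loss.
def toy_loss(w):
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def toy_grad(w):
    return [10.0 * w[0], w[1]]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, toy_grad, lr=0.05, rho=0.05)
final_loss = toy_loss(w)
```

Each update needs two gradient evaluations, so a step costs roughly twice as much as a plain gradient step. In a real training loop `grad_fn` would be a minibatch gradient from an autodiff framework.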
Findings
With sharpness-aware optimization, ViTs and MLP-Mixers achieve higher accuracy without large-scale pre-training or strong data augmentation.
Sharpness-aware optimization improves robustness and training stability.
The resulting models outperform ResNets of similar size when trained from scratch on ImageNet.
Abstract
Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3%…
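One common way to quantify how sharp a minimum is uses the dominant eigenvalue of the loss Hessian. The sketch below estimates it by power iteration, with Hessian-vector products approximated by central finite differences of the gradient. This is an illustration of the general technique, not the paper's measurement code: the function names, the step size `eps`, and the toy gradient are all assumptions.

```python
import math

def hvp(grad_fn, w, v, eps=1e-4):
    """Approximate H v by central differences: (g(w + eps v) - g(w - eps v)) / (2 eps)."""
    g_plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    g_minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]

def top_hessian_eigenvalue(grad_fn, w, iters=100):
    """Power iteration for the Hessian eigenvalue of largest magnitude at w."""
    v = [1.0 / math.sqrt(len(w))] * len(w)  # unit-norm starting vector
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        norm = math.sqrt(sum(x * x for x in hv))
        if norm == 0.0:
            return 0.0  # H v vanished: the landscape is flat along v
        lam = sum(a * b for a, b in zip(v, hv))  # Rayleigh quotient, v has unit norm
        v = [x / norm for x in hv]
    return lam

# Hypothetical toy gradient of L(w) = 0.5 * (10 * w0^2 + w1^2); its Hessian is diag(10, 1).
def quad_grad(w):
    return [10.0 * w[0], w[1]]

sharpness = top_hessian_eigenvalue(quad_grad, [0.3, -0.2])
```

At a converged minimum the Hessian is positive semi-definite, so the eigenvalue of largest magnitude is also the largest eigenvalue. A larger value indicates a sharper minimum.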
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
