When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations

Xiangning Chen; Cho-Jui Hsieh; Boqing Gong

arXiv:2106.01548 · cs.CV · March 15, 2022 · 103 cites

TL;DR

This paper demonstrates that Vision Transformers and MLP-Mixers trained from scratch, without large-scale pre-training or strong data augmentations, can outperform ResNets when training promotes a smoother loss landscape, which yields better accuracy and robustness.

Contribution

It analyzes ViTs and MLP-Mixers from a loss-geometry perspective and applies a recently proposed sharpness-aware optimizer to improve their data efficiency and generalization without relying on large-scale pre-training.
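A sharpness-aware update can be sketched roughly as follows. This is a minimal NumPy illustration of the general two-step idea (perturb the weights toward higher loss, then descend using the gradient taken at the perturbed point), not the paper's training code. The names `sam_step`, `toy_loss`, `toy_grad`, the matrix `A`, and the values of `lr` and `rho` are all hypothetical and chosen only for the example.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.05, rho=0.05, eps=1e-12):
    """One sharpness-aware update on parameters w (illustrative sketch)."""
    g = grad_fn(w)                               # gradient at the current weights
    e = rho * g / (np.linalg.norm(g) + eps)      # small step toward higher nearby loss
    g_perturbed = grad_fn(w + e)                 # gradient at the perturbed weights
    return w - lr * g_perturbed                  # descend using the perturbed gradient

# Hypothetical toy loss 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([10.0, 0.1])

def toy_loss(w):
    return 0.5 * float(w @ A @ w)

def toy_grad(w):
    return A @ w

w = np.array([1.0, 1.0])
initial_loss = toy_loss(w)
for _ in range(50):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
```

Each update costs two gradient evaluations instead of one, which is the main overhead of this family of optimizers compared with plain gradient descent.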

Findings

01

With sharpness-aware optimization, ViTs and MLP-Mixers achieve higher accuracy without large-scale pre-training.

02

Sharpness-aware optimization improves robustness and training stability.

03

When trained from scratch on ImageNet, the resulting models outperform ResNets of similar size.

Abstract

Vision Transformers (ViTs) and MLPs signal further efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures. Existing works empower the models by massive data, such as large-scale pre-training and/or repeated strong data augmentations, and still report optimization-related problems (e.g., sensitivity to initialization and learning rates). Hence, this paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference. Visualization and Hessian reveal extremely sharp local minima of converged models. By promoting smoothness with a recently proposed sharpness-aware optimizer, we substantially improve the accuracy and robustness of ViTs and MLP-Mixers on various tasks spanning supervised, adversarial, contrastive, and transfer learning (e.g., +5.3%…
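One common way to put a number on the sharpness of a minimum, assumed here for illustration rather than taken from the abstract, is the largest eigenvalue of the loss Hessian. The sketch below estimates it with power iteration on finite-difference Hessian-vector products. The helper names `hvp` and `top_hessian_eigenvalue`, the toy gradient, and the step size `h` are hypothetical.

```python
import numpy as np

def hvp(grad_fn, w, v, h=1e-4):
    """Approximate the Hessian-vector product H v by central differences of the gradient."""
    return (grad_fn(w + h * v) - grad_fn(w - h * v)) / (2.0 * h)

def top_hessian_eigenvalue(grad_fn, w, iters=100, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at w by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        lam = float(v @ hv)              # Rayleigh quotient for the current unit vector
        norm = np.linalg.norm(hv)
        if norm == 0.0:                  # v lies in the Hessian's null space; stop early
            break
        v = hv / norm
    return lam

# Hypothetical quadratic loss 0.5 * w^T A w, whose Hessian is exactly A.
A = np.diag([10.0, 0.1])

def toy_grad(w):
    return A @ w

sharpness = top_hessian_eigenvalue(toy_grad, np.array([1.0, 1.0]))
```

For a real network the gradient function would come from automatic differentiation, and the Hessian is never formed explicitly; only products with a vector are needed.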

Videos

When Vision Transformers Outperform ResNets without Pretraining | Paper Explained · YouTube

When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning