Patches Are All You Need?
Asher Trockman, J. Zico Kolter

TL;DR
This paper introduces ConvMixer, a simple convolutional model that operates directly on image patches. It outperforms ViT, MLP-Mixer, and some of their variants at similar parameter counts and dataset sizes, which suggests that the patch-based input representation, and not only the Transformer architecture, contributes to strong performance on vision tasks.
Contribution
ConvMixer is a simple convolutional architecture that operates directly on patches. Its results against ViT and related models challenge the notion that Transformers are inherently superior for vision.
Findings
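ConvMixer outperforms ViT, MLP-Mixer, and some of their variants at similar parameter counts and dataset sizes, and also outperforms classical vision models such as ResNet.
ConvMixer uses only standard convolutions for its mixing steps.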
Patch-based input and convolutional mixing are highly effective for vision models.
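To make "patch-based input" concrete, here is a minimal sketch of a patch embedding in plain Python. It is not the authors' code. It assumes a single-channel image stored as nested lists, and the helper names `extract_patches` and `embed_patches`, the 4x4 image, and the weight values are all hypothetical, chosen for illustration. Each non-overlapping p x p patch is flattened and linearly projected, which is the same computation as a convolution with kernel size p and stride p.

```python
from typing import List

Image = List[List[float]]


def extract_patches(image: Image, p: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping p x p patches, each flattened row-major."""
    h, w = len(image), len(image[0])
    if h % p != 0 or w % p != 0:
        raise ValueError("image height and width must be divisible by the patch size")
    patches = []
    for top in range(0, h, p):
        for left in range(0, w, p):
            patches.append(
                [image[top + i][left + j] for i in range(p) for j in range(p)]
            )
    return patches


def embed_patches(patches: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Project each flattened patch with a (dim x p*p) weight matrix (no bias)."""
    return [
        [sum(w * x for w, x in zip(row, patch)) for row in weights]
        for patch in patches
    ]


# Illustrative 4 x 4 image with pixel values 0..15 (assumed data, not from the paper).
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = extract_patches(image, 2)

# Two hypothetical output features: the patch mean and the top-left pixel.
weights = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
tokens = embed_patches(patches, weights)

for patch, token in zip(patches, tokens):
    print(patch, "->", token)
```

The 4x4 image becomes four tokens of two features each. In a trained model the projection weights would be learned rather than fixed as they are here.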
Abstract
Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it…
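The link between quadratic self-attention cost and patch size can be checked with a little arithmetic. The sketch below counts tokens and pairwise attention interactions for a square image. The 224-pixel image size and the patch sizes 1, 7, and 16 are assumed values for illustration, not figures from this page, and `num_patches` and `attention_pairs` are hypothetical helper names.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    side = image_size // patch_size
    return side * side


def attention_pairs(num_tokens: int) -> int:
    """Token-to-token interactions in one self-attention layer: quadratic in token count."""
    return num_tokens * num_tokens


IMAGE_SIZE = 224  # assumed input resolution, for illustration only

for patch_size in (1, 7, 16):
    tokens = num_patches(IMAGE_SIZE, patch_size)
    pairs = attention_pairs(tokens)
    print(f"patch {patch_size:>2}: {tokens:>6} tokens, {pairs:>13,} attention pairs")
```

Under these assumptions, attending over individual pixels (patch size 1) means 50,176 tokens and about 2.5 billion pairs per layer, while 16x16 patches give 196 tokens and 38,416 pairs. This is why patch embeddings are needed to apply self-attention to larger images.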
Taxonomy
Topics: Advanced Neural Network Applications · Face Recognition and Perception · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Average Pooling · Adam · 1x1 Convolution · Global Average Pooling · Label Smoothing
