Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Feng Wang, Yaodong Yu, Guoyizhe Wei, Wei Shao, Yuyin Zhou, Alan Yuille, Cihang Xie

TL;DR
This paper investigates how decreasing patch size in vision transformers improves performance, revealing a scaling law that extends to pixel-level tokenization and enables processing extremely long sequences with high accuracy.
Contribution
The study identifies a scaling law in patchification: smaller patches consistently improve model performance across tasks and architectures, and the authors scale input sequences up to 50,176 tokens.
Findings
Smaller patches lead to better predictive accuracy.
Models keep improving as patch size decreases, down to the minimum patch size of 1x1 (pixel tokenization).
The model achieves 84.6% accuracy on ImageNet-1k with a sequence of 50,176 tokens.
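To make the sequence lengths above concrete, here is a minimal sketch of how token count grows as patch size shrinks. It assumes a square 224x224 input and non-overlapping square patches; the resolution is not stated in this summary, but 224 x 224 = 50,176 matches the reported token count at a 1x1 patch size. The helper name `num_tokens` is hypothetical and used only for illustration.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image.

    Assumes image_size is divisible by patch_size (no padding or cropping).
    """
    if patch_size <= 0 or image_size % patch_size != 0:
        raise ValueError("patch_size must be positive and divide image_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


if __name__ == "__main__":
    IMAGE_SIZE = 224  # assumed input resolution, not given in the summary
    for patch_size in (16, 8, 4, 2, 1):
        n = num_tokens(IMAGE_SIZE, patch_size)
        # Self-attention compares every token with every other token,
        # so the number of token pairs grows as n squared.
        print(f"patch {patch_size:>2}x{patch_size:<2} -> {n:>6} tokens, {n * n:>13} token pairs")
```

Halving the patch size quadruples the number of tokens, so under these assumptions going from 16x16 patches (196 tokens) to 1x1 patches (50,176 tokens) lengthens the sequence by a factor of 256.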
Abstract
Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a de facto image tokenization approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1x1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Digital Media Forensic Detection
