Pre-training Vision Transformers with Formula-driven Supervised Learning
Hirokatsu Kataoka, Sora Takashima, Ryo Hayamizu, Ryosuke Yamada, Kodai Nakashima, Xinyu Zhang, Edgar Josafat Martinez-Noriega, Nakamasa Inoue, Rio Yokota

TL;DR
This paper shows that formula-driven supervised learning (FDSL) can pre-train vision transformers effectively without real images, matching or exceeding ImageNet-21k pre-training and approaching JFT-300M, and it examines the factors that influence FDSL performance.
Contribution
It establishes FDSL as a viable alternative to real-image datasets for pre-training vision transformers, reports competitive fine-tuning results on ImageNet-1k, and analyzes the key factors affecting performance.
Findings
FDSL matches or exceeds ImageNet-21k pre-training and approaches JFT-300M (83.8% vs. 83.0% and 84.1% top-1 on ImageNet-1k with ViT-Base).
Synthetic images from formulas avoid privacy, copyright, and bias issues.
Increasing task difficulty improves fine-tuning accuracy.
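To make the idea of "images from formulas" concrete, here is a minimal sketch of a formula-driven dataset. It assumes the formulas are iterated function systems (IFS) made of random contractive affine maps, and that each sampled parameter set defines one class. The function names (`sample_ifs`, `render`, `make_dataset`), the parameter ranges, and the 32x32 binary rendering are all illustrative assumptions, not the authors' pipeline.

```python
import random

def sample_ifs(rng, n_maps=3):
    """Draw random affine maps (a, b, c, d, e, f). One parameter set = one class (assumption)."""
    maps = []
    for _ in range(n_maps):
        # Entries in [-0.45, 0.45] keep the Frobenius norm below 1, so every map is contractive.
        a, b, c, d = (rng.uniform(-0.45, 0.45) for _ in range(4))
        e, f = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        maps.append((a, b, c, d, e, f))
    return maps

def render(maps, rng, n_points=5000, size=32, burn_in=20):
    """Run the chaos game and rasterize the visited points into a size x size binary image."""
    x, y = 0.0, 0.0
    pts = []
    for i in range(n_points + burn_in):
        a, b, c, d, e, f = rng.choice(maps)
        x, y = a * x + b * y + e, c * x + d * y + f
        if i >= burn_in:  # skip the first few points, which may lie off the attractor
            pts.append((x, y))
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    img = [[0] * size for _ in range(size)]
    for px, py in pts:
        col = min(size - 1, int((px - x0) / (x1 - x0 + 1e-12) * size))
        row = min(size - 1, int((py - y0) / (y1 - y0 + 1e-12) * size))
        img[row][col] = 1
    return img

def make_dataset(n_classes=4, per_class=3, seed=0):
    """Return (image, label) pairs; labels come from the formula, not from human annotation."""
    rng = random.Random(seed)
    data = []
    for label in range(n_classes):
        maps = sample_ifs(rng)
        for _ in range(per_class):
            # Simplification: instances of a class differ only in the random point sequence.
            data.append((render(maps, rng), label))
    return data

if __name__ == "__main__":
    dataset = make_dataset()
    print(len(dataset), "synthetic images,", len({lbl for _, lbl in dataset}), "classes")
```

Because both the image and its label are derived from the sampled parameters, no photograph or annotator is involved, which is the property behind the privacy, copyright, and labeling-cost claims above. In this sketch, raising `n_classes` is one hypothetical way to make the pre-training task harder.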
Abstract
In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k and can approach that of the JFT-300M dataset without the use of real images, human supervision, or self-supervision during the pre-training of vision transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k and JFT-300M showed 83.0% and 84.1% top-1 accuracy, respectively, when fine-tuned on ImageNet-1k, and FDSL showed 83.8% top-1 accuracy when pre-trained under comparable conditions (hyperparameters and number of epochs). Notably, ExFractalDB-21k pre-training used 14.2x fewer images than JFT-300M. Images generated by formulas avoid privacy and copyright issues, labeling costs and errors, and biases that real images suffer from, and thus have tremendous potential for pre-training general models. To understand the…
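The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes JFT-300M contains a nominal 300 million images, as its name suggests; the helper names `implied_image_count` and `accuracy_gap` are hypothetical and exist only for this illustration.

```python
JFT_300M_IMAGES = 300_000_000  # assumption: nominal size implied by the dataset name
REDUCTION_FACTOR = 14.2        # "14.2x fewer images", from the abstract

# ViT-Base top-1 accuracy (%) on ImageNet-1k after fine-tuning, from the abstract.
TOP1 = {"ImageNet-21k": 83.0, "JFT-300M": 84.1, "FDSL": 83.8}

def implied_image_count(reference_images, factor):
    """Number of images implied by using `factor` times fewer than the reference dataset."""
    return reference_images / factor

def accuracy_gap(baseline):
    """FDSL accuracy minus the baseline's accuracy, in percentage points."""
    return round(TOP1["FDSL"] - TOP1[baseline], 1)

if __name__ == "__main__":
    n = implied_image_count(JFT_300M_IMAGES, REDUCTION_FACTOR)
    print(f"Implied ExFractalDB-21k size: about {n / 1e6:.1f}M images")
    for name in ("ImageNet-21k", "JFT-300M"):
        print(f"FDSL vs {name}: {accuracy_gap(name):+.1f} points")
```

Under that assumption, the pre-training set comes to roughly 21 million images, and FDSL sits 0.8 points above ImageNet-21k and 0.3 points below JFT-300M, which is what "match or even exceed" and "approach" refer to.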
Taxonomy
Topics: Image Retrieval and Classification Techniques · Digital Imaging for Blood Diseases · Retinal Imaging and Analysis
