Pre-training Vision Transformers with Formula-driven Supervised Learning

Hirokatsu Kataoka; Sora Takashima; Ryo Hayamizu; Ryosuke Yamada; Kodai Nakashima; Xinyu Zhang; Edgar Josafat Martinez-Noriega; Nakamasa Inoue; Rio Yokota

arXiv:2206.09132 · cs.CV · December 29, 2025 · 1 cites

Open Access

TL;DR

This paper demonstrates that formula-driven supervised learning (FDSL) can pre-train vision transformers effectively without real images, matching or exceeding ImageNet-21k pre-training and approaching JFT-300M, and explores the factors that influence its performance.

Contribution

It shows that FDSL is a viable alternative to real-image datasets for pre-training vision transformers, reports competitive fine-tuning results on ImageNet-1k, and analyzes the key factors affecting performance.

Findings

01

FDSL matches or exceeds ImageNet-21k pre-training and approaches JFT-300M performance.

02

Synthetic images from formulas avoid privacy, copyright, and bias issues.

03

Increasing task difficulty improves fine-tuning accuracy.

Abstract

In the present work, we show that the performance of formula-driven supervised learning (FDSL) can match or even exceed that of ImageNet-21k and can approach that of the JFT-300M dataset without the use of real images, human supervision, or self-supervision during the pre-training of vision transformers (ViTs). For example, ViT-Base pre-trained on ImageNet-21k and JFT-300M showed 83.0% and 84.1% top-1 accuracy, respectively, when fine-tuned on ImageNet-1k, and FDSL showed 83.8% top-1 accuracy when pre-trained under comparable conditions (hyperparameters and number of epochs). Notably, the ExFractalDB-21k pre-training used 14.2x fewer images than JFT-300M. Images generated by formulas avoid the privacy and copyright issues, labeling costs and errors, and biases that real images suffer from, and thus have tremendous potential for pre-training general models. To understand the…
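The figures quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the paper: it assumes JFT-300M holds its nominal 300 million images, and the helper names (`accuracy_gap`, `implied_image_count`) are hypothetical and introduced here only for illustration.

```python
# Top-1 accuracy (%) on ImageNet-1k after fine-tuning ViT-Base,
# as quoted in the abstract for each pre-training source.
TOP1 = {"ImageNet-21k": 83.0, "JFT-300M": 84.1, "FDSL": 83.8}

# Assumption: JFT-300M contains its nominal 300 million images.
JFT_300M_IMAGES = 300_000_000

# Reduction factor quoted in the abstract for ExFractalDB-21k.
REDUCTION_FACTOR = 14.2


def accuracy_gap(method: str, baseline: str) -> float:
    """Top-1 difference in percentage points (method minus baseline)."""
    return round(TOP1[method] - TOP1[baseline], 1)


def implied_image_count(total_images: int, factor: float) -> int:
    """Image count implied by using `factor` times fewer images."""
    return round(total_images / factor)


if __name__ == "__main__":
    print("FDSL vs ImageNet-21k:", accuracy_gap("FDSL", "ImageNet-21k"))
    print("FDSL vs JFT-300M:", accuracy_gap("FDSL", "JFT-300M"))
    print("Implied ExFractalDB-21k images:",
          implied_image_count(JFT_300M_IMAGES, REDUCTION_FACTOR))
```

Under these assumptions, FDSL sits 0.8 points above ImageNet-21k and 0.3 points below JFT-300M, and the 14.2x reduction implies roughly 21 million pre-training images.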

Taxonomy

Topics: Image Retrieval and Classification Techniques · Digital Imaging for Blood Diseases · Retinal Imaging and Analysis