Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?
Xinchen Yan, Chen Liang, Lijun Yu, Adams Wei Yu, Yifeng Lu, Quoc V. Le

TL;DR
This paper explores how autoregressive next-pixel prediction models scale with compute, data, and resolution, revealing that compute is the main bottleneck and forecasting future capabilities of pixel-level image modeling.
Contribution
It provides a detailed analysis of scaling strategies for autoregressive image models, highlighting task-dependent optimal scaling and the dominance of compute as a bottleneck.
Findings
Optimal scaling strategies differ for classification and generation tasks.
Model size must grow faster than data size with increasing resolution.
Compute is identified as the primary bottleneck for scaling image models.
Abstract
This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at a resolution of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed resolution of 32x32, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size…
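The IsoFlops setup mentioned in the abstract can be illustrated with a rough sketch: for one fixed compute budget, each candidate model size can afford a different amount of training data. The sketch below is not the authors' code. It assumes the widely used approximation C ≈ 6·N·D for dense Transformer training cost (N parameters, D training tokens) and assumes that every sub-pixel value of an RGB image is one token, so a 32x32 image becomes 3072 tokens. The helper names `tokens_per_image` and `isoflop_profile` and the model sizes are hypothetical.

```python
# Hypothetical IsoFLOP helper. The 6*N*D rule is a common approximation for
# dense Transformer training cost; it is an assumption, not taken from the paper.

def tokens_per_image(resolution: int, channels: int = 3) -> int:
    """Sequence length if every sub-pixel value is one token (an assumption)."""
    return resolution * resolution * channels


def isoflop_profile(compute_budget: float, model_sizes: list, resolution: int = 32) -> list:
    """For a fixed FLOP budget, compute the data each model size can afford."""
    seq_len = tokens_per_image(resolution)
    rows = []
    for n_params in model_sizes:
        # Rearranged C ~= 6 * N * D to get the affordable token count D.
        n_tokens = compute_budget / (6 * n_params)
        rows.append({
            "params": n_params,
            "tokens": n_tokens,
            "images": n_tokens / seq_len,  # image-equivalents seen in training
        })
    return rows


# Illustrative model sizes only; 7e19 FLOPs is the largest budget in the abstract.
profile = isoflop_profile(7e19, [10**8, 3 * 10**8, 10**9])
for row in profile:
    print(f"{row['params']:.1e} params -> {row['tokens']:.2e} tokens "
          f"(~{row['images']:.2e} images at 32x32)")
```

Training each point on such a profile and plotting the chosen metric against model size gives one IsoFLOP curve. Its minimum (or maximum, for accuracy) marks the best model-to-data split for that budget and metric, which is how the optimum can differ between classification and generation.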
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Visual perception and processing mechanisms
