Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

Xinchen Yan; Chen Liang; Lijun Yu; Adams Wei Yu; Yifeng Lu; Quoc V. Le

arXiv:2511.08704·cs.CV·May 19, 2026

TL;DR

This paper studies how autoregressive next-pixel prediction models scale with compute, data, and image resolution. It identifies compute as the main bottleneck and forecasts the future capabilities of pixel-level image modeling.
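To make the term "next-pixel prediction" concrete, here is a minimal sketch of how an image can be turned into input and target sequences for an autoregressive model. It assumes raster-scan order and one token per 8-bit channel value; the paper's actual tokenization may differ, and the helper name `next_pixel_pairs` is hypothetical.

```python
import numpy as np

def next_pixel_pairs(image):
    """Flatten an (H, W, C) uint8 image in raster order and return
    (inputs, targets), where targets is inputs shifted by one position."""
    # Row-major flattening visits row, then column, then channel.
    tokens = image.reshape(-1).astype(np.int64)
    return tokens[:-1], tokens[1:]

# Illustrative random image at the 32x32 resolution mentioned in the abstract.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
inputs, targets = next_pixel_pairs(image)
print(inputs.shape, targets.shape)  # (3071,) (3071,)
```

Under these assumptions a 32x32 RGB image yields 32 * 32 * 3 = 3072 tokens, so the model sees 3071 prediction steps per image.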

Contribution

The paper analyzes scaling strategies for autoregressive image models in detail, showing that the optimal strategy is task-dependent and that compute is the dominant bottleneck.

Findings

01. Optimal scaling strategies differ for classification and generation tasks.

02. Model size must grow faster than data size with increasing resolution.

03. Compute is identified as the primary bottleneck for scaling image models.

Abstract

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at a resolution of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: the next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, the optimal scaling strategy is critically task-dependent. Even at a fixed resolution of 32x32, the optimal scaling properties for image classification and image generation diverge: the generation-optimal setup requires the data size to grow three to five times faster than the classification-optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size…
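The IsoFlops analysis mentioned in the abstract can be illustrated with a small sketch. It follows the commonly used recipe of fitting a parabola to loss versus log model size at each fixed compute budget, then fitting a power law to the resulting optima; whether the paper uses exactly this procedure is an assumption. The data are synthetic, the exponent 0.5 is illustrative and not a result from the paper, and the function names are hypothetical.

```python
import numpy as np

def isoflop_optimum(param_counts, losses):
    """Fit loss = a*x^2 + b*x + c with x = log10(params) and return
    the parameter count at the parabola's vertex."""
    x = np.log10(param_counts)
    a, b, _ = np.polyfit(x, losses, 2)
    return 10.0 ** (-b / (2.0 * a))

def fit_power_law(compute, optimum):
    """Fit optimum = k * compute**exponent by linear regression in log-log space."""
    exponent, log_k = np.polyfit(np.log10(compute), np.log10(optimum), 1)
    return exponent, 10.0 ** log_k

# Synthetic IsoFlops profiles: the true optimum is N* = C**0.5 (illustrative only).
budgets = np.array([1e17, 1e18, 1e19])
optima = []
for c in budgets:
    true_opt = c ** 0.5
    params = true_opt * np.array([0.25, 0.5, 1.0, 2.0, 4.0])
    # Loss is an exact parabola in log10(params), minimized at true_opt.
    losses = 3.0 + 0.1 * (np.log10(params) - np.log10(true_opt)) ** 2
    optima.append(isoflop_optimum(params, losses))

exponent, k = fit_power_law(budgets, np.array(optima))
print(f"fitted exponent: {exponent:.3f}")
```

Running the same fit once per target metric (prediction loss, classification accuracy, Fréchet Distance) is one way the task-dependent exponents described above could be compared.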

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Visual perception and processing mechanisms