Next Patch Prediction for Autoregressive Visual Generation
Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan

TL;DR
This paper introduces Next Patch Prediction (NPP), a hierarchical patch-based autoregressive approach that reduces training costs and improves image generation quality without altering the original model architecture.
Contribution
We propose a novel NPP paradigm that groups image tokens into patches, employs a multi-scale coarse-to-fine strategy, and maintains the original model structure, enhancing efficiency and performance.
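To make the coarse-to-fine idea concrete, here is a minimal sketch of a patch-size schedule. It is an illustration only: the stage sizes (4, 2, 1), the fraction of training given to each stage, and the names `patch_size_for_step` and `sequence_length` are assumptions, not values or functions taken from the paper. A patch size of 1 corresponds to plain next token prediction.

```python
from typing import Sequence, Tuple

# Hypothetical schedule: (patch side length, fraction of training steps),
# ordered coarse to fine. These numbers are illustrative assumptions.
DEFAULT_STAGES: Tuple[Tuple[int, float], ...] = ((4, 0.25), (2, 0.25), (1, 0.5))


def patch_size_for_step(
    step: int,
    total_steps: int,
    stages: Sequence[Tuple[int, float]] = DEFAULT_STAGES,
) -> int:
    """Return the patch side length to use at a given training step."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    boundary = 0.0
    for patch, fraction in stages:
        boundary += fraction
        if step < boundary * total_steps:
            return patch
    # Guard against rounding when the fractions do not sum exactly to 1.
    return stages[-1][0]


def sequence_length(grid: int, patch: int) -> int:
    """Number of patch tokens for a grid x grid token map and a patch x patch grouping."""
    if grid % patch != 0:
        raise ValueError("patch must divide grid")
    return (grid // patch) ** 2
```

Under these assumed settings, a 16x16 token map is seen by the model as 16 patch tokens in the first stage, 64 in the second, and the full 256 tokens in the final stage, which is where the shorter input sequences would save compute.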
Findings
Reduces training cost to 0.6x of previous methods
Improves FID score by up to 1.0 on ImageNet 256x256
Retains original autoregressive architecture without extra parameters
Abstract
Autoregressive models, built on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks. Pioneering works introduce NTP to autoregressive visual generation tasks. In this work, we rethink the NTP for autoregressive image generation and extend it to a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens with higher information density. By using patch tokens as a more compact input sequence, the autoregressive model is trained to predict the next patch, significantly reducing computational costs. To further exploit the natural hierarchical structure of image data, we propose a multi-scale coarse-to-fine patch grouping strategy. With this strategy, the training process begins with a large patch size and ends with vanilla NTP where the…
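The grouping step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tokens are embedding vectors laid out in raster order on a square grid and that aggregation is a simple average over each patch. The function name `group_tokens_into_patches` and the choice of averaging are assumptions made for the example.

```python
from typing import List

Vector = List[float]


def group_tokens_into_patches(tokens: List[Vector], grid: int, patch: int) -> List[Vector]:
    """Average every patch x patch block of a grid x grid, raster-ordered token map.

    Returns the patch tokens in raster order over the coarser
    (grid // patch) x (grid // patch) map.
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid * grid tokens")
    if grid % patch != 0:
        raise ValueError("patch must divide grid")
    dim = len(tokens[0])
    patches: List[Vector] = []
    for top in range(0, grid, patch):
        for left in range(0, grid, patch):
            acc = [0.0] * dim
            # Sum the embeddings of all tokens that fall inside this patch.
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    tok = tokens[r * grid + c]
                    for d in range(dim):
                        acc[d] += tok[d]
            # Averaging is an assumed aggregation choice for this sketch.
            patches.append([v / (patch * patch) for v in acc])
    return patches


# Usage: a 4x4 map of 1-dimensional "embeddings" grouped into 2x2 patches.
toy_tokens = [[float(i)] for i in range(16)]
toy_patches = group_tokens_into_patches(toy_tokens, grid=4, patch=2)
```

With a patch size of 2 the input sequence shrinks by a factor of 4, and with a patch size of 1 the function returns the original tokens unchanged, matching the vanilla NTP end point of the coarse-to-fine strategy.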
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
