Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
Sucheng Ren, Qihang Yu, Ju He, Xiaohui Shen, Alan Yuille, Liang-Chieh Chen

TL;DR
This paper introduces xAR, a flexible autoregressive framework for visual generation that extends token prediction to entities, improving modeling granularity and reducing exposure bias, resulting in faster and more accurate image synthesis.
Contribution
xAR generalizes the notion of tokens to entities, reformulates classification as continuous regression, and employs noisy context training to mitigate exposure bias in visual autoregressive models.
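To make the noisy context idea concrete, here is a minimal sketch of how training inputs could be built so that the model conditions on corrupted rather than clean previous entities. The summary above does not specify the corruption, so this sketch assumes a linear interpolation between Gaussian noise and the clean entity with a velocity-style regression target. The function name make_noisy_context and all shapes are hypothetical, not taken from the paper's code.

```python
import numpy as np

def make_noisy_context(entities, rng):
    """Corrupt each entity with its own noise level (hypothetical helper).

    entities: (n, d) array of clean continuous entities in prediction order.
    Returns the noisy entities, a continuous regression target, and the
    per-entity noise levels t in [0, 1] (t = 1 means fully clean).
    """
    n, d = entities.shape
    t = rng.uniform(0.0, 1.0, size=(n, 1))    # one noise level per entity
    noise = rng.standard_normal((n, d))       # Gaussian noise, same shape
    noisy = (1.0 - t) * noise + t * entities  # assumed linear interpolation
    target = entities - noise                 # assumed velocity-style target
    return noisy, target, t

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))               # 4 toy entities of dimension 8
noisy, target, t = make_noisy_context(x, rng)
```

Under these assumptions, a model predicting entity i would be conditioned on noisy[:i] instead of x[:i] and trained with a regression loss against target[i], so it sees imperfect context during training in the same way it does at inference.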
Findings
xAR-B (172M parameters) outperforms the larger DiT-XL and SiT-XL (675M parameters) on ImageNet-256.
xAR-H achieves a new state-of-the-art FID of 1.24 on ImageNet-256.
xAR-B runs inference 20 times faster than the DiT-XL and SiT-XL models it outperforms.
Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as continuous…
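The entity definitions in the abstract can be illustrated with a small NumPy sketch. It shows one plausible way to group a grid of patch tokens into cells (neighboring patches) and subsamples (strided, non-local patches). The helper names group_cells and group_subsamples, the raster ordering, and the toy 16x16 grid are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def group_cells(tokens, k):
    """Group an (H, W, D) grid of patch tokens into k x k cells of neighbours.

    Returns an array of shape (H//k * W//k, k*k, D), cells in raster order.
    """
    h, w, d = tokens.shape
    assert h % k == 0 and w % k == 0, "grid must divide evenly into cells"
    cells = tokens.reshape(h // k, k, w // k, k, d)
    cells = cells.transpose(0, 2, 1, 3, 4)    # (cell_row, cell_col, i, j, D)
    return cells.reshape((h // k) * (w // k), k * k, d)

def group_subsamples(tokens, s):
    """Group the grid into s*s subsamples, each taking every s-th patch."""
    h, w, d = tokens.shape
    assert h % s == 0 and w % s == 0, "grid must divide evenly by the stride"
    return np.stack([tokens[i::s, j::s].reshape(-1, d)
                     for i in range(s) for j in range(s)])

# Toy grid: 16 x 16 patches, one channel holding each patch's raster index.
grid = np.arange(16 * 16).reshape(16, 16, 1)
cells = group_cells(grid, 8)       # 4 cells of 64 neighbouring patches each
subs = group_subsamples(grid, 2)   # 4 subsamples of 64 distant patches each
```

In both cases the autoregressive sequence has 4 steps instead of 256. The difference is that each cell covers one contiguous 8x8 block of the grid, while each subsample spans the whole grid at stride 2.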
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Balanced Selection
