From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation
Cheng Cheng, Lin Song, Di An, Yicheng Xiao, Xuchong Zhang, Hongbin Sun, Ying Shan

TL;DR
This paper introduces TensorAR, a novel autoregressive image generation method that predicts overlapping image tensors to enable iterative refinement, significantly enhancing generation quality over traditional AR models.
Contribution
TensorAR reformulates AR image generation from token prediction to tensor prediction, enabling iterative refinement and improving quality without modifying the architecture of existing AR models.
Findings
TensorAR improves image generation quality across multiple AR models (LlamaGEN, Open-MAGVIT2, and RAR).
The discrete tensor noising scheme effectively prevents information leakage.
TensorAR is compatible with existing AR models as a plug-and-play module.
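To make the noising finding concrete, here is a minimal sketch of what perturbing discrete tokens with codebook-indexed noise could look like. It assumes that noise means replacing a token index with a uniformly random codebook index with probability `noise_prob`; the function name `noise_tokens`, the codebook size of 1024, and the uniform sampling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def noise_tokens(tokens, codebook_size, noise_prob, rng):
    """Toy discrete noising (assumed scheme, not the paper's exact one).

    Each token index is replaced, with probability `noise_prob`, by an
    index drawn uniformly from the codebook.
    """
    tokens = np.asarray(tokens)
    # Boolean mask of positions selected for perturbation.
    mask = rng.random(tokens.shape) < noise_prob
    # Candidate replacement indices, one per position.
    random_ids = rng.integers(0, codebook_size, size=tokens.shape)
    return np.where(mask, random_ids, tokens), mask

rng = np.random.default_rng(0)
clean = np.array([5, 17, 42, 8, 99, 3])  # hypothetical token indices
noisy, mask = noise_tokens(clean, codebook_size=1024, noise_prob=0.5, rng=rng)
```

Positions where `mask` is false keep their original index, so a model trained on `noisy` inputs cannot simply copy every input token to its output.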
Abstract
Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that…
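The sliding-window idea in the abstract can be illustrated with a toy decoding loop. This is a sketch under stated assumptions, not the authors' implementation: the window slides by one position per step, each step predicts `window` tokens, and later predictions overwrite earlier drafts of the same position. The names `sliding_tensor_decode` and `toy_predictor` are hypothetical, and the predictor is a stand-in whose drafts become exact only when a position sits at the start of the window.

```python
import numpy as np

def sliding_tensor_decode(predict_window, seq_len, window):
    """Toy next-tensor decoding with overlapping windows.

    At step t the predictor emits tokens for positions t .. t+window-1
    (clipped to the sequence end). Overlapping positions are overwritten,
    so the value kept for position i comes from step i.
    """
    tokens = np.full(seq_len, -1, dtype=np.int64)   # -1 marks "not yet generated"
    revisions = np.zeros(seq_len, dtype=np.int64)   # how often each position was written
    for t in range(seq_len):
        end = min(t + window, seq_len)
        tokens[t:end] = predict_window(tokens.copy(), t, end - t)
        revisions[t:end] += 1
    return tokens, revisions

# Hypothetical target sequence used only to build a stand-in predictor.
target = np.arange(10) * 3 % 16

def toy_predictor(context, start, n):
    # `context` is ignored in this toy; a real model would condition on it.
    positions = np.arange(start, start + n)
    # Drafts further ahead of the window start are deliberately less accurate.
    return target[positions] + (positions - start)

tokens, revisions = sliding_tensor_decode(toy_predictor, seq_len=10, window=3)
```

In this toy, every position past the first two is written three times, and the final `tokens` match `target` by construction, because the last write to each position is the one with zero offset.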
Peer Reviews
Decision: ICLR 2026 Poster
(1) The overall paradigm is very interesting and naturally combines traditional AR image generation with the diffusion model. (2) The method is effective. Extensive experiments on Open-MAGVIT2 and LlamaGEN demonstrate its effectiveness.
1. The illustration in Fig. 1 is not intuitive. 2. The LlamaGEN baseline is a little weak. As far as I know, SimpleAR (https://github.com/wdrink/SimpleAR) is a stronger baseline. The experiments could be improved by using SoTA baselines.
1. This paper is well-written and provides a thorough and accurate explanation of the main methods. 2. The core idea of "next-tensor prediction" is simple yet powerful. It provides a new approach to bridge the gap between autoregressive generation and refinement-based paradigms. 3. TensorAR requires no modification to the base AR architecture or training objective, making it highly practical and easy to integrate with existing models. 4. The authors conducted experimental verifications on various
1. Lack of a stronger explanation or demonstration of refinement: The paper claims that tokens are "refined" over multiple steps, but it does not provide direct evidence that the model actually revises its predictions meaningfully during refinement steps, rather than simply generating forward. TensorAR improves the T2I generation effect as shown in Figure 7. The baseline fails on "a person stands with another man" but TensorAR succeeds—is this due to refinement of earlier tokens, or improvement
* The paper is well-written. * The proposed method is simple and easy to implement. It can be plugged into most of the existing image AR models. * The paper demonstrates the effectiveness of the method across multiple image AR models and multiple tasks (class-conditioned generation and text-to-image generation).
* The paper contains many typos, though they are minor and should be easy to fix.
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Multimodal Machine Learning Applications
Methods: Diffusion
