From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation

Cheng Cheng; Lin Song; Di An; Yicheng Xiao; Xuchong Zhang; Hongbin Sun; Ying Shan

arXiv:2505.16324·cs.CV·January 29, 2026

Open Access 3 Reviews

TL;DR

This paper introduces TensorAR, a novel autoregressive image generation method that predicts overlapping image tensors to enable iterative refinement, significantly enhancing generation quality over traditional AR models.

Contribution

TensorAR reformulates AR image generation from token to tensor prediction, allowing iterative refinement and improving quality without retraining existing models.
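To make the token-to-tensor shift concrete, here is a minimal sketch of sliding-window generation. It assumes a window of `k` consecutive tokens that advances one position per step, so each position is predicted up to `k` times and the latest prediction overwrites earlier ones. The names `sliding_windows`, `generate_with_refinement`, and `predict_window` are hypothetical and do not come from the paper; the stride, the window layout, and the overwrite rule are assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Tuple


def sliding_windows(tokens: List[int], k: int) -> List[List[int]]:
    """Return all overlapping length-k windows with stride 1."""
    return [tokens[i:i + k] for i in range(len(tokens) - k + 1)]


def generate_with_refinement(
    predict_window: Callable[[List[Optional[int]], int, int], List[int]],
    n: int,
    k: int,
) -> Tuple[List[Optional[int]], List[List[int]]]:
    """Generate n tokens by predicting a k-token window at every step.

    predict_window(canvas, start, k) is a stand-in for the model: it returns
    k token ids for positions start .. start + k - 1 (assumed interface).
    """
    canvas: List[Optional[int]] = [None] * n      # current best guess per position
    history: List[List[int]] = [[] for _ in range(n)]  # every guess per position
    for start in range(n):
        window = predict_window(canvas, start, k)
        for offset, token in enumerate(window):
            pos = start + offset
            if pos < n:
                canvas[pos] = token        # a later window overwrites (refines) earlier guesses
                history[pos].append(token)
    return canvas, history


# Toy predictor: encodes which step produced each guess, for inspection only.
toy = lambda canvas, start, k: [100 * start + o for o in range(k)]
final_tokens, guesses = generate_with_refinement(toy, n=4, k=2)
```

In this toy run, `guesses` records that interior positions receive two predictions each when `k=2`, which is the overlap that makes revision possible in the first place.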

Findings

01

TensorAR improves image generation quality across multiple datasets.

02

The discrete tensor noising scheme effectively prevents information leakage.

03

TensorAR is compatible with existing AR models as a plug-and-play module.

Abstract

Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that…
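The discrete tensor noising idea can be sketched as follows. This is an assumption-laden illustration, not the paper's scheme: it replaces each input token id with a uniformly sampled codebook index with probability `noise_prob`, on the reading that overlapping input and target windows would otherwise let the model copy ground-truth tokens during training. The function name `noise_tokens`, the uniform sampling, and the per-token independent replacement are all hypothetical choices.

```python
import random
from typing import List


def noise_tokens(
    tokens: List[int],
    codebook_size: int,
    noise_prob: float,
    rng: random.Random,
) -> List[int]:
    """Replace each token id with a random codebook index with probability noise_prob.

    Assumed scheme for illustration: noise stays in the discrete index space,
    so a perturbed token is still a valid codebook entry.
    """
    noised = []
    for token in tokens:
        if rng.random() < noise_prob:
            noised.append(rng.randrange(codebook_size))  # uniform index in [0, codebook_size)
        else:
            noised.append(token)                          # keep the original token
    return noised


rng = random.Random(0)
clean = [5, 17, 3, 42, 8]
untouched = noise_tokens(clean, codebook_size=64, noise_prob=0.0, rng=rng)
fully_noised = noise_tokens(clean, codebook_size=64, noise_prob=1.0, rng=rng)
```

With `noise_prob=0.0` the input passes through unchanged, and with `noise_prob=1.0` every position is resampled; a real schedule would presumably sit between these extremes.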

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

(1) The overall paradigm is very interesting: it naturally combines traditional AR image generation with diffusion-style refinement. (2) The method is effective. Extensive experiments on Open-MAGVIT2 and LlamaGEN demonstrate its effectiveness.

Weaknesses

1. The illustration in Fig. 1 is not intuitive. 2. The LlamaGEN baseline is a little weak. As far as I know, SimpleAR (https://github.com/wdrink/SimpleAR) is a stronger baseline. The experiments could be improved by using SoTA baselines.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper is well-written and provides a thorough and accurate explanation of the main methods. 2. The core idea of "next-tensor prediction" is simple yet powerful. It provides a new approach to bridge the gap between autoregressive generation and refinement-based paradigms. 3. TensorAR requires no modification to the base AR architecture or training objective, making it highly practical and easy to integrate with existing models. 4. The author conducted experimental verifications on various

Weaknesses

1. Lack of a stronger explanation or demonstration of refinement: The paper claims that tokens are "refined" over multiple steps, but it does not provide direct evidence that the model meaningfully revises its predictions during refinement steps rather than simply generating forward. TensorAR improves T2I generation quality, as shown in Figure 7. The baseline fails on "a person stands with another man" but TensorAR succeeds; is this due to refinement of earlier tokens, or improvement

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The paper is well-written. * The proposed method is simple and easy to implement. It can be plugged into most of the existing image AR models. * The paper demonstrates the effectiveness of the method across multiple image AR models and multiple tasks (class-conditioned generation and text-to-image generation).

Weaknesses

* The paper contains many typos, though they are minor and should be easy to fix.

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Multimodal Machine Learning Applications

Methods: Diffusion