Condition Errors Refinement in Autoregressive Image Generation with Diffusion Loss
Yucheng Zhou, Hao Li, Jianbing Shen

TL;DR
This paper provides a theoretical analysis and a novel condition refinement method for autoregressive image generation with diffusion loss, demonstrating improved stability and condition consistency over existing models.
Contribution
It presents a theoretical comparison between conditional diffusion models and autoregressive models trained with diffusion loss, and proposes a condition refinement approach based on Wasserstein Gradient Flow.
Findings
Autoregressive models with diffusion loss effectively mitigate condition errors.
The proposed OT-based refinement ensures convergence to the ideal condition distribution.
Experiments on ImageNet 256x256 show the method outperforms the compared diffusion and autoregressive models, with a reported FID of 1.31.
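A toy sketch can make the convergence claim concrete. It is not the paper's algorithm: it assumes 1D samples, where the optimal transport map between two equal-size sample sets is simply the sorted matching, and it takes explicit Euler steps of the particle flow for the functional F(mu) = 0.5 * W2^2(mu, nu). The names `cond`, `ideal`, `refine_step`, and the step size are hypothetical choices made for illustration.

```python
import numpy as np

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1D samples."""
    return float(np.mean((np.sort(x) - np.sort(y)) ** 2))

def refine_step(x, y, step=0.2):
    """One explicit Euler step moving each particle toward its OT match in y."""
    order = np.argsort(x)
    target = np.empty_like(x)
    target[order] = np.sort(y)  # i-th smallest x is matched to i-th smallest y
    return x + step * (target - x)

rng = np.random.default_rng(0)
cond = rng.normal(loc=3.0, scale=2.0, size=512)   # hypothetical erroneous conditions
ideal = rng.normal(loc=0.0, scale=1.0, size=512)  # hypothetical ideal conditions

history = [w2_squared_1d(cond, ideal)]
for _ in range(20):
    cond = refine_step(cond, ideal)
    history.append(w2_squared_1d(cond, ideal))

print(history[0], history[-1])
```

In this simplified setting each step shrinks the squared distance by a factor of (1 - step)^2, so the distance to the target distribution decreases monotonically. The paper's guarantee concerns its own refinement operator, which this sketch does not reproduce.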
Abstract
Recent studies have explored autoregressive models for image generation, with promising results, and have combined diffusion models with autoregressive frameworks to optimize image generation via diffusion losses. In this study, we present a theoretical analysis of diffusion and autoregressive models with diffusion loss, highlighting the latter's advantages. We present a theoretical comparison of conditional diffusion and autoregressive diffusion with diffusion loss, demonstrating that patch denoising optimization in autoregressive models effectively mitigates condition errors and leads to a stable condition distribution. Our analysis also reveals that autoregressive condition generation refines the condition, causing the condition error influence to decay exponentially. In addition, we introduce a novel condition refinement approach based on Optimal Transport (OT) theory to address…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper's primary strength lies in its rigorous theoretical contributions. The formalization of the condition refinement process as a Wasserstein Gradient Flow is both elegant and novel, providing a principled guarantee of convergence that is often missing in heuristic-based approaches. The detailed lemmas and theorems build a convincing mathematical argument. 2. The paper is logically well-organized. It seamlessly transitions from a comparative analysis of diffusion models, to the definiti
1. Limited Experimental Scale: As the authors acknowledge in Appendix B, the experiments are confined to the 256x256 resolution on ImageNet. While this is a standard benchmark, state-of-the-art generative modeling research is increasingly focused on higher resolutions and larger models. The absence of such experiments may leave questions about the method's scalability and generalizability. 2. Readability and Accessibility: The theoretical sections are dense and assume significant familiarity wit
1. The theoretical framework for autoregressive image modeling with diffusion loss is both sound and novel. The theory is rigorous and clearly connects diffusion loss to autoregressive conditional modeling. The mathematical exposition is clear and technically solid. 2. The proposed ideas of autoregressive patch-wise denoising and OT-based condition refinement are conceptually well-motivated. 3. The paper provides a rigorous theoretical analysis demonstrating that the patch-wise denoising optim
1. The empirical validation does not fully match the strength of the theory. The main theory predicts (i) conditional score norm decays exponentially as AR iterations progress and (ii) OT refinement decreases condition inconsistency (Sinkhorn divergence) monotonically. The paper lacks direct empirical plots that verify these claims. 2. Lack of important experiments at higher resolutions, such as the ImageNet 512 × 512 experiment. 3. The comparison against stronger and more recent baselines (af
(1) The theoretical part is quite impressive. In particular, Theorem 2 gives a deep understanding of how the conditional influence (the gradient norm) decays exponentially, which brings new insight into the stability of AR generative models. The combination of OT and WGF for refining the conditional distribution looks creative and convincing. (2) The experimental results are strong. On ImageNet 256x256, the FID score reaches 1.31, which is very competitive with existing works.
(1) Algorithm 1 seems to describe a nested loop structure. I am a bit worried that the computation cost could be large, maybe even K times T slower than the standard AR model. Some clarification or a runtime comparison would be helpful. (2) It is a bit unclear what “Baseline (CDM)” (FID 3.26) and “Baseline” (FID 2.02) exactly mean. Does “Baseline” refer to the AR model without OT refinement? Since the paper’s best FID (1.31) is quite good, some ablation study would help to show how much improvement
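One review above names the Sinkhorn divergence as the measure of condition inconsistency. For readers unfamiliar with it, the following is a minimal NumPy sketch of the standard debiased definition, S(x, y) = OT_eps(x, y) - 0.5 * OT_eps(x, x) - 0.5 * OT_eps(y, y), computed with log-domain Sinkhorn iterations. It is an assumption-laden illustration, not the paper's implementation: the squared Euclidean cost, uniform sample weights, `eps`, and the iteration count are all choices made here.

```python
import numpy as np

def _logsumexp(m, axis):
    """Numerically stable log-sum-exp along one axis."""
    top = m.max(axis=axis, keepdims=True)
    out = top + np.log(np.exp(m - top).sum(axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def entropic_ot(x, y, eps=0.5, iters=200):
    """Entropic OT value between uniform point clouds (dual objective)."""
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)  # squared Euclidean cost
    log_a, log_b = -np.log(len(x)), -np.log(len(y))        # uniform log-weights
    f, g = np.zeros(len(x)), np.zeros(len(y))
    for _ in range(iters):
        f = -eps * _logsumexp((g[None, :] - cost) / eps + log_b, axis=1)
        g = -eps * _logsumexp((f[:, None] - cost) / eps + log_a, axis=0)
    return f.mean() + g.mean()

def sinkhorn_divergence(x, y, eps=0.5, iters=200):
    """Debiased divergence: subtract the two self-transport terms."""
    return (entropic_ot(x, y, eps, iters)
            - 0.5 * entropic_ot(x, x, eps, iters)
            - 0.5 * entropic_ot(y, y, eps, iters))

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 2))
same = sinkhorn_divergence(x, x)
near = sinkhorn_divergence(x, x + 1.0)
far = sinkhorn_divergence(x, x + 3.0)
print(same, near, far)
```

Tracking a quantity like this across refinement steps is one way the monotone decrease requested by the reviewer could be plotted, assuming access to samples from both the refined and the ideal condition distributions.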
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Medical Image Segmentation Techniques · Computer Graphics and Visualization Techniques
