Rethinking Training Dynamics in Scale-wise Autoregressive Generation
Gengze Zhou, Chongjian Ge, Hao Tan, Feng Liu, Yicong Hong

TL;DR
This paper identifies key training challenges in scale-wise autoregressive models and proposes Self-Autoregressive Refinement (SAR), a method that improves generation quality by aligning training and inference through lightweight rollouts and contrastive supervision.
Contribution
The paper introduces SAR, a novel post-training technique combining Stagger-Scale Rollout and Contrastive Student-Forcing Loss to enhance autoregressive model training and generation quality.
Findings
SAR reduces FID by 5.2% on ImageNet 256×256 within 10 epochs of post-training
SAR improves generation quality with minimal computational overhead
The method is scalable and effective as a post-training enhancement
Abstract
Recent advances in autoregressive (AR) generative models have produced increasingly powerful systems for media synthesis. Among them, next-scale prediction has emerged as a popular paradigm, where models generate images in a coarse-to-fine manner. However, scale-wise AR models suffer from exposure bias, which undermines generation quality. We identify two primary causes of this issue: (1) train-test mismatch, where the model must rely on its own imperfect predictions during inference, and (2) imbalance in scale-wise learning difficulty, where certain scales exhibit disproportionately higher optimization complexity. Through a comprehensive analysis of training dynamics, we propose Self-Autoregressive Refinement (SAR) to address these limitations. SAR introduces a Stagger-Scale Rollout (SSR) mechanism that performs lightweight autoregressive rollouts to expose the model to its own…
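The train-test mismatch described in the abstract can be illustrated with a small toy sketch. This is not the authors' SAR code, and it does not implement SSR or the contrastive loss. It only contrasts teacher forcing (conditioning on the ground-truth coarser scale) with a self rollout (conditioning on the model's own previous prediction). The 1-D "image", the helper names, and the fixed per-scale error `bias` are all hypothetical, chosen so the effect is easy to see.

```python
# Toy illustration of exposure bias in next-scale prediction.
# All names and the fixed `bias` error are hypothetical; this is NOT the SAR method.

def downsample(signal, size):
    """Average-pool a 1-D signal to `size` entries (len(signal) must be divisible by size)."""
    step = len(signal) // size
    return [sum(signal[i * step:(i + 1) * step]) / step for i in range(size)]

def upsample(signal, size):
    """Nearest-neighbour upsample a 1-D signal to `size` entries."""
    factor = size // len(signal)
    return [v for v in signal for _ in range(factor)]

def mean_abs_error(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def run(signal, sizes, bias=0.1):
    """Return per-scale errors under teacher forcing and under self rollout."""
    targets = [downsample(signal, s) for s in sizes]
    teacher_err, rollout_err = [], []
    rollout_prev = targets[0]  # both modes start from the true coarsest scale
    for k in range(1, len(sizes)):
        size = sizes[k]
        gt_up = upsample(targets[k - 1], size)
        # Assumption: the toy model has learned the fine detail perfectly
        # and only adds a constant error `bias` at every scale.
        detail = [t - u for t, u in zip(targets[k], gt_up)]
        # Teacher forcing: condition on the ground-truth previous scale.
        tf_pred = [u + d + bias for u, d in zip(gt_up, detail)]
        # Self rollout: condition on the model's own previous prediction.
        own_up = upsample(rollout_prev, size)
        ro_pred = [u + d + bias for u, d in zip(own_up, detail)]
        teacher_err.append(mean_abs_error(tf_pred, targets[k]))
        rollout_err.append(mean_abs_error(ro_pred, targets[k]))
        rollout_prev = ro_pred
    return teacher_err, rollout_err

if __name__ == "__main__":
    tf, ro = run([float(i) for i in range(8)], sizes=[1, 2, 4, 8])
    print("teacher-forced error per scale:", [round(e, 6) for e in tf])
    print("self-rollout error per scale:  ", [round(e, 6) for e in ro])
```

In this toy setting the teacher-forced error stays at roughly the per-scale bias, while the self-rollout error grows with each finer scale because earlier mistakes are carried forward. A model trained only with teacher forcing never sees that accumulated error, which is the gap that rollout-based training aims to close.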
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Face recognition and analysis
