STAR: Scale-wise Text-conditioned AutoRegressive image generation
Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, Yi Jin

TL;DR
STAR is a novel auto-regressive text-to-image model that enables high-resolution image generation up to 1024x1024 by introducing scale-wise generation, a pre-trained text encoder, and a stable sampling method, surpassing previous models in quality.
Contribution
The paper presents a scale-wise auto-regressive framework with new design strategies, including a pre-trained text encoder, normalized 2D Rotary Positional Encoding, and a stable sampling method, enabling high-resolution and stable image synthesis.
Findings
STAR achieves high-fidelity 1024x1024 images in 2.21 seconds.
It surpasses existing diffusion and auto-regressive models in quality and consistency.
The method stabilizes high-resolution generation through novel sampling techniques.
Abstract
We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is constrained to class-conditioned synthesis for images up to 256×256, STAR enables text-driven image generation up to 1024×1024 through three key designs. First, we introduce a pre-trained text encoder to extract and adopt representations for textual constraints, enhancing details and generalizability. Second, given the inherent structural correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version, ensuring consistent interpretation of relative positions across token maps and stabilizing the training process. Third, we observe that simultaneously sampling all tokens within a single scale can disrupt inter-token relationships, leading to structural instability, particularly in high-resolution…
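The second design, a normalized 2D RoPE, can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the `ref_size` parameter, the choice to rescale token indices onto a shared reference grid, and the split of channels into a row half and a column half. It is not the authors' implementation; it only shows why normalizing positions lets token maps of different sizes agree on where a token sits.

```python
import numpy as np

def normalized_positions(size, ref_size):
    # Hypothetical normalization: rescale indices 0..size-1 onto a shared
    # reference grid [0, ref_size), so every scale spans the same range.
    return np.arange(size, dtype=np.float64) * (ref_size / size)

def rope_1d(x, pos, base=10000.0):
    # Standard rotary encoding along the second-to-last axis.
    # x: (..., n, d) with d even; pos: (n,) positions (may be fractional).
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2, dtype=np.float64) / d)  # (d/2,)
    angles = pos[:, None] * freqs[None, :]                       # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    # Rotate each (even, odd) channel pair by its position-dependent angle.
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def normalized_rope_2d(x, ref_size=64):
    # x: (h, w, d) float token map, d divisible by 4.
    # First half of the channels encodes the row, second half the column.
    h, w, d = x.shape
    half = d // 2
    rows = normalized_positions(h, ref_size)
    cols = normalized_positions(w, ref_size)
    # Swap axes so the row index is second-to-last, rotate, swap back.
    x_row = rope_1d(x[..., :half].transpose(1, 0, 2), rows).transpose(1, 0, 2)
    x_col = rope_1d(x[..., half:], cols)
    return np.concatenate([x_row, x_col], axis=-1)

# Usage: the same feature vector placed at the same relative location
# in a 4x4 map and an 8x8 map receives the same encoding.
coarse = normalized_rope_2d(np.ones((4, 4, 8)))
fine = normalized_rope_2d(np.ones((8, 8, 8)))
same_spot = np.allclose(coarse[1, 1], fine[2, 2])
```

Under these assumptions, index 1 of a 4-wide map and index 2 of an 8-wide map both land at position 16 on the reference grid, so attention sees consistent relative offsets across scales.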
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques
Methods: Sparse Evolutionary Training · Diffusion
