STAR: Scale-wise Text-conditioned AutoRegressive image generation

Xiaoxiao Ma; Mohan Zhou; Tao Liang; Yalong Bai; Tiejun Zhao; Biye Li; Huaian Chen; Yi Jin

arXiv:2406.10797 · cs.CV · February 20, 2025

TL;DR

STAR is a novel auto-regressive text-to-image model that enables high-resolution image generation up to 1024x1024 by introducing scale-wise generation, a pre-trained text encoder, and a stable sampling method, surpassing previous models in quality.

Contribution

The paper presents a scale-wise auto-regressive framework with new design strategies, including a pre-trained text encoder, normalized 2D Rotary Positional Encoding, and a stable sampling method, enabling high-resolution and stable image synthesis.
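To make the scale-wise idea concrete, the sketch below counts how many tokens are produced per step when an image is generated coarse to fine, one token map per step. The side lengths `[1, 2, 4, 8, 16]` and the helper name `scale_wise_schedule` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def scale_wise_schedule(side_lengths):
    """List, for each scale, how many tokens are already generated (context)
    and how many new tokens are produced together in that step."""
    steps = []
    context = 0
    for side in side_lengths:
        new_tokens = side * side  # a square token map at this scale
        steps.append({"side": side, "context_tokens": context, "new_tokens": new_tokens})
        context += new_tokens     # later scales condition on all earlier ones
    return steps


# Hypothetical schedule of token-map side lengths (an assumption, not the paper's).
schedule = scale_wise_schedule([1, 2, 4, 8, 16])

forward_passes = len(schedule)                           # one pass per scale
total_tokens = sum(s["new_tokens"] for s in schedule)    # tokens across all scales
next_token_passes = schedule[-1]["new_tokens"]           # token-by-token AR on the final map

for step in schedule:
    print(step)
print(forward_passes, total_tokens, next_token_passes)
```

Under this assumed schedule the scale-wise loop needs 5 forward passes, while a token-by-token model would need 256 passes for the final 16 × 16 map alone. This kind of saving is what makes fast high-resolution generation plausible.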

Findings

01

STAR achieves high-fidelity 1024x1024 images in 2.21 seconds.

02

It surpasses existing diffusion and auto-regressive models in quality and consistency.

03

The method stabilizes high-resolution generation through novel sampling techniques.

Abstract

We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is constrained to class-conditioned synthesis for images up to 256 $\times$ 256, STAR enables text-driven image generation up to 1024 $\times$ 1024 through three key designs. First, we introduce a pre-trained text encoder to extract and adopt representations for textual constraints, enhancing details and generalizability. Second, given the inherent structural correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version, ensuring consistent interpretation of relative positions across token maps and stabilizing the training process. Third, we observe that simultaneously sampling all tokens within a single scale can disrupt inter-token relationships, leading to structural instability, particularly in high-resolution…
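One way to read the "normalized" 2D RoPE described above is that token coordinates at every scale are mapped onto a shared grid before the rotation is applied, so the same relative location gets the same position regardless of the token map's resolution. The following is a minimal sketch under that assumption and is not the authors' implementation. The reference grid size `ref`, the half-and-half channel split between rows and columns, and all function names are illustrative choices.

```python
import math


def normalized_coords(height, width, ref=16.0):
    """Map each (row, col) of a height x width token map onto a shared [0, ref) grid.
    The value of ref is an assumption made for this sketch."""
    return [[(r * ref / height, c * ref / width) for c in range(width)]
            for r in range(height)]


def rotate_pairs(vec, pos, base=10000.0):
    """1D RoPE: rotate consecutive (even, odd) channel pairs by pos * frequency."""
    dim = len(vec)
    out = []
    for k in range(dim // 2):
        theta = pos * base ** (-2.0 * k / dim)
        a, b = vec[2 * k], vec[2 * k + 1]
        out += [a * math.cos(theta) - b * math.sin(theta),
                a * math.sin(theta) + b * math.cos(theta)]
    return out


def rope_2d(vec, y, x):
    """2D RoPE: the first half of the channels encodes the row, the second half the column."""
    half = len(vec) // 2
    return rotate_pairs(vec[:half], y) + rotate_pairs(vec[half:], x)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Arbitrary example query and key vectors (8 channels each).
q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.9, 0.2]
k = [1.0, 0.1, -0.6, 0.8, -0.3, 0.5, 0.4, -1.0]

coarse = normalized_coords(2, 2)  # 2 x 2 token map
fine = normalized_coords(4, 4)    # 4 x 4 token map

# The centre of the coarse map and the centre of the fine map share coordinates.
same_place = coarse[1][1] == fine[2][2]

# Two coarse/fine token pairs with the same normalized offset give the same score.
score_a = dot(rope_2d(q, *coarse[0][0]), rope_2d(k, *fine[1][1]))
score_b = dot(rope_2d(q, *coarse[1][1]), rope_2d(k, *fine[3][3]))
print(same_place, abs(score_a - score_b) < 1e-9)
```

Because each rotation depends only on position, the score between a query and a key depends only on the difference of their normalized coordinates. Under this assumption, that is what lets relative positions be interpreted consistently across token maps of different sizes.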

Code & Models

Repositories

krennic999/STAR
none · Official

Datasets

slz1/wxy_var
dataset · 1.1k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Image Retrieval and Classification Techniques

Methods: Sparse Evolutionary Training · Diffusion