ControlAR: Controllable Image Generation with Autoregressive Models

Zongming Li; Tianheng Cheng; Shoufa Chen; Peize Sun; Haocheng Shen; Longjin Ran; Xiaoxin Chen; Wenyu Liu; Xinggang Wang

arXiv:2410.02705·cs.CV·March 11, 2025




TL;DR

ControlAR is an efficient autoregressive framework for controllable image generation. It integrates spatial controls through a lightweight control encoder and conditional decoding, and it outperforms previous state-of-the-art controllable diffusion models in the paper's experiments.

Contribution

The paper proposes a new method for controllable image generation with AR models, using control encoding and conditional decoding to improve control and efficiency.

Findings

01

ControlAR achieves superior controllability over diverse spatial inputs.

02

It enables arbitrary-resolution image generation with high quality.

03

It outperforms state-of-the-art controllable diffusion models in experiments.

Abstract

Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control…
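The contrast between prefilling and per-position conditioning can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names (`add`, `prefill_length`, `fused_inputs`), the use of plain lists as feature vectors, and the choice of element-wise addition as the fusion are all assumptions made for illustration. It shows that prefilling lengthens the sequence the model must process, while per-position fusion keeps the sequence at the image-token length by combining the feature that predicts token i+1 with the control feature for position i+1.

```python
def add(u, v):
    """Element-wise sum of two equal-length feature vectors."""
    return [a + b for a, b in zip(u, v)]


def prefill_length(num_control_tokens, num_image_tokens):
    """Prefilling prepends control tokens, so the model sees both sets."""
    return num_control_tokens + num_image_tokens


def fused_inputs(start_feat, image_feats, control_feats):
    """Per-position fusion (assumed here to be addition).

    image_feats[k] stands for the feature of image token q_{k+1} and
    control_feats[k] for the control feature C_{k+1}. The decoder input
    that predicts q_{k+1} is the previous token's feature (or the start
    feature when k = 0) plus C_{k+1}.
    """
    if len(image_feats) != len(control_feats):
        raise ValueError("expected one control feature per image token")
    inputs = [start_feat] + image_feats[:-1]
    return [add(x, c) for x, c in zip(inputs, control_feats)]


# Toy data: three image tokens and three control positions, 2-d features.
image_feats = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
control_feats = [[0.0, 10.0], [0.0, 20.0], [0.0, 30.0]]
fused = fused_inputs([0.0, 0.0], image_feats, control_feats)

print(len(fused))            # 3: sequence length is unchanged by fusion
print(prefill_length(3, 3))  # 6: prefilling doubles it in this toy case
print(fused[1])              # [1.0, 20.0]: feature of q_1 plus C_2
```

In this toy setting the control sequence is assumed to be as long as the image sequence, so prefilling doubles the number of positions the model attends over; the actual ratio depends on how the control image is tokenized.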

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed method enables fine-grained control in autoregressive image generation by using a control encoder and conditional decoding, achieving high image quality with low additional training cost.
2. This method provides effective resolution control, allowing AR models to overcome the limitations of fixed-resolution generation.

Weaknesses

1. Performance comparisons with recent models such as Lumina-mGPT and CM3leon (or Anole), for example on segmentation-to-image tasks, would strengthen this paper. Additionally, an analysis or discussion of the potential for integration with these models would be beneficial.
2. Spatial conditions like segmentation maps and Canny edges impose strong constraints on structure diversity in generated outputs. Exploring whether some structural diversity can be incorporated within the conditional decoding

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The paper has several strengths that make it compelling. The work has a very simple and elegant formulation. It demonstrates well how the method improves on the other obvious approach, conditional prefilling. Also, very few other works tackle this problem, and this is, to the best of my knowledge, a novel approach for conditioning AR models. The authors also present class-to-image and T2I evaluations and show strong results on several datasets. Also, this direction of research discove

Weaknesses

I don't think I have found weaknesses in the work that should lead to rejection. I am curious about what would happen if certain experiments were run, and these are not very extensive. Some examples:
1. Which layers are ideal to introduce the new control layers on? Right now we have a coarse study of this but it could go deeper, although it's a lot of work that might not be super useful in the end.
2. Some output images shown in the paper show some color saturation or excess contrast - is this a

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Important topic**: The image-controlled generation task is of great general interest, and it is important to the recently revived research trend of AR-based image generation models.
- **Simple and reasonable**: This method is a reasonable exploration of controlled generation in image AR models, using simple token feature addition in decoding.
- **Good ablations** on training strategy, fusion strategy (cross-attention or addition, addition layers), and control encoders.
- **Good visualization r

Weaknesses

I will put all questions in this section. Note that they're not all weaknesses.
1. **Equation (4) is not written properly**. Here $q_i$ represents a discrete image token, so $q_i \in [V]$, where $V$ is the vocabulary or codebook size. The sum of a discrete token and a continuous feature, $q_i + C_{i+1}$, in Equation (4) is therefore not well defined. Besides, ControlAR adds the control feature to the token feature in three intermediate layers, not in the input embedding.
2. **About th
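The first point can be made concrete with a hedged restatement of the fusion. The notation below is assumed for illustration and is not taken from the paper: $h_i^{(\ell)} \in \mathbb{R}^{d}$ denotes the continuous hidden feature of token $q_i$ at layer $\ell$, $C_{i+1} \in \mathbb{R}^{d}$ the control feature for the next position, and $\mathcal{L}$ the set of intermediate layers where control is injected (three layers, according to the reviewer). Under these assumptions both operands are continuous vectors of the same dimension, so the addition is well defined:

```latex
\tilde{h}_i^{(\ell)} = h_i^{(\ell)} + C_{i+1},
\qquad \ell \in \mathcal{L},
\qquad h_i^{(\ell)},\; C_{i+1} \in \mathbb{R}^{d}
```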

Code & Models

Repositories

hustvl/controlar
PyTorch · Official

Models

🤗
wondervictor/ControlAR
model · ♡ 3

Datasets

slz1/wxy-ControlAR
dataset · 98 downloads

Videos

ControlAR: Controllable Image Generation with Autoregressive Models· slideslive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Medical Image Segmentation Techniques · Advanced Image and Video Retrieval Techniques

Methods: Diffusion