ControlAR: Controllable Image Generation with Autoregressive Models
Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

TL;DR
ControlAR is an efficient autoregressive framework for controllable image generation. It integrates spatial controls through a lightweight control encoder and conditional decoding, and is reported to outperform state-of-the-art controllable diffusion models.
Contribution
The paper proposes a new method for controllable image generation with AR models, using control encoding and conditional decoding to improve control and efficiency.
Findings
ControlAR achieves superior controllability over diverse spatial inputs.
It enables arbitrary-resolution image generation with high quality.
It outperforms state-of-the-art controllable diffusion models in the reported experiments.
Abstract
Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in Large Language Models, is to tokenize control images into tokens and prefill them into the autoregressive model before decoding image tokens, it still falls short in generation quality compared to ControlNet and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control…
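The contrast between prefilling and conditional decoding can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes, the `embed` table, the zero start embedding, and both function names are hypothetical, and the control features are random stand-ins for the output of a control encoder. Following the notation quoted in the reviews, the continuous feature of token $q_i$ receives the control feature $C_{i+1}$ of the position it is about to predict. For simplicity the addition is applied once to the input features, whereas the reviews note that ControlAR applies it in three intermediate layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: codebook size V, hidden width D, image tokens N.
V, D, N = 16, 8, 4
embed = rng.normal(size=(V, D))  # assumed shared token-embedding table


def prefill_sequence(control_tokens, image_tokens):
    """Prefilling baseline: control tokens are prepended to the image tokens,
    so the decoder processes a sequence of length 2N."""
    return np.concatenate([embed[control_tokens], embed[image_tokens]], axis=0)


def conditional_decoding_inputs(start_emb, image_tokens, control_feats):
    """Per-position fusion: row 0 is the start embedding, row i is the
    embedding of q_i. Row i predicts token i+1, so it gets C_{i+1} added."""
    inputs = np.concatenate(
        [start_emb[None, :], embed[image_tokens[:-1]]], axis=0
    )  # shape (N, D)
    # control_feats[j] holds C_{j+1}; the sequence length stays N.
    return inputs + control_feats


control_tokens = np.array([1, 2, 3, 4])
image_tokens = np.array([5, 6, 7, 8])
control_feats = rng.normal(size=(N, D))  # stand-in for control-encoder output
start_emb = np.zeros(D)                  # assumed start-of-image embedding

prefilled = prefill_sequence(control_tokens, image_tokens)
fused = conditional_decoding_inputs(start_emb, image_tokens, control_feats)
print(prefilled.shape)  # (8, 8)
print(fused.shape)      # (4, 8)
```

Under these assumptions the fused sequence is half as long as the prefilled one, which is one way to read the efficiency argument in the abstract. The addition is also well defined because it operates on embeddings rather than on discrete token indices.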
Peer Reviews
Decision: ICLR 2025 Poster
1. The proposed method enables fine-grained control in autoregressive image generation by using a control encoder and conditional decoding, achieving high image quality with low additional training cost. 2. This method provides effective resolution control, allowing AR models to overcome the limitations of fixed-resolution generation.
1. Performance comparisons with recent models such as Lumina-mGPT and Cm3leon (or Anole), for example on segmentation-to-image tasks, would strengthen this paper. Additionally, an analysis or discussion of the potential for integration with these models would be beneficial. 2. Spatial conditions like segmentation maps and Canny edges impose strong constraints on structure diversity in generated outputs. Exploring whether some structural diversity can be incorporated within the conditional decoding
The paper has several strengths that make it compelling: The work has a very simple formulation that is elegant. There is a good demonstration of how it’s better than the other obvious approach of conditional prefilling. Also, very little other work exists tackling this problem and this is, to the best of my knowledge, a novel approach for conditioning AR models. They also present class-to-image and T2I evaluations and show strong results on several datasets. Also, this direction of research discove
I don't think I have found weaknesses in the work that should lead to rejection. I am curious about what would happen if certain experiments were run, and these are not very extensive. Some examples: 1. Which layers are ideal to introduce the new control layers on? Right now we have a coarse study of this but it could go deeper, although it's a lot of work that might not be super useful in the end. 2. Some output images shown in the paper show some color saturation or excess contrast - is this a
- **Important topic**: Image-controlled generation is of broad interest and is important for the recently revived research trend of AR-based image generation models.
- **Simple and reasonable**: The method is a reasonable exploration of controlled generation in image AR models, using simple token feature addition in decoding.
- **Good ablations** on training strategy, fusion strategy (cross-attention or addition, addition layers), control encoders.
- **Good visualization r
I will put all questions in this section. Note that they're not all weaknesses. 1. **Equation (4) is not written properly**: here $q_i$ represents a discrete image token, so $q_i \in [V]$, where $V$ is the vocabulary or codebook size. The sum of a discrete token and a continuous feature, $q_i + C_{i+1}$ in Equation (4), is therefore not well defined. Besides, ControlAR adds the control feature to the token feature in three intermediate layers, not the input embedding. 2. **About th
Taxonomy
Topics: Image Retrieval and Classification Techniques · Medical Image Segmentation Techniques · Advanced Image and Video Retrieval Techniques
Methods: Diffusion
