CAR: Controllable Autoregressive Modeling for Visual Generation
Ziyu Yao, Jialin Li, Yifeng Zhou, Yong Liu, Xi Jiang, Chengjie Wang, Feng Zheng, Yuexian Zou, Lei Li

TL;DR
This paper introduces CAR, a novel framework that enables fine-grained controllability in pre-trained autoregressive visual generation models, improving control, image quality, and efficiency over previous methods.
Contribution
CAR is the first control framework for pre-trained autoregressive visual models, integrating conditional control into multi-scale latent modeling for enhanced flexibility.
Findings
Demonstrates excellent controllability across various conditions.
Achieves higher image quality than previous methods.
Requires significantly fewer training resources.
Abstract
Controllable generation, which enables fine-grained control over generated outputs, has emerged as a critical focus in visual generative models. Currently, there are two primary technical approaches in visual generation: diffusion models and autoregressive models. Diffusion models, as exemplified by ControlNet and T2I-Adapter, offer advanced control mechanisms, whereas autoregressive models, despite showcasing impressive generative quality and scalability, remain underexplored in terms of controllability and flexibility. In this study, we introduce Controllable AutoRegressive Modeling (CAR), a novel, plug-and-play framework that integrates conditional control into multi-scale latent variable modeling, enabling efficient control generation within a pre-trained visual autoregressive model. CAR progressively refines and captures control representations, which are injected into each…
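To make the idea of injecting control into multi-scale latent modeling concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract is truncated before it says where the control representations are injected, so the per-scale additive fusion, the average-pooling resize, the scale list, and the names `avg_pool_to` and `inject_control` are all assumptions made for illustration.

```python
import numpy as np

def avg_pool_to(x, size):
    """Average-pool a square (H, W, C) map to (size, size, C). H must be divisible by size."""
    h, w, c = x.shape
    f = h // size
    return x.reshape(size, f, size, f, c).mean(axis=(1, 3))

def inject_control(image_maps, control, weights):
    """Add a resized, linearly projected control map to the feature map of each scale.

    Hypothetical fusion rule (assumption): fused_k = feat_k + pool(control, k) @ W_k.
    """
    fused = []
    for feat, w in zip(image_maps, weights):
        c_k = avg_pool_to(control, feat.shape[0])  # control signal at this scale
        fused.append(feat + c_k @ w)               # additive injection (assumed)
    return fused

rng = np.random.default_rng(0)
scales = [1, 2, 4, 8]   # coarse-to-fine token-map sizes (illustrative values)
dim = 16                # feature width (illustrative value)
control = rng.normal(size=(8, 8, 3))  # stand-in for an edge or depth map
image_maps = [rng.normal(size=(s, s, dim)) for s in scales]
weights = [0.1 * rng.normal(size=(3, dim)) for _ in scales]

fused = inject_control(image_maps, control, weights)
print([f.shape for f in fused])
```

Each fused map keeps the shape of the feature map it was built from, so in this sketch the control signal conditions every scale without changing the token layout a frozen backbone would expect.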
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. The motivation and formulation of the work are clear, namely investigating conditional control of AR image generation models. 2. The model shows competitive performance on various conditional generation tasks. 3. The paper is clearly written and easy to follow.
Two major concerns with this work: 1. The model follows VAR, which leverages multi-scale latents in generation. This greatly limits its application to broader autoregressive image generation models, where no multi-scale latents are used. 2. The computational overhead has not been properly reported. The Transformer module in CAR is built with half of the parameters of the original VAR, which can be expensive in training and inference.
1. The paper presents a novel contribution, as CAR is the first framework for controllable autoregressive image generation. 2. The analysis in Section 4.3 clearly shows that controllable autoregressive modeling functions effectively. 3. The presentation of the paper is well done, with clear equations, figures, and tables.
1. In Section 4.2 (line 643), the authors mention retraining T2I-Adapter and ControlNet but do not provide sufficient details about whether these models were trained with the same parameters and training time as CAR. Additionally, it is worth noting that both methods are based on diffusion models, which seems slightly unconventional. While there may not be directly comparable models, having a stronger baseline would be beneficial. 2. The supplementary material lacks extensive visualizations, wit
- AR-based visual generation is popular. - The proposed approach was tested with different control types.
- The terms "control token", "control map", and "control information" are used interchangeably to describe $c_k$, which is very confusing. - The model architecture (such as $\mathcal{F}$ and $\mathcal{T}$) is described in the experiment section, making it hard to understand the workflow of the proposed approach. - The proposed approach introduces $\mathcal{T}$, which has half the parameters of the original model, plus several fusion modules $\mathcal{F}$, making the model much larger than VAR. - The author claimed that CAR is
* The paper is easy to follow. * The proposed method keeps the pre-trained VAR network unchanged and is trainable with only 8 V100 GPUs. * A detailed ablation study of the network design is provided.
* Some important information is missing from the paper, and there is no appendix to explain those points. See questions. * The problem addressed in this paper is the same as in ControlVAR, yet the paper does not compare against this baseline. In fact, the results from ControlVAR are better than those in this paper. * The derivation of Eq. 3 is not obvious, and no detailed information is given.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Vision and Imaging
Methods: Focus · Diffusion
