ControlVAR: Exploring Controllable Visual Autoregressive Modeling
Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Zhe Lin, Rita Singh, Bhiksha Raj

TL;DR
ControlVAR introduces a flexible, efficient pixel-level control framework for visual autoregressive modeling, outperforming diffusion models in conditional image generation tasks by jointly modeling image and control distributions.
Contribution
It proposes a novel joint modeling approach with a teacher-forcing strategy, enabling controllable and efficient visual generation beyond diffusion models.
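To make the contrast concrete, the difference between conditional and joint modeling can be written out as below. This is a sketch only: $I$ denotes the image and $C$ the pixel-level control (as in the reviews), while the per-scale token symbols $r_k$ and $c_k$, the scale count $K$, and the exact form of the factorization are assumptions made for illustration, not the paper's stated formulation.

```latex
% Conventional conditional model: learns the image distribution given a control
p_\theta(I \mid C)

% Joint model (assumed next-scale factorization over K scales):
% r_k = image tokens at scale k, c_k = control tokens at scale k
p_\theta(I, C) \approx \prod_{k=1}^{K} p_\theta\left(r_k, c_k \mid r_{<k}, c_{<k}\right)
```

Under this reading, a control is imposed at test time by fixing the $c_k$ to the tokens of a given control instead of sampling them.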
Findings
Outperforms popular diffusion-based conditional models in various tasks
Enables flexible pixel-level control during image generation
Demonstrates superior efficiency and efficacy in experiments
Abstract
Conditional visual generation has witnessed remarkable progress with the advent of diffusion models (DMs), especially in tasks like control-to-image generation. However, challenges such as expensive computational cost, high inference latency, and difficulties of integration with large language models (LLMs) have necessitated exploring alternatives to DMs. This paper introduces ControlVAR, a novel framework that explores pixel-level controls in visual autoregressive (VAR) modeling for flexible and efficient conditional generation. In contrast to traditional conditional models that learn the conditional distribution, ControlVAR jointly models the distribution of image and pixel-level conditions during training and imposes conditional controls during testing. To enhance the joint modeling, we adopt the next-scale AR prediction paradigm and unify control and image representations. A…
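The idea of modeling image and control jointly during training and imposing the control at test time can be illustrated with a toy sampling loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_predictor`, `sample_joint`, the per-scale token counts, and the ordering (control tokens before image tokens at each scale) are all hypothetical choices made for illustration.

```python
import random
from typing import Callable, List, Optional, Tuple

Tokens = List[int]


def toy_predictor(history: List[Tokens], n_tokens: int, rng: random.Random) -> Tokens:
    """Hypothetical stand-in for a trained next-scale AR model.

    Returns n_tokens pseudo-random token ids that depend on the history.
    """
    base = sum(sum(tokens) for tokens in history)
    return [(base + i + rng.randrange(4)) % 16 for i in range(n_tokens)]


def sample_joint(
    scales: List[int],
    predictor: Callable[[List[Tokens], int, random.Random], Tokens],
    control_tokens: Optional[List[Tokens]] = None,
    seed: int = 0,
) -> Tuple[List[Tokens], List[Tokens]]:
    """Sample image and control tokens scale by scale.

    If control_tokens is given, the control at each scale is forced to the
    provided tokens (teacher forcing) instead of being sampled, so the image
    tokens are generated conditioned on the given control.
    """
    rng = random.Random(seed)
    history: List[Tokens] = []
    images: List[Tokens] = []
    controls: List[Tokens] = []
    for k, n_tokens in enumerate(scales):
        if control_tokens is not None:
            c_k = list(control_tokens[k])  # impose the given control
        else:
            c_k = predictor(history, n_tokens, rng)  # sample the control too
        history.append(c_k)  # assumed ordering: control before image
        r_k = predictor(history, n_tokens, rng)
        history.append(r_k)
        controls.append(c_k)
        images.append(r_k)
    return images, controls


if __name__ == "__main__":
    scales = [1, 4, 9]  # hypothetical token counts per scale
    given = [[3], [1, 2, 3, 4], [0] * 9]
    images, controls = sample_joint(scales, toy_predictor, control_tokens=given)
    print(controls == given, [len(r) for r in images])
```

Calling `sample_joint` without `control_tokens` samples both streams, which corresponds to unconditional joint generation in this toy setting. The sketch shows only plain teacher forcing of the control stream; it does not implement the teacher-forcing guidance (TFG) mentioned in the reviews.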
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
* The alternating prediction of image tokens and control tokens seems new.
* Experiments show the effectiveness of the proposed methods.
* The visualization is clear and helps illustrate the framework of the proposed method and the generation results.
* The organization of the paper lacks clarity, with some confusing aspects. For instance, at the beginning of Section 3, the notation is unclear: $C$ represents pixel-level control, while $c$ stands for token-level control, but the distinction between these two is not fully explained. Additionally, the problem formulation is introduced without any examples, making it challenging to follow. The first example of control only appears on page five, and the tokenization method is explained on page si
1. **Promising Direction for AR Model Control**: The paper addresses an important and interesting challenge, how to control autoregressive (AR) models effectively, which is valuable for the community as AR applications grow.
2. **Well-Designed Experiments**: The experiments are thoughtfully set up for various tasks, providing a clear view of the framework’s capabilities and its performance compared to popular models.
3. **Clear Writing**: The paper is well-written and easy to follow, making the te
1. **Resource-Intensive Tuning and Limited Flexibility**: My major concern is the motivation for the work. The limitation of ControlVAR is its requirement for fine-tuning the pre-trained VAR model, which reduces the method’s flexibility and scalability. Unlike diffusion models such as Stable Diffusion, where ControlNet adds control without altering the base model’s weights, ControlVAR necessitates modifications to the underlying VAR model to enable control. This limitation makes it less practica
1. This paper focuses on controllable image generation using autoregressive models, a forward-looking area with substantial application potential, providing valuable insights for the research community.
2. The paper introduces ControlVAR, which employs pixel-level controls in autoregressive modeling for controllable image generation, and employs several innovative mechanisms, such as teacher-forcing guidance (TFG) for controllable sampling.
3. The paper provides some empirical results, outperfor
1. ControlVAR requires tuning the pre-trained VAR model, which limits the flexibility of the proposed method. Switching to a new base model still requires retraining, making this approach resource-intensive. In diffusion models like Stable Diffusion, ControlNet does not modify SD’s weights. However, if there is a mature autoregressive generation model with a parameter size similar to SD in the future, ControlVAR would require retraining or fine-tuning it, which would be unacceptable in most appl
Taxonomy
Topics: Data Visualization and Analytics
Methods: Diffusion
