Context-Aware Autoregressive Models for Multi-Conditional Image Generation
Yixiao Chen, Zhiyuan Ma, Guoli Jia, Che Jiang, Jianjun Li, Bowen Zhou

TL;DR
ContextAR introduces a flexible autoregressive framework that effectively incorporates multiple conditions into image generation, achieving high controllability and competitive performance without fine-tuning.
Contribution
It proposes a novel method to embed diverse conditions into token sequences and introduces hybrid positional encodings and conditional attention for improved multi-conditional image generation.
Findings
Supports arbitrary condition combinations during inference.
Achieves competitive performance with diffusion models.
Demonstrates high controllability and versatility.
Abstract
Autoregressive transformers have recently shown impressive image generation quality and efficiency on par with state-of-the-art diffusion models. Unlike diffusion architectures, autoregressive models can naturally incorporate arbitrary modalities into a single, unified token sequence, offering a concise solution for multi-conditional image generation tasks. In this work, we propose ContextAR, a flexible and effective framework for multi-conditional image generation. ContextAR embeds diverse conditions (e.g., canny edges, depth maps, poses) directly into the token sequence, preserving modality-specific semantics. To maintain spatial alignment while enhancing discrimination among different condition types, we introduce hybrid positional encodings that fuse Rotary Position Embedding with Learnable Positional Embedding. We design Conditional Context-aware Attention to reduce…
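The abstract is cut off before it describes Conditional Context-aware Attention, so the following is only an illustrative sketch of one way an attention mask over a unified token sequence could limit interaction between condition types. The layout (condition blocks first, image tokens last), the rule that condition tokens attend only within their own block, and the names `build_mask`, `num_conditions`, `tokens_per_condition`, and `num_image_tokens` are all assumptions made for illustration, not the paper's confirmed design.

```python
from typing import List


def build_mask(num_conditions: int,
               tokens_per_condition: int,
               num_image_tokens: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    Assumed (hypothetical) sequence layout:
        [cond_0 tokens | cond_1 tokens | ... | image tokens]
    """
    cond_len = num_conditions * tokens_per_condition
    total = cond_len + num_image_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if q < cond_len:
                # Assumption: a condition token attends only to tokens of
                # the same condition block (no cross-condition attention).
                same_block = (q // tokens_per_condition
                              == k // tokens_per_condition)
                mask[q][k] = k < cond_len and same_block
            else:
                # Image token: sees every condition token plus image tokens
                # up to and including itself (causal, autoregressive order).
                mask[q][k] = k <= q
    return mask


# Usage: two conditions of two tokens each, followed by three image tokens.
mask = build_mask(num_conditions=2, tokens_per_condition=2, num_image_tokens=3)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
# 1100000
# 1100000
# 0011000
# 0011000
# 1111100
# 1111110
# 1111111
```

Under these assumptions, dropping a condition at inference time only removes its block of rows and columns, which is one way an arbitrary subset of conditions could be supported without changing the rest of the sequence.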
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper is well-motivated. In practical applications, users often wish to impose multiple constraints simultaneously (e.g., specifying pose, depth, and subject appearance). The paper clearly identifies a key limitation of parallel generation in diffusion models: when guidance signals from different conditions (like Canny edges and subject textures) are applied globally and concurrently, they can conflict, forcing the model to produce a sub-optimal, "compromised" result. The core idea of usi
**Extensibility to New Conditions:** A primary weakness concerns the extensibility to new, unseen conditions. The paper's method relies on jointly training all conditions (Canny, Depth, HED, Pose, etc.) as part of a unified sequence. This contrasts with models like ControlNet, which allow for "plug-and-play" training and addition of new control types onto a frozen base model. The paper does not explicitly discuss how a new condition (e.g., "Scribble" or "Segmentation") could be efficiently a
S1. The proposed methods, learnable positional embedding and various attention masks for effectively leveraging multiple conditions, make sense. S2. Empirical results show promising and competitive performance.
Although I think the experiments present promising results, I have serious concerns regarding the rationale behind the key claims and experiments. W1. The key motivation lacks theoretical or empirical support. The paper claims that diffusion models inherently suffer from a “tug-of-war” between conditions, whereas autoregressive models can resolve this issue. However, there is neither theoretical nor empirical analysis to support this claim, since Section 3 is merely based on an assumption. In f
**Empirical performance**: The method achieves strong controllability (highest SSIM) and comparable image quality (FID) to state-of-the-art diffusion systems despite using a smaller AR backbone. **Unified formulation**: The design supports flexible combinations of conditions during inference without retraining, highlighting modularity in condition composition.
**Limited model novelty.** The architectural contribution appears incremental. The proposed ContextAR primarily concatenates multiple condition tokens (encoded via VQ-VAE) into a unified sequence and slightly modifies the attention pattern to restrict cross-condition interaction. Such conditioning strategies have already been explored in ControlAR (Li et al., 2024c, ICLR’25) and EditAR (Mu et al., 2025, arXiv’25). As a result, the novelty lies mostly in implementation refinements rather than in
ControlAR outperforms previous baselines on MultiGen-20M.
1. My largest reservation is that the method is poorly motivated. All of section 3 is vague. At best, this is a hypothesis with no backup theory or empirical validation. 1-1. The main claim for switching from parallel generation to sequential generation is that the conditions provide contradictory information for the same region. This also holds true for autoregressive models as the conditions are given in the same way. Generating an image patch-by-patch does not prevent this from happening. 1
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Domain Adaptation and Few-Shot Learning
Methods: Softmax · Attention Is All You Need · Diffusion
