SCALAR: Scale-wise Controllable Visual Autoregressive Learning
Ryan Xu, Dongyang Jin, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, and Xiangxiang Chu

TL;DR
SCALAR introduces a novel scale-wise conditional decoding method for visual autoregressive models, enabling fine-grained, efficient, and high-quality controllable image synthesis with multi-modal guidance.
Contribution
SCALAR proposes a new scale-wise control mechanism and a unified model for multi-modal guidance in VAR-based image generation, with the aim of improving both control and fidelity.
Findings
Achieves superior control precision in image synthesis
Demonstrates improved generation quality over existing methods
Supports flexible multi-conditional guidance
Abstract
Controllable image synthesis, which enables fine-grained control over generated outputs, has emerged as a key focus in visual generative modeling. However, controllable generation remains challenging for Visual Autoregressive (VAR) models due to their hierarchical, next-scale prediction style. Existing VAR-based methods often suffer from inefficient control encoding and disruptive injection mechanisms that compromise both fidelity and efficiency. In this work, we present SCALAR, a controllable generation method based on VAR, incorporating a novel Scale-wise Conditional Decoding mechanism. SCALAR leverages a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. This design provides persistent and structurally aligned guidance throughout the generation…
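The scale-wise conditioning idea in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the nearest-neighbour resizing, the per-scale linear projection, the additive injection, and all function names (resize_nearest, project, scale_wise_control, inject) are assumptions made here for illustration. The control feature map stands in for the output of the pretrained image encoder, and each scale of the VAR token pyramid receives its own projected copy.

```python
# Hypothetical sketch of scale-wise control projection and injection.
# Every name and design choice below is an assumption, not the paper's code.

def resize_nearest(feat, size):
    """Nearest-neighbour resize of a C x H x W nested list to C x size x size."""
    h, w = len(feat[0]), len(feat[0][0])
    rows = [i * h // size for i in range(size)]
    cols = [j * w // size for j in range(size)]
    return [[[ch[r][c] for c in cols] for r in rows] for ch in feat]

def project(tokens, weight):
    """Multiply an N x C list of tokens by a C x D weight matrix."""
    return [[sum(t[k] * weight[k][d] for k in range(len(weight)))
             for d in range(len(weight[0]))] for t in tokens]

def scale_wise_control(control_feat, scales, weights):
    """Build one N_k x D control tensor per scale (N_k = size_k * size_k)."""
    per_scale = []
    for size, weight in zip(scales, weights):
        resized = resize_nearest(control_feat, size)
        # Flatten the spatial grid into tokens in row-major order.
        tokens = [[ch[r][c] for ch in resized]
                  for r in range(size) for c in range(size)]
        per_scale.append(project(tokens, weight))
    return per_scale

def inject(hidden, control):
    """Additive injection (an assumed choice) of control into hidden tokens."""
    return [[h + c for h, c in zip(h_row, c_row)]
            for h_row, c_row in zip(hidden, control)]

# Toy example: 2-channel 4x4 control features, three scales, model width 3.
control_feat = [[[float(ch * 16 + r * 4 + c) for c in range(4)]
                 for r in range(4)] for ch in range(2)]
scales = [1, 2, 4]
weights = [[[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]] for _ in scales]
control = scale_wise_control(control_feat, scales, weights)
hidden = [[[0.0] * 3 for _ in range(s * s)] for s in scales]
conditioned = [inject(h, c) for h, c in zip(hidden, control)]
```

Each entry of conditioned has as many tokens as its scale's grid, so the guidance stays spatially aligned with the tokens predicted at that scale. A real system would use learned projections and tensor operations rather than nested lists.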
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
