Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling
Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue

TL;DR
This paper introduces Set AutoRegressive Modeling (SAR), a flexible paradigm for image generation that generalizes conventional autoregressive methods by predicting an arbitrary set of tokens at each step, enabling few-step inference with KV cache acceleration and improved scalability.
Contribution
The paper proposes SAR, a new autoregressive framework that unifies and extends existing models, with a Fully Masked Transformer architecture and demonstrated benefits on ImageNet and text-to-image tasks.
Findings
SAR generalizes AR and MAR, offering flexible inference options.
Training with SAR improves image synthesis quality and efficiency.
A 900M-parameter model generates high-quality, photorealistic images.
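The claim that SAR generalizes AR can be made concrete with a next-set factorization. The following is a sketch inferred from the summary on this page, not an equation copied from the paper; the symbols N (number of tokens), K (number of sets), and X_k (the k-th set of tokens) are assumed notation.

```latex
p(x_1, \dots, x_N) \;=\; \prod_{k=1}^{K} p\left(X_k \mid X_1, \dots, X_{k-1}\right),
\qquad
\bigcup_{k=1}^{K} X_k = \{x_1, \dots, x_N\},
\quad
X_i \cap X_j = \emptyset \ \ (i \neq j)
```

Under this notation, choosing K = N with one token per set recovers the usual next-token factorization, while smaller K with larger sets moves toward predicting many tokens in a single step.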
Abstract
We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transformer. We reveal that existing AR variants correspond to specific design choices of sequence order and output intervals within the SAR framework, with AR and Masked AR (MAR) as two extreme instances. Notably, SAR facilitates a seamless transition from AR to MAR, where intermediate states allow for training a causal model that benefits from both few-step inference and KV cache acceleration, thus leveraging the advantages of both AR and MAR. On the ImageNet benchmark, we carefully explore the…
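To illustrate the next-set idea, here is a minimal sketch of a set-wise causal attention mask. It assumes that a token may attend to every token in its own set and in all earlier sets; the function names `set_ids` and `set_causal_mask` are hypothetical, and the masks used by the paper's Fully Masked Transformer may differ.

```python
import numpy as np


def set_ids(set_sizes):
    """Map each token position to the index of the set it belongs to."""
    return np.repeat(np.arange(len(set_sizes)), set_sizes)


def set_causal_mask(set_sizes):
    """Return a boolean matrix where mask[i, j] is True if token i may attend to token j.

    Assumption (not taken from the paper): a token sees all tokens in its own
    set and in every earlier set, and nothing from later sets.
    """
    ids = set_ids(set_sizes)
    return ids[:, None] >= ids[None, :]


if __name__ == "__main__":
    # One token per set: the mask is the standard lower-triangular causal mask.
    print(set_causal_mask([1, 1, 1, 1]).astype(int))
    # Two sets of two tokens: a block-wise causal mask.
    print(set_causal_mask([2, 2]).astype(int))
```

With singleton sets the mask reduces to ordinary token-level causal attention, which is the AR end of the spectrum described in the abstract. Coarser splits produce larger blocks and fewer generation steps, while the causal structure across sets is what keeps a KV cache usable.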
Taxonomy
Topics: Color perception and design
Methods: Attention Is All You Need · Dense Connections · Residual Connection · Dropout · Layer Normalization · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Linear Layer
