Marrying Autoregressive Transformer and Diffusion with Multi-Reference Autoregression
Dingcheng Zhen, Qian Qiao, Xu Zheng, Tan Yu, Kangxi Wu, Ziwei Zhang, Siyuan Liu, Shunshun Yin, Ming Tao

TL;DR
TransDiff is an image generation model that combines an autoregressive (AR) Transformer with a diffusion model. On ImageNet 256x256 it reports an FID of 1.61 with about 2x faster inference than AR Transformer baselines, and the paper also introduces Multi-Reference Autoregression (MRAR) to improve diversity and quality.
Contribution
The paper presents TransDiff, which the authors describe as the first image generation model to integrate an autoregressive Transformer with a diffusion model, and introduces Multi-Reference Autoregression (MRAR) for improved image diversity and quality.
Findings
TransDiff achieves an FID of 1.61 and an IS of 293.4 on ImageNet 256x256.
TransDiff offers 2x faster inference than state-of-the-art AR Transformer-based models and 112x faster inference than diffusion-only models.
Multi-Reference Autoregression (MRAR) improves TransDiff's FID from 1.61 to 1.42.
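To put the last finding in proportion, the short calculation below converts the two reported FID values into an absolute and a relative reduction. Only the two numbers come from the findings; the variable names are illustrative.

```python
# Relative FID reduction implied by the reported numbers (lower FID is better).
fid_transdiff = 1.61   # TransDiff, as reported in the findings
fid_with_mrar = 1.42   # TransDiff with MRAR, as reported in the findings

absolute_drop = fid_transdiff - fid_with_mrar
relative_drop = absolute_drop / fid_transdiff

print(f"absolute FID drop: {absolute_drop:.2f}")   # 0.19
print(f"relative FID drop: {relative_drop:.1%}")   # 11.8%
```

In other words, MRAR lowers the reported FID by 0.19 points, or roughly 12% relative to the base model.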
Abstract
We introduce TransDiff, the first image generation model that marries an Autoregressive (AR) Transformer with diffusion models. In this joint modeling framework, TransDiff encodes labels and images into high-level semantic features and employs a diffusion model to estimate the distribution of image samples. On the ImageNet 256x256 benchmark, TransDiff significantly outperforms other image generation models based on a standalone AR Transformer or diffusion model. Specifically, TransDiff achieves a Fréchet Inception Distance (FID) of 1.61 and an Inception Score (IS) of 293.4, and further provides 2x faster inference compared to state-of-the-art methods based on AR Transformers and 112x faster inference compared to diffusion-only models. Furthermore, building on the TransDiff model, we introduce a novel image generation paradigm called Multi-Reference Autoregression (MRAR), which…
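Since the headline results are stated in FID, a toy illustration of what that metric measures may help. FID is the Fréchet distance between two Gaussians fitted to features of real and generated images: ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below is not the paper's evaluation code. It assumes one-dimensional features, where the formula reduces to (mu_r - mu_g)^2 + (sigma_r - sigma_g)^2, whereas real FID uses high-dimensional Inception network features and full covariance matrices. The function name and sample values are hypothetical.

```python
from statistics import fmean, pstdev

def frechet_distance_1d(real, generated):
    """Fréchet distance (as used in FID) between 1-D Gaussians fitted to two samples."""
    mu_r, mu_g = fmean(real), fmean(generated)
    sigma_r, sigma_g = pstdev(real), pstdev(generated)
    # 1-D special case of ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))
    return (mu_r - mu_g) ** 2 + (sigma_r - sigma_g) ** 2

# Hypothetical scalar "features"; real FID uses Inception activations.
real = [0.0, 2.0]     # mean 1, std 1
close = [1.0, 3.0]    # mean 2, std 1: shifted mean only
far = [0.0, 4.0]      # mean 2, std 2: shifted mean and wider spread

print(frechet_distance_1d(real, real))    # 0.0
print(frechet_distance_1d(real, close))   # 1.0
print(frechet_distance_1d(real, far))     # 2.0
```

The distance is zero only when the two fitted distributions coincide, and it grows when either the mean or the spread of the generated features drifts from the real ones, which is why a lower FID is read as better sample quality.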
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face Recognition and Perception
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Diffusion
