MonoFormer: One Transformer for Both Diffusion and Autoregression

Chuyang Zhao; Yuxing Song; Wenhao Wang; Haocheng Feng; Errui Ding; Yifan Sun; Xinyan Xiao; Jingdong Wang

arXiv:2409.16280·cs.CV·September 25, 2024

TL;DR

MonoFormer introduces a unified transformer model that effectively handles both diffusion-based visual generation and autoregressive text generation, simplifying multimodal generation architectures.

Contribution

It demonstrates that sharing a single transformer for both diffusion and autoregression is feasible and achieves competitive performance.

Findings

01 Achieves image generation quality comparable to state-of-the-art methods.

02 Maintains effective text generation capabilities.

03 Simplifies multimodal generation architecture.

Abstract

Most existing multimodal methods either use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or use a single backbone by discretizing the visual data so that autoregression handles both text and visual generation. In this paper, we study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) transformers have been successfully applied to diffusion for visual generation, and (ii) transformer training for autoregression and diffusion is very similar; the only difference is that diffusion uses a bidirectional attention mask whereas autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The…
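
The mask distinction in point (ii) can be illustrated with a minimal sketch. It is not taken from the MonoFormer repository: the function name `build_attention_mask`, the `is_image` flag list, and the rule that each contiguous run of image tokens attends bidirectionally within itself while everything else stays causal are all assumptions made for illustration.

```python
def build_attention_mask(is_image):
    """Return mask[i][j] == True when position i may attend to position j.

    is_image: list of bools, True for image (diffusion) tokens and
    False for text (autoregressive) tokens.

    Assumption (hypothetical, for illustration only): each contiguous run
    of image tokens forms one block with full bidirectional attention
    inside the block; all other attention is causal.
    """
    n = len(is_image)

    # Give every contiguous run of image tokens its own block id.
    block = [None] * n
    current = -1
    for i, img in enumerate(is_image):
        if img:
            if i == 0 or not is_image[i - 1]:
                current += 1
            block[i] = current

    # Start from a causal mask: position i sees positions 0..i.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up attention inside each image block in both directions.
    for i in range(n):
        for j in range(n):
            if block[i] is not None and block[i] == block[j]:
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Two text tokens, three image tokens, one trailing text token.
    layout = [False, False, True, True, True, False]
    for row in build_attention_mask(layout):
        print("".join("1" if allowed else "." for allowed in row))
```

With all flags set to `False` the function reduces to an ordinary causal mask, and with all flags set to `True` it reduces to a fully bidirectional one, which are the two regimes the abstract contrasts.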

Code & Models

Repositories

MonoFormer/MonoFormer (PyTorch, official)

Taxonomy

Topics: Sensor Technology and Measurement Systems · Induction Heating and Inverter Technology · Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Diffusion