Computational Tradeoffs in Image Synthesis: Diffusion, Masked-Token, and Next-Token Prediction

Maciej Kilian; Varun Jampani; Luke Zettlemoyer

arXiv:2405.13218·cs.CV·May 27, 2024

TL;DR

This paper compares diffusion, masked-token, and next-token prediction methods in image synthesis, analyzing their performance and efficiency across compute budgets, and provides recommendations based on application needs.

Contribution

It offers the first direct, compute-controlled comparison of these image synthesis approaches, highlighting their scalability, performance, and efficiency differences.

Findings

01

Token prediction methods outperform diffusion in prompt following.

02

Next-token prediction is the most compute-efficient approach at inference.

03

Diffusion matches token prediction in image quality at scale.

Abstract

Nearly every recent image synthesis approach, including diffusion, masked-token prediction, and next-token prediction, uses a Transformer network architecture. Despite this common backbone, there has been no direct, compute-controlled comparison of how these approaches affect performance and efficiency. We analyze the scalability of each approach through the lens of compute budget measured in FLOPs. We find that token prediction methods, led by next-token prediction, significantly outperform diffusion on prompt following. On image quality, while next-token prediction initially performs better, scaling trends suggest it is eventually matched by diffusion. We compare the inference compute efficiency of each approach and find that next-token prediction is by far the most efficient. Based on our findings, we recommend diffusion for applications targeting image quality and low latency; and…
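To make the idea of a compute-controlled comparison concrete, here is a minimal sketch of how a fixed FLOP budget could be translated into a training length for models of different sizes. It relies on the common rule of thumb that training a Transformer costs roughly 6 FLOPs per parameter per token; that approximation, the function names, the budget, and the model sizes are all assumptions for illustration and are not taken from the paper, which may account for FLOPs differently.

```python
def approx_training_flops(n_params: int, n_tokens: int) -> float:
    """Rule-of-thumb estimate (an assumption): ~6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


def tokens_for_budget(flops_budget: float, n_params: int) -> int:
    """Largest number of training tokens that fits within the FLOP budget."""
    return int(flops_budget // (6.0 * n_params))


# Hypothetical budget and model sizes, chosen only to illustrate the idea.
budget = 1e20
model_sizes = {
    "small": 100_000_000,
    "medium": 400_000_000,
    "large": 1_600_000_000,
}

# Under a fixed budget, larger models see fewer training tokens.
schedule = {name: tokens_for_budget(budget, n) for name, n in model_sizes.items()}

for name, n_tokens in schedule.items():
    used = approx_training_flops(model_sizes[name], n_tokens)
    print(f"{name}: {n_tokens} tokens, {used:.3e} FLOPs")
```

Holding the budget fixed in this way means every approach is compared at equal estimated cost rather than at equal parameter count or equal number of training steps.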

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Image Retrieval and Classification Techniques

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Position-Wise Feed-Forward Layer · Multi-Head Attention · Dropout · Dense Connections