Diffusion Autoencoders are Scalable Image Tokenizers

Yinbo Chen; Rohit Girdhar; Xiaolong Wang; Sai Saketh Rambhatla; Ishan Misra

arXiv:2501.18593 · cs.CV · January 31, 2025

TL;DR

This paper introduces DiTo, a diffusion-based image tokenizer that simplifies training by using a single diffusion loss, achieving competitive image representations without complex heuristics or supervision.

Contribution

The paper presents a scalable, self-supervised diffusion tokenizer that simplifies training and outperforms or matches state-of-the-art image tokenizers in quality.

Findings

01

DiTo achieves comparable or better image reconstruction quality.

02

DiTo simplifies training by using only diffusion L2 loss.

03

DiTo is scalable and self-supervised, reducing reliance on heuristics.

Abstract

Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that…
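As a rough illustration of what training an autoencoder with a single diffusion L2 loss can look like, the sketch below computes one loss estimate with NumPy. It is not the paper's implementation: the noise-prediction target, the linear interpolation between image and noise, and the toy linear `encode` and `denoise` functions are all assumptions made for illustration, and the names `diffusion_l2_loss`, `W_enc`, and `W_dec` are hypothetical.

```python
import numpy as np

def diffusion_l2_loss(x, encode, denoise, rng):
    """One Monte Carlo estimate of a diffusion L2 loss for an autoencoder.

    Assumptions (not taken from the paper): noise-prediction target and a
    linear interpolation x_t = (1 - t) * x + t * eps as the noising process.
    """
    batch = x.shape[0]
    z = encode(x)                                # compact latent representation
    t = rng.uniform(0.0, 1.0, size=(batch, 1))   # one noise level per sample
    eps = rng.standard_normal(x.shape)           # Gaussian noise
    x_t = (1.0 - t) * x + t * eps                # noised input
    eps_hat = denoise(x_t, t, z)                 # decoder is conditioned on z
    return float(np.mean((eps_hat - eps) ** 2))  # the only loss term

# Toy stand-ins for the encoder and the diffusion decoder (hypothetical).
rng = np.random.default_rng(0)
dim, latent_dim = 16, 4
W_enc = rng.standard_normal((dim, latent_dim)) * 0.1
W_dec = rng.standard_normal((dim + 1 + latent_dim, dim)) * 0.1

def encode(x):
    # Linear "tokenizer": maps a flattened image to a short latent vector.
    return x @ W_enc

def denoise(x_t, t, z):
    # Linear "decoder": sees the noised input, the noise level, and the latent.
    return np.concatenate([x_t, t, z], axis=1) @ W_dec

x = rng.standard_normal((8, dim))                # batch of 8 flattened "images"
loss = diffusion_l2_loss(x, encode, denoise, rng)
print(f"diffusion L2 loss: {loss:.4f}")
```

In this sketch the latent `z` is useful only insofar as it helps the decoder denoise, so the encoder and decoder would be trained jointly by minimizing this one scalar, with no additional loss terms to balance.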

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis

Methods: Diffusion