Diffusion Autoencoders are Scalable Image Tokenizers
Yinbo Chen, Rohit Girdhar, Xiaolong Wang, Sai Saketh Rambhatla, Ishan Misra

TL;DR
This paper introduces DiTo, a diffusion-based image tokenizer that simplifies training by using a single diffusion loss, achieving competitive image representations without complex heuristics or supervision.
Contribution
The paper presents a scalable, self-supervised diffusion tokenizer that simplifies training and outperforms or matches state-of-the-art image tokenizers in quality.
Findings
DiTo achieves image reconstruction quality comparable to or better than state-of-the-art image tokenizers.
DiTo simplifies training by using a single objective, the diffusion L2 loss.
DiTo is scalable and self-supervised, reducing reliance on heuristics.
Abstract
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that…
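To make the "single diffusion L2 loss" idea concrete, here is a minimal sketch of how one such training loss could be computed for a diffusion autoencoder. It is an illustration under stated assumptions, not the authors' implementation: the linear `encode` and `predict_noise` functions, the cosine noise schedule, the noise-prediction target, and all sizes are hypothetical choices made for this example, since the summary does not specify them.

```python
import math
import random


def encode(x, w_enc):
    # Hypothetical encoder: a linear map from the image vector to a shorter latent.
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


def predict_noise(x_t, t, z, w_x, w_z):
    # Hypothetical decoder: predicts the added noise from the noisy image,
    # the timestep t, and the latent z (the "tokens").
    out = []
    for i in range(len(x_t)):
        from_x = sum(w * v for w, v in zip(w_x[i], x_t))
        from_z = sum(w * v for w, v in zip(w_z[i], z))
        out.append(from_x + from_z + t)
    return out


def diffusion_l2_loss(x, w_enc, w_x, w_z, rng):
    # Sample a timestep and Gaussian noise, then corrupt the image.
    t = rng.random()
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    # Assumed cosine schedule; any schedule with alpha^2 + sigma^2 = 1 would do here.
    alpha = math.cos(0.5 * math.pi * t)
    sigma = math.sin(0.5 * math.pi * t)
    x_t = [alpha * xi + sigma * ei for xi, ei in zip(x, eps)]
    # The latent comes from the clean image; the decoder denoises conditioned on it.
    z = encode(x, w_enc)
    eps_hat = predict_noise(x_t, t, z, w_x, w_z)
    # The only training signal: mean squared error between predicted and true noise.
    return sum((a - b) ** 2 for a, b in zip(eps_hat, eps)) / len(x)


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D, K = 16, 4  # toy image dimension and latent dimension (assumed values)
x = [rng.gauss(0.0, 1.0) for _ in range(D)]
w_enc = rand_matrix(K, D, rng)
w_x = rand_matrix(D, D, rng)
w_z = rand_matrix(D, K, rng)
loss = diffusion_l2_loss(x, w_enc, w_x, w_z, rng)
print(loss)
```

In this sketch the encoder and decoder would be trained jointly by minimizing `loss` alone, with no additional loss terms to balance; that is the sense in which a single objective simplifies the recipe.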
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis
Methods: Diffusion
