Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen

TL;DR
This paper explores how scaling auto-encoders, especially Vision Transformer-based ones, affects image and video reconstruction and generation. It finds that scaling the bottleneck, the encoder, and the decoder each affects the two tasks differently, and it introduces a lightweight, high-performance tokenizer called ViTok.
Contribution
The study systematically investigates the effects of scaling auto-encoder components and introduces ViTok, a scalable, efficient Vision Transformer-based tokenizer that improves reconstruction and generation tasks.
Findings
Scaling the auto-encoder bottleneck correlates with reconstruction quality but has a more complex effect on generation.
Scaling the encoder yields minimal gains for both tasks.
Scaling the decoder improves reconstruction but has mixed effects on generation.
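To make "bottleneck size" concrete, here is a minimal sketch of one common way to quantify it for a patch-based tokenizer: the number of latent tokens times the channels per token. The function names, the patch size, and the channel count are illustrative assumptions, not values or definitions taken from the paper.

```python
def bottleneck_size(height, width, patch_size, latent_channels):
    """Return (num_tokens, total_latent_values) for a patch-based tokenizer.

    Assumption: the image is split into non-overlapping square patches and
    each patch maps to one latent token with `latent_channels` values.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    num_tokens = (height // patch_size) * (width // patch_size)
    return num_tokens, num_tokens * latent_channels


def compression_ratio(height, width, patch_size, latent_channels, pixel_channels=3):
    """Ratio of raw pixel values to latent values (higher means more compression)."""
    _, latent_values = bottleneck_size(height, width, patch_size, latent_channels)
    return (height * width * pixel_channels) / latent_values


if __name__ == "__main__":
    # Hypothetical configuration: 256x256 RGB image, 16x16 patches, 16 channels.
    tokens, values = bottleneck_size(256, 256, 16, 16)
    print(tokens, values)                       # 256 tokens, 4096 latent values
    print(compression_ratio(256, 256, 16, 16))  # 48.0
```

Under these assumptions, widening the bottleneck (more channels or smaller patches) lowers the compression ratio, which is the axis along which the first finding reports a clear link to reconstruction and a less clear one to generation.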
Abstract
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and…
Taxonomy
Topics: Augmented Reality Applications · Human Motion and Animation
Methods: Attention Is All You Need · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Vision Transformer · Multi-Head Attention
