Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Philippe Hansen-Estruch; David Yan; Ching-Yao Chung; Orr Zohar; Jialiang Wang; Tingbo Hou; Tao Xu; Sriram Vishwanath; Peter Vajda; Xinlei Chen

arXiv:2501.09755·cs.CV·January 17, 2025

Open Access

TL;DR

This paper studies how scaling auto-encoders built on Vision Transformers affects image and video reconstruction and generation. It finds that the two tasks respond differently to scaling, and it introduces a lightweight, high-performance tokenizer called ViTok.

Contribution

The study systematically investigates the effects of scaling auto-encoder components and introduces ViTok, a scalable, efficient Vision Transformer-based tokenizer that improves reconstruction and generation performance.

Findings

01

Scaling the auto-encoder bottleneck correlates with reconstruction quality but has complex effects on generation.

02

Scaling the encoder yields minimal gains for both tasks.

03

Scaling the decoder improves reconstruction but has mixed effects on generation.
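
To make the "bottleneck" in finding 01 concrete, here is a minimal sketch of how the size of a patch-based auto-encoder latent can be counted. It assumes the image is split into non-overlapping square patches and each patch becomes one latent token with a fixed number of channels. The function name, the parameter names, and the example sizes are illustrative assumptions, not values or definitions taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class BottleneckStats:
    tokens: int               # number of latent tokens (one per patch)
    latent_floats: int        # total values stored in the latent
    compression_ratio: float  # pixel values divided by latent values


def bottleneck_stats(height, width, patch_size, latent_channels, pixel_channels=3):
    """Count the latent size for a hypothetical patch-based auto-encoder.

    Assumption: non-overlapping square patches, one latent token per patch.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be divisible by patch_size")
    tokens = (height // patch_size) * (width // patch_size)
    latent_floats = tokens * latent_channels
    pixel_values = height * width * pixel_channels
    return BottleneckStats(tokens, latent_floats, pixel_values / latent_floats)


if __name__ == "__main__":
    # Illustrative sizes only: a 256x256 RGB image, 16x16 patches, 16 latent channels.
    print(bottleneck_stats(256, 256, patch_size=16, latent_channels=16))
```

Under these assumptions, widening the bottleneck means raising `latent_channels` or lowering `patch_size`. Either change stores more values per image and lowers the compression ratio.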

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work aims to conduct an exploration of scaling in auto-encoders to fill in this blank. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and…

Taxonomy

Topics: Augmented Reality Applications · Human Motion and Animation

Methods: Attention Is All You Need · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Absolute Position Encodings · Vision Transformer · Multi-Head Attention