Factorized Visual Tokenization and Generation

Zechen Bai; Jianxiong Gao; Ziteng Gao; Pichao Wang; Zheng Zhang; Tong He; Mike Zheng Shou

arXiv:2411.16681 · cs.CV · November 28, 2024

TL;DR

This paper introduces Factorized Quantization (FQ), a scalable and efficient method for visual tokenization that decomposes large codebooks into sub-codebooks with regularization and leverages pretrained models to enhance image generation.

Contribution

It proposes a novel factorization approach for VQ-based tokenizers, improving scalability, diversity, and semantic richness in visual representations.

Findings

1. Achieves state-of-the-art reconstruction quality.
2. Enhances auto-regressive image generation performance.
3. Reduces codebook complexity and redundancy.

Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy,…
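
The factorization idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each latent vector is simply split into equal chunks, one per sub-codebook, and that each chunk is matched to its nearest entry by squared Euclidean distance. The names `nearest_code`, `factorized_quantize`, `NUM_BOOKS`, `BOOK_SIZE`, and `DIM` are hypothetical, and the disentanglement regularization is not modeled. The point is only the counting argument: two sub-codebooks of 16 entries give 16 × 16 = 256 possible token combinations while needing only 16 + 16 = 32 distance comparisons per vector.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def factorized_quantize(vec, codebooks):
    """Quantize one latent vector with several independent sub-codebooks.

    ASSUMPTION (not from the paper): the vector is split into equal chunks,
    and chunk j is looked up only in sub-codebook j.
    """
    chunk = len(vec) // len(codebooks)
    indices, quantized = [], []
    for j, codebook in enumerate(codebooks):
        part = vec[j * chunk:(j + 1) * chunk]
        idx = nearest_code(part, codebook)
        indices.append(idx)              # one discrete token per sub-codebook
        quantized.extend(codebook[idx])  # concatenate the selected entries
    return indices, quantized


# Toy, randomly initialised sub-codebooks (hypothetical sizes).
random.seed(0)
NUM_BOOKS, BOOK_SIZE, DIM = 2, 16, 8
codebooks = [
    [[random.gauss(0, 1) for _ in range(DIM // NUM_BOOKS)]
     for _ in range(BOOK_SIZE)]
    for _ in range(NUM_BOOKS)
]
latent = [random.gauss(0, 1) for _ in range(DIM)]

indices, quantized = factorized_quantize(latent, codebooks)

effective_vocab = BOOK_SIZE ** NUM_BOOKS  # distinct index combinations
comparisons = BOOK_SIZE * NUM_BOOKS       # distance computations per vector
print(indices, effective_vocab, comparisons)
```

A single flat codebook with the same effective vocabulary would need 256 comparisons per vector, which is the lookup cost the factorization is meant to avoid.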

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Residual Connection · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer · Contrastive Language-Image Pre-training · self-DIstillation with NO labels