Factorized Visual Tokenization and Generation
Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou

TL;DR
This paper introduces Factorized Quantization (FQ), a scalable and efficient method for visual tokenization that decomposes large codebooks into sub-codebooks with regularization and leverages pretrained models to enhance image generation.
Contribution
It proposes a novel factorization approach for VQ-based tokenizers, improving scalability, diversity, and semantic richness in visual representations.
Findings
Achieves state-of-the-art reconstruction quality.
Enhances auto-regressive image generation performance.
Reduces codebook complexity and redundancy.
Abstract
Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy,…
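The codebook factorization described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how encoder features are routed to each sub-codebook, so the sketch assumes a simple product-quantization-style split of the channel dimension, and the names `nearest_code` and `factorized_quantize`, along with all sizes, are hypothetical. It only shows why several small sub-codebooks can stand in for one large codebook at much lower lookup cost.

```python
import numpy as np


def nearest_code(x, codebook):
    """Return the index of the nearest codebook entry for each row of x.

    x: (N, d) feature vectors, codebook: (K, d) code vectors.
    """
    # Squared Euclidean distance between every feature and every code, shape (N, K).
    d2 = (x ** 2).sum(axis=1, keepdims=True) - 2.0 * x @ codebook.T + (codebook ** 2).sum(axis=1)
    return d2.argmin(axis=1)


def factorized_quantize(features, sub_codebooks):
    """Quantize each channel chunk with its own independent sub-codebook.

    ASSUMPTION: features are split evenly along the channel axis, one chunk
    per sub-codebook. The paper may route features differently.
    """
    chunks = np.split(features, len(sub_codebooks), axis=1)
    indices = [nearest_code(c, cb) for c, cb in zip(chunks, sub_codebooks)]
    quantized = np.concatenate([cb[i] for cb, i in zip(sub_codebooks, indices)], axis=1)
    # One token index per sub-codebook for every input vector.
    return np.stack(indices, axis=1), quantized


# Illustrative sizes only (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
num_sub, sub_size, sub_dim = 2, 16, 4
sub_codebooks = [rng.normal(size=(sub_size, sub_dim)) for _ in range(num_sub)]
features = rng.normal(size=(5, num_sub * sub_dim))

indices, quantized = factorized_quantize(features, sub_codebooks)

# Number of distinct index combinations versus codes actually compared per vector.
effective_vocab = sub_size ** num_sub
comparisons_per_vector = sub_size * num_sub
```

With two sub-codebooks of 16 entries each, every vector is compared against 32 codes in total, while the pair of indices can take 256 distinct values. A single flat codebook with the same number of combinations would need 256 comparisons per vector.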
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Residual Connection · Softmax · Attention Is All You Need · Multi-Head Attention · Dense Connections · Layer Normalization · Vision Transformer · Contrastive Language-Image Pre-training · self-DIstillation with NO labels
