2D Gaussians Meet Visual Tokenizer
Yiang Shi, Xiaoyang Guo, Wei Yin, Mingkai Jia, Qian Zhang, Xiaolin Hu, Wenyu Liu, Xinggang Wang

TL;DR
This paper introduces Visual Gaussian Quantization (VGQ), a novel image tokenizer that models geometric structures using 2D Gaussians, significantly improving image reconstruction quality over existing patch-based methods.
Contribution
VGQ explicitly incorporates 2D Gaussian distributions into visual tokenization, enhancing structural modeling and reconstruction fidelity beyond traditional quantization methods.
Findings
VGQ achieves an rFID score of 1.00 on ImageNet 256x256.
With a higher density of 2D Gaussians, VGQ reaches an rFID of 0.556, outperforming existing methods.
Increasing Gaussian density improves reconstruction quality.
Abstract
The image tokenizer is a critical component in autoregressive (AR) image generation, as it determines how rich and structured visual content is encoded into compact representations. Existing quantization-based tokenizers such as VQ-GAN primarily focus on appearance features like texture and color, often neglecting geometric structures due to their patch-based design. In this work, we explore how to incorporate more visual information into the tokenizer and propose Visual Gaussian Quantization (VGQ), a novel tokenizer paradigm that explicitly enhances structural modeling by integrating 2D Gaussians into traditional visual codebook quantization frameworks. Our approach addresses the inherent limitations of naive quantization methods such as VQ-GAN, which struggle to model structured visual information due to their patch-based design and emphasis on texture and color. In contrast,…
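The two ingredients the abstract names, codebook quantization and 2D Gaussians, can be illustrated with a toy sketch. This is not the VGQ architecture, which the truncated abstract does not specify. The function names (`quantize`, `gaussian_2d`, `splat`), the Gaussian parameterization (mean, per-axis scale, rotation angle, weight), and the additive accumulation are all assumptions chosen for illustration.

```python
import math


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector` (squared L2)."""
    best = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best, codebook[best]


def gaussian_2d(x, y, mean, scale, theta):
    """Unnormalized anisotropic 2D Gaussian; equals 1.0 at the mean.

    Hypothetical parameterization: `mean` = (mx, my), `scale` = (sx, sy),
    `theta` = rotation of the principal axes in radians.
    """
    dx, dy = x - mean[0], y - mean[1]
    c, s = math.cos(theta), math.sin(theta)
    # Rotate the offset into the Gaussian's principal axes.
    u = c * dx + s * dy
    v = -s * dx + c * dy
    return math.exp(-0.5 * ((u / scale[0]) ** 2 + (v / scale[1]) ** 2))


def splat(gaussians, height, width):
    """Additively accumulate weighted Gaussians onto a height x width grid."""
    image = [[0.0] * width for _ in range(height)]
    for g in gaussians:
        for row in range(height):
            for col in range(width):
                image[row][col] += g["weight"] * gaussian_2d(
                    col, row, g["mean"], g["scale"], g["theta"]
                )
    return image


# Usage: snap a feature vector to its nearest code, then render one Gaussian.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
index, entry = quantize([0.9, 0.1], codebook)

blob = {"mean": (2.0, 2.0), "scale": (1.0, 1.0), "theta": 0.0, "weight": 1.0}
image = splat([blob], height=5, width=5)
```

A patch-based tokenizer stores only something like `index` for each grid cell. The Gaussian parameters add an explicit position, extent, and orientation, which is the kind of geometric information the abstract says patch-based designs neglect.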
Taxonomy
Topics: Image Retrieval and Classification Techniques
