Scaling Image Tokenizers with Grouped Spherical Quantization

Jiangtao Wang; Zhen Qin; Yifan Zhang; Vincent Tao Hu; Björn Ommer; Rania Briq; Stefan Kesselheim

arXiv:2412.02632·cs.CV·December 5, 2024

TL;DR

This paper introduces Grouped Spherical Quantization (GSQ), an image tokenization method that improves reconstruction quality while needing fewer training iterations, and that makes high-dimensional latent spaces usable when scaling tokenizers.

Contribution

The paper proposes GSQ with spherical codebook initialization and lookup regularization, providing a new approach for scalable and high-quality image tokenization.

Findings

01

GSQ-GAN outperforms state-of-the-art methods in reconstruction quality.

02

GSQ enables efficient high-dimensional latent space representation.

03

GSQ-GAN achieves a reconstruction FID of 0.50 at 16x down-sampling.

Abstract

Vision tokenizers have gained a lot of attention due to their scalability and compactness; however, previous works depend on outdated GAN-based hyperparameters and biased comparisons, and lack a comprehensive analysis of scaling behaviours. To tackle those issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain the codebook latents to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviours of GSQ, specifically in latent dimensionality, codebook size, and compression ratios, and their impact on model performance. Our findings reveal distinct…
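
The quantization step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single codebook shared across groups, and the reading of "lookup regularization" as L2-normalizing both latents and codebook entries before nearest-neighbour lookup are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit sphere along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def init_spherical_codebook(num_codes, dim, rng):
    """Hypothetical spherical init: normalized Gaussian samples,
    which are uniformly distributed on the unit sphere."""
    return l2_normalize(rng.standard_normal((num_codes, dim)))


def grouped_spherical_quantize(z, codebook, num_groups):
    """Quantize latents group by group on the unit sphere.

    z:        (N, D) latent vectors
    codebook: (K, D // num_groups) unit-norm entries
              (shared across groups: an assumption of this sketch)
    Returns the quantized latents (N, D) and code indices (N, num_groups).
    """
    n, d = z.shape
    assert d % num_groups == 0, "latent dim must be divisible by num_groups"
    group_dim = d // num_groups
    assert codebook.shape[1] == group_dim

    # Split each latent into groups and normalize every group separately.
    groups = l2_normalize(z.reshape(n, num_groups, group_dim))

    # For unit vectors, the highest cosine similarity is also the
    # smallest Euclidean distance, so argmax is a nearest-neighbour lookup.
    sims = groups @ codebook.T            # (N, num_groups, K)
    indices = sims.argmax(axis=-1)        # (N, num_groups)

    quantized = codebook[indices].reshape(n, d)
    return quantized, indices


# Usage with arbitrary toy sizes (not taken from the paper).
rng = np.random.default_rng(0)
codebook = init_spherical_codebook(num_codes=512, dim=8, rng=rng)
z = rng.standard_normal((4, 16))
zq, idx = grouped_spherical_quantize(z, codebook, num_groups=2)
```

Splitting a 16-dimensional latent into two 8-dimensional groups means each group is matched against 8-dimensional codes, which is one way a tokenizer could keep lookups low-dimensional while the overall latent grows.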

Code & Models

Repositories

helmholtzai-fzj/flex_gen (PyTorch, official)

Taxonomy

Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques