MaskBit: Embedding-free Image Generation via Bit Tokens
Mark Weber, Lijun Yu, Qihang Yu, Xueqing Deng, Xiaohui Shen, Daniel Cremers, Liang-Chieh Chen

TL;DR
This paper introduces MaskBit, an embedding-free image generation method using bit tokens, achieving state-of-the-art results on ImageNet with a simplified model and providing a modernized VQGAN for improved image synthesis.
Contribution
The paper presents a systematic modernization of VQGANs and introduces a novel embedding-free generation network operating directly on bit tokens, advancing image synthesis techniques.
Findings
Achieved a new state-of-the-art FID of 1.52 on ImageNet 256x256.
Developed a compact generator with only 305 million parameters.
Provided a transparent and reproducible VQGAN model.
Abstract
Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed…
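The abstract describes bit tokens as a binary quantized representation of tokens. The sketch below illustrates one way such a representation could work, assuming sign-based binarization of each latent channel (in the style of lookup-free quantization). The function names, the NumPy implementation, and the bit ordering are illustrative assumptions, not the authors' code.

```python
import numpy as np


def to_bit_tokens(latents):
    """Binarize each latent channel by sign (assumed scheme): >= 0 -> +1, else -1."""
    return np.where(latents >= 0, 1.0, -1.0)


def bits_to_index(bits):
    """Pack K bits into an integer token id in [0, 2**K), least significant bit first."""
    k = bits.shape[-1]
    weights = 2 ** np.arange(k)
    return ((bits > 0).astype(np.int64) * weights).sum(axis=-1)


def index_to_bits(index, k):
    """Unpack integer token ids back into K bits valued in {-1, +1}."""
    index = np.asarray(index, dtype=np.int64)
    bit_values = (index[..., None] >> np.arange(k)) & 1
    return bit_values * 2.0 - 1.0


# One spatial position with K = 4 latent channels (hypothetical values).
latents = np.array([[0.3, -1.2, 0.7, -0.1]])
bits = to_bit_tokens(latents)    # [[ 1., -1.,  1., -1.]]
token_ids = bits_to_index(bits)  # bit pattern 1,0,1,0 (LSB first) -> 5
restored = index_to_bits(token_ids, k=4)
```

Under this assumption, K bits per token correspond to an implicit codebook of 2^K entries without storing any learned embedding table. A generator described as embedding-free could then consume the K-dimensional bit vector directly instead of looking up an embedding by token id.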
Code & Models
- 🤗 markweber/maskbit_tokenizer_10bit (model)
- 🤗 markweber/maskbit_tokenizer_12bit (model)
- 🤗 markweber/maskbit_tokenizer_14bit (model)
- 🤗 markweber/maskbit_tokenizer_16bit (model)
- 🤗 markweber/maskbit_tokenizer_18bit (model)
- 🤗 markweber/vqgan_plus_paper (model)
- 🤗 markweber/vqgan_plus_12bit (model)
- 🤗 markweber/maskbit_generator_10bit (model)
- 🤗 markweber/maskbit_generator_12bit (model)
- 🤗 markweber/maskbit_generator_14bit (model)
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Diffusion
