MaskBit: Embedding-free Image Generation via Bit Tokens

Mark Weber; Lijun Yu; Qihang Yu; Xueqing Deng; Xiaohui Shen; Daniel Cremers; Liang-Chieh Chen

arXiv:2409.16211 · cs.CV · December 10, 2024

TL;DR

This paper introduces MaskBit, an embedding-free image generator that operates directly on bit tokens. It reaches an FID of 1.52 on ImageNet 256x256 with a 305-million-parameter generator, and it also presents a modernized, reproducible VQGAN.

Contribution

The paper makes two contributions: a systematic, empirical modernization of VQGAN, and a novel embedding-free generation network that operates directly on bit tokens.

Findings

01. Achieved a new state-of-the-art FID of 1.52 on ImageNet 256x256.
02. Developed a compact generator with only 305 million parameters.
03. Provided a transparent and reproducible VQGAN model.

Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages - an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space - these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: Firstly, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN. Secondly, a novel embedding-free generation network operating directly on bit tokens - a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed…
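
The abstract describes bit tokens as a binary quantized representation of tokens. The sketch below illustrates that idea only. It assumes sign-based binarization of each latent channel, which the truncated abstract does not specify, and every function name is hypothetical rather than taken from the paper or its repository. It shows how a latent vector can become a bit pattern, how that pattern corresponds to an integer token index, and how the bits themselves could be passed to a network as +1/-1 values instead of looking up a learned embedding.

```python
from typing import List


def to_bit_token(latent: List[float]) -> List[int]:
    """Binarize a latent vector: 1 where the value is positive, else 0.

    Assumption: sign-based quantization, used here purely for illustration.
    """
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits: List[int]) -> int:
    """Read the bits as a base-2 integer (first bit is most significant)."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def index_to_bits(index: int, num_bits: int) -> List[int]:
    """Inverse of bits_to_index for a fixed number of bits."""
    return [(index >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_input(bits: List[int]) -> List[float]:
    """Map {0, 1} to {-1.0, +1.0} so the bits can serve as network input
    directly, without an embedding table (hypothetical helper)."""
    return [2.0 * b - 1.0 for b in bits]


# Toy 4-channel latent; real latents would have more channels.
latent = [0.7, -1.2, 0.1, -0.3]
bits = to_bit_token(latent)            # [1, 0, 1, 0]
index = bits_to_index(bits)            # 10
restored = index_to_bits(index, 4)     # [1, 0, 1, 0]
net_input = bits_to_input(bits)        # [1.0, -1.0, 1.0, -1.0]
```

With K bits per token, the implicit vocabulary has 2^K entries, yet the generator input stays K-dimensional because no per-index embedding is stored.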

Code & Models

Repositories

markweberdev/maskbit (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Softmax · Layer Normalization · Dropout · Diffusion