Language-Guided Image Tokenization for Generation
Kaiwen Zha, Lijun Yu, Alireza Fathi, David A. Ross, Cordelia Schmid, Dina Katabi, Xiuye Gu

TL;DR
TexTok leverages descriptive text captions to create a more compact and semantically rich image tokenization, significantly improving image reconstruction quality, compression rates, and inference speed in image generation tasks.
Contribution
The paper introduces TexTok, a novel language-conditioned image tokenizer that enhances compression and reconstruction quality by integrating text captions into the tokenization process.
Findings
Achieves average reconstruction FID improvements of 29.2% and 48.1% on the ImageNet-256 and ImageNet-512 benchmarks, respectively.
Provides up to 93.5x inference speedup with fewer tokens.
Outperforms previous methods with state-of-the-art FID scores on ImageNet.
Abstract
Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and…
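The conditioning idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes a tokenizer that concatenates image patch embeddings, a small set of latent token slots, and caption embeddings into one sequence, mixes them with a single self-attention layer, and keeps only the latent slots as the compact image tokens. All names (`tokenize`, `self_attention`, `random_matrix`), the sizes, and the random untrained weights are hypothetical and chosen for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def random_matrix(rows, cols, rng, scale=1.0):
    """Gaussian matrix standing in for learned weights or embeddings."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(x, rng):
    """Single-head self-attention with random, untrained projections."""
    dim = len(x[0])
    scale = 1.0 / math.sqrt(dim)
    q, k, v = (matmul(x, random_matrix(dim, dim, rng, scale)) for _ in range(3))
    scores = matmul(q, [list(col) for col in zip(*k)])
    weights = []
    for row in scores:
        scaled = [s * scale for s in row]
        top = max(scaled)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return matmul(weights, v)


def tokenize(patches, latents, text_emb, rng):
    """Hypothetical text-conditioned tokenizer step (assumption, not the paper's code)."""
    # Latent slots attend to both the image patches and the caption embeddings.
    seq = patches + latents + text_emb
    out = self_attention(seq, rng)
    start = len(patches)
    # Only the latent slots are kept as the compact image representation.
    return out[start:start + len(latents)]


rng = random.Random(0)
dim = 8
patches = random_matrix(64, dim, rng)   # e.g. an 8x8 grid of patch embeddings
latents = random_matrix(4, dim, rng)    # 4 latent token slots (learnable in a real model)
text_emb = random_matrix(6, dim, rng)   # caption embeddings from an assumed text encoder
tokens = tokenize(patches, latents, text_emb, rng)
print(len(tokens), len(tokens[0]))  # 4 8
```

In this sketch the 64 patch embeddings are reduced to 4 tokens, and the caption embeddings are available to every latent slot during mixing, which is the sense in which the text can carry high-level semantics while the tokens are freed for visual detail.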
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Cell Image Analysis Techniques
Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing
