Language-Guided Image Tokenization for Generation

Kaiwen Zha; Lijun Yu; Alireza Fathi; David A. Ross; Cordelia Schmid; Dina Katabi; Xiuye Gu

arXiv:2412.05796·cs.CV·April 8, 2025


Open Access

TL;DR

TexTok leverages descriptive text captions to create a more compact and semantically rich image tokenization, significantly improving image reconstruction quality, compression rates, and inference speed in image generation tasks.

Contribution

The paper introduces TexTok, a novel language-conditioned image tokenizer that enhances compression and reconstruction quality by integrating text captions into the tokenization process.

Findings

01. Achieves average reconstruction FID improvements of 29.2% and 48.1% on the ImageNet-256 and ImageNet-512 benchmarks, respectively.

02. Provides up to a 93.5x inference speedup by generating with fewer tokens.

03. Reaches state-of-the-art FID scores on ImageNet, outperforming previous methods.
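
To make the arithmetic behind these findings concrete, here is a minimal sketch. It assumes the percentage figures are relative reductions in FID (lower is better), which is the usual convention but is not stated on this page. It also shows why token count matters for speed, assuming self-attention cost grows with the square of the sequence length. The function names and all numbers are hypothetical and chosen only for illustration; the 93.5x figure above is a measured result and is not derived from this formula.

```python
def relative_improvement(baseline_fid, new_fid):
    """Percentage reduction in FID relative to the baseline (lower FID is better)."""
    if baseline_fid <= 0:
        raise ValueError("baseline_fid must be positive")
    return 100.0 * (baseline_fid - new_fid) / baseline_fid


def attention_cost_ratio(baseline_tokens, new_tokens):
    """Ratio of pairwise-attention costs, assuming cost scales with tokens squared."""
    return (baseline_tokens / new_tokens) ** 2


# Hypothetical numbers, used only to illustrate the arithmetic.
print(relative_improvement(2.0, 1.5))   # 25.0
print(attention_cost_ratio(256, 64))    # 16.0
```

Under this assumption, cutting the token count by 4x reduces the attention cost by roughly 16x, which is one reason a more compact tokenization can speed up generation. End-to-end speedups also depend on the other layers and on the sampler.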

Abstract

Image tokenization, the process of transforming raw image pixels into a compact low-dimensional latent representation, has proven crucial for scalable and efficient image generation. However, mainstream image tokenization methods generally have limited compression rates, making high-resolution image generation computationally expensive. To address this challenge, we propose to leverage language for efficient image tokenization, and we call our method Text-Conditioned Image Tokenization (TexTok). TexTok is a simple yet effective tokenization framework that leverages language to provide a compact, high-level semantic representation. By conditioning the tokenization process on descriptive text captions, TexTok simplifies semantic learning, allowing more learning capacity and token space to be allocated to capture fine-grained visual details, leading to enhanced reconstruction quality and…
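
The idea of conditioning tokenization on a caption can be sketched in a few lines. The abstract says only that the tokenizer is conditioned on descriptive text captions, so the mechanism below is an assumption: image patch embeddings, caption embeddings, and a small set of latent slots share one attention pass, and only the latent slots are kept as the compact image tokens. The names `attention`, `tokenize`, and `random_vectors` are hypothetical, the embeddings are random stand-ins, and this is not the authors' architecture.

```python
import math
import random


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(dim)
        ])
    return outputs


def tokenize(patch_embeddings, text_embeddings, latent_slots):
    """Hypothetical text-conditioned tokenizer step (an assumption, see above).

    The three inputs are concatenated so the latent slots can attend to both
    the image patches and the caption. Only the latent positions are returned.
    """
    sequence = patch_embeddings + text_embeddings + latent_slots
    mixed = attention(sequence, sequence, sequence)
    return mixed[-len(latent_slots):]


def random_vectors(count, dim, rng):
    """Random stand-ins for learned embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
dim = 8
patches = random_vectors(16, dim, rng)  # stand-in for 16 image patch embeddings
caption = random_vectors(4, dim, rng)   # stand-in for 4 caption token embeddings
latents = random_vectors(2, dim, rng)   # 2 latent slots give 2 image tokens

image_tokens = tokenize(patches, caption, latents)
print(len(image_tokens), len(image_tokens[0]))  # prints "2 8"
```

In this sketch the output length is set by the number of latent slots, not by the number of patches, which is what makes the representation compact. Because the caption is part of the attended sequence, it can supply high-level semantics that the latent slots would otherwise have to encode themselves.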


Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization · Cell Image Analysis Techniques

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing