QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Yue Zhao; Fuzhao Xue; Scott Reed; Linxi Fan; Yuke Zhu; Jan Kautz; Zhiding Yu; Philipp Krähenbühl; De-An Huang

arXiv:2502.05178·cs.CV·February 10, 2025

Open Access 6 Models

TL;DR

QLIP introduces a novel visual tokenization method that unifies multimodal understanding and generation, achieving high-quality reconstruction and zero-shot image understanding with a single model.

Contribution

It presents a new quantized visual tokenizer that balances reconstruction and language-image alignment, enabling unified multimodal understanding and generation.

Findings

01

QLIP achieves state-of-the-art zero-shot image understanding.

02

As a drop-in replacement for the LLaVA visual encoder and the LlamaGen image tokenizer, QLIP performs comparably or better.

03

QLIP enables a unified model for multimodal understanding and generation.

Abstract

We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with…
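The abstract names two ingredients: a binary-spherical-quantization (BSQ) bottleneck and a weighted combination of a reconstruction loss and a language-image alignment loss. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes the usual BSQ formulation (project a latent vector onto the unit sphere, then snap each coordinate to ±1/√d and read the sign bits as a token index). The function names `bsq_quantize` and `combined_loss`, the bit ordering, the handling of zero coordinates, and the fixed weights are all hypothetical; the abstract says the weights are balanced dynamically but does not specify how, so they are left as plain inputs here.

```python
import math


def bsq_quantize(v):
    """Quantize one latent vector with binary spherical quantization (sketch).

    Returns the quantized vector and an integer token index built from
    the sign bits. Bit order (dimension k -> bit k) is an assumption.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the unit sphere")
    u = [x / norm for x in v]                 # project onto the unit sphere
    scale = 1.0 / math.sqrt(len(v))           # keeps the quantized vector unit-norm
    bits = [1 if x > 0 else 0 for x in u]     # zero maps to bit 0 here (arbitrary choice)
    q = [scale if b else -scale for b in bits]
    index = sum(b << k for k, b in enumerate(bits))
    return q, index


def combined_loss(recon_loss, align_loss, w_recon, w_align):
    """Weighted sum of the two objectives; the weights are hypothetical inputs."""
    return w_recon * recon_loss + w_align * align_loss


q, idx = bsq_quantize([0.3, -1.2, 0.5, -0.1])
# q == [0.5, -0.5, 0.5, -0.5], idx == 5 (bits 1,0,1,0 read least-significant first)
total = combined_loss(2.0, 4.0, w_recon=0.5, w_align=0.25)
```

With d latent dimensions this gives an implicit codebook of 2^d entries without storing any codebook vectors, which is why the token index can be read directly from the sign pattern.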

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems