QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang

TL;DR
QLIP introduces a novel visual tokenization method that unifies multimodal understanding and generation, achieving high-quality reconstruction and zero-shot image understanding with a single model.
Contribution
It presents a new quantized visual tokenizer that balances reconstruction and language-image alignment, enabling unified multimodal understanding and generation.
Findings
QLIP achieves state-of-the-art zero-shot image understanding.
QLIP performs comparably to or better than the originals when used as a drop-in visual encoder for LLaVA and image tokenizer for LlamaGen.
QLIP enables a unified model for multimodal understanding and generation.
Abstract
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively mixes the large-batch requirements of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder for LLaVA and the image tokenizer for LlamaGen with…
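For readers unfamiliar with binary spherical quantization, the quantization step named in the abstract can be sketched roughly as follows: a latent vector is projected onto the unit hypersphere, each dimension is reduced to its sign, and the sign pattern is read off as an integer token. This is a minimal sketch under stated assumptions, not QLIP's code. The function name `bsq_quantize` and the 4-dimensional example vector are hypothetical, and the encoder, decoder, straight-through gradient, and both loss terms are left out.

```python
import math

def bsq_quantize(v):
    """Quantize one latent vector (a list of floats) to a unit-norm binary code.

    Illustrative helper only; the name and interface are assumptions.
    """
    d = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot project the zero vector onto the sphere")
    u = [x / norm for x in v]               # project onto the unit hypersphere
    bits = [1 if x > 0 else 0 for x in u]   # keep one sign bit per dimension
    scale = 1.0 / math.sqrt(d)              # makes the binary code unit-norm
    u_hat = [scale if b else -scale for b in bits]
    token = 0
    for b in bits:                          # pack the bits into an integer id
        token = (token << 1) | b
    return u_hat, token

u_hat, token = bsq_quantize([0.3, -1.2, 0.5, -0.1])
print(u_hat, token)
```

With `d` latent dimensions this yields an implicit codebook of `2**d` entries without storing any codebook vectors, which is the usual motivation for this style of quantizer.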
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
