Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
Xiaoyuan Guo, Jiali Duan, C.-C. Jay Kuo, Judy Wawira Gichoya, Imon Banerjee

TL;DR
This paper introduces a method to discretize visual representations by learning a semantic codebook, aligning visual and language modalities for improved vision-language pretraining.
Contribution
It proposes a joint learning approach for a visual codebook to enhance modality alignment in vision-language models, extending VQ-VAE with theoretical guarantees.
Findings
Improved performance on vision-language benchmarks.
Effective discretization of visual features.
Enhanced modality alignment and fusion.
Abstract
The language modality within the vision-language pretraining framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially prohibits the alignment as well as the fusion of the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then utilize these discretized visual semantics as self-supervised ground truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proved successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
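The discretization idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: it only shows the standard VQ-VAE-style nearest-neighbor lookup that turns continuous patch features into codebook indices, and how those indices could serve as classification targets at masked positions. The function names (`quantize`, `masked_targets`, `cross_entropy`), the toy 2-D codebook, and the mask are all hypothetical, and the joint codebook learning described by the authors is not shown.

```python
import math

def quantize(features, codebook):
    """Assign each feature vector the index of its nearest codebook entry
    (squared Euclidean distance), as in a VQ-VAE-style lookup."""
    indices = []
    for vec in features:
        dists = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices

def masked_targets(indices, mask):
    """Keep the discrete codes only at masked positions; these act as the
    self-supervised labels a model would have to predict."""
    return {pos: idx for pos, (idx, m) in enumerate(zip(indices, mask)) if m}

def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one prediction over the codebook."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]

# Toy setup (assumed values): 3 codebook entries, 4 patch features, 2-D each.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
patches = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.8], [1.1, 0.8]]

codes = quantize(patches, codebook)
mask = [False, True, False, True]      # positions hidden from the model
targets = masked_targets(codes, mask)

# A model would output one logit per codebook entry at each masked position;
# uniform logits are used here only to exercise the loss function.
loss = sum(cross_entropy([0.0, 0.0, 0.0], t) for t in targets.values())
```

In this sketch the masked-position loss has the same form as Masked Language Modeling, with codebook indices playing the role of word IDs, which is the parallel the abstract draws.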
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: VQ-VAE
