Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics

Xiaoyuan Guo; Jiali Duan; C.-C. Jay Kuo; Judy Wawira Gichoya; Imon Banerjee

arXiv:2208.00475·cs.CV·August 2, 2022

TL;DR

This paper introduces a method to discretize visual representations by learning a semantic codebook, aligning visual and language modalities for improved vision-language pretraining.

Contribution

It proposes a joint learning approach for a visual codebook to enhance modality alignment in vision-language models, extending VQ-VAE with theoretical guarantees.

Findings

01. Improved performance on vision-language benchmarks.

02. Effective discretization of visual features.

03. Enhanced modality alignment and fusion.

Abstract

The language modality within the vision-language pretraining framework is innately discretized, endowing each word in the language vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially hinders both alignment and fusion between the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then use these discretized visual semantics as self-supervised ground truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proved successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
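The two ideas in the abstract, discretizing visual features with a codebook and using the resulting token ids as Masked Image Modeling targets, can be illustrated with a minimal sketch. It assumes plain nearest-neighbor lookup as in standard VQ-VAE, a fixed toy codebook, and uniform random masking. The paper's joint codebook learning and its extended VQ-VAE objective are not reproduced here, and every name below (`quantize`, `build_mim_targets`, the toy values) is hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each continuous patch feature to the index of its nearest code.

    This is the nearest-neighbor lookup of standard VQ-VAE; here the
    codebook is fixed, whereas the paper learns it during pretraining.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def build_mim_targets(token_ids, mask_ratio, rng):
    """Choose positions to mask and return their discrete ids as targets.

    A model would be trained to predict targets[p] at each masked
    position p, mirroring Masked Language Modeling on word tokens.
    Uniform random masking is an assumption made for illustration.
    """
    n_mask = max(1, round(mask_ratio * len(token_ids)))
    positions = sorted(rng.sample(range(len(token_ids)), n_mask))
    targets = {p: token_ids[p] for p in positions}
    return positions, targets


# Toy data: three 2-D codes and four 2-D patch features (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patch_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [1.1, 0.9]]

token_ids = quantize(patch_features, codebook)
positions, targets = build_mim_targets(token_ids, 0.5, random.Random(0))

print("visual token ids:", token_ids)
print("masked positions:", positions)
print("prediction targets:", targets)
```

In this sketch the discrete ids play the role that vocabulary indices play for text: each masked patch has a single categorical label, so a masked-prediction loss can be a cross-entropy over codebook entries.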

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: VQ-VAE