Unifying Vision-Language Representation Space with Single-tower Transformer

Jiho Jang; Chaerin Kong; Donghyeon Jeon; Seonhoon Kim; Nojun Kwak

arXiv:2211.11153 · cs.LG · November 22, 2022 · 1 cites

TL;DR

This paper proposes a unified vision-language model using a single-tower transformer trained with contrastive learning, enabling modality-agnostic representations that improve multi-modal understanding tasks.

Contribution

It introduces OneR, a simple framework for learning a unified vision-language space with a single-tower transformer, addressing challenges of modality-agnostic pretraining.

Findings

01 Effective in zero-shot object localization

02 Enhances text-guided visual reasoning

03 Improves multi-modal retrieval performance

Abstract

Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this paper, we explore the bold hypothesis that an image and its caption can be simply regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a generic one-tower model for vision-language pretraining (VLP), and propose OneR as a simple yet effective framework for our goal. We discover intriguing properties, such as zero-shot object localization, text-guided visual reasoning and multi-modal retrieval, that distinguish OneR from previous works that learn modality-specific representation spaces, and present analyses to provide insights into this new form of…
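
The idea of treating an image and its caption as two views of the same content can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is only a minimal sketch and not necessarily the objective OneR uses, since the abstract does not specify it. The function names, the temperature of 0.07, and the toy embeddings are all assumptions made for illustration. In a single-tower setting, both sets of embeddings would come from the same encoder; here they are plain lists of floats.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Generic symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    The temperature value is an illustrative assumption.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # Cosine similarity between every image and every caption, scaled.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy check: matched pairs should give a lower loss than mismatched pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_info_nce(aligned, aligned)
loss_swapped = symmetric_info_nce(aligned, swapped)
```

In this toy batch the loss is close to zero when each image embedding matches its own caption embedding and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.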

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: One Representation