Dense Contrastive Visual-Linguistic Pretraining

Lei Shi; Kai Shuang; Shijie Geng; Peng Gao; Zuohui Fu; Gerard de Melo; Yunpeng Chen; Sen Su

arXiv:2109.11778 · cs.CV · September 27, 2021

TL;DR

This paper introduces DCVLP, a self-supervised multimodal pretraining method that uses dense contrastive learning to improve visual-linguistic representations without relying on noisy annotations.

Contribution

It proposes a novel dense contrastive learning framework for multimodal pretraining that eliminates the need for semantic annotations and improves representation quality.

Findings

01. Outperforms prior methods on multimodal tasks

02. Avoids the noisy labels and sparse annotations that affect region regression and classification pretext tasks

03. Improves cross-modal understanding of dense image regions

Abstract

Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from the problems of noisy labels and sparse semantic annotations, based on the visual features having been pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask…
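To make the idea of region-level contrastive learning concrete, here is a minimal sketch of an InfoNCE-style loss over region features. The abstract does not give the exact objective, so this is a generic contrastive loss and not necessarily the one DCVLP uses. The function name `region_info_nce`, the temperature value, and the pairing convention (row i of `queries` matches row i of `keys`, and all other rows act as negatives) are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length; zero vectors are returned unchanged.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def region_info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss over regions (hypothetical helper).

    queries[i] and keys[i] are treated as the positive pair for region i;
    every other key serves as a negative. No labels are involved.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    losses = []
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy region features: two orthogonal 2-d vectors.
regions = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = region_info_nce(regions, regions)        # positives aligned
loss_swapped = region_info_nce(regions, regions[::-1])  # positives misaligned
```

With aligned pairs the loss is close to zero, and it grows when each query is paired with the wrong key. This is the signal that lets region features be trained without region labels or regression targets.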

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Contrastive Learning · Learning Cross-Modality Encoder Representations from Transformers · UNiversal Image-TExt Representation Learning · Attention Dropout · Weight Decay · Layer Normalization · Linear Warmup With Linear Decay