TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data

Pengda Qin; Yuhong Li; Kefeng Deng; Qiang Wu

arXiv:2106.01797 · cs.CL · June 15, 2021 · 1 citation

TL;DR

TVDIM is a self-supervised learning method that leverages noisy text data alongside images to improve visual representations, outperforming previous methods in image classification tasks.

Contribution

The paper introduces a novel mutual information maximization approach using multimodal data, incorporating ranking-based contrastive learning to handle noisy text-image pairs.
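
As a rough illustration of what a ranking-based objective over image-text pairs can look like, here is a minimal sketch of a generic margin (hinge) ranking loss. It is not the paper's formulation, which this page does not give. The function name `margin_ranking_loss`, the `margin` value, and the toy similarity scores are all assumptions made for the example. One commonly cited property of such losses is that they drop to zero once a paired text outranks the unpaired ones by the margin, so a loosely matching caption is not forced to be maximally similar to its image.

```python
def margin_ranking_loss(scores, margin=0.2):
    """Generic hinge ranking loss (illustrative, not the paper's objective).

    scores[i][j] is the similarity between image i and text j;
    the diagonal entries scores[i][i] are the (possibly noisy) paired texts.
    Requires at least two image-text pairs.
    """
    n = len(scores)
    total = 0.0
    count = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Penalize only when an unpaired text comes within `margin`
            # of the paired text's score for image i.
            total += max(0.0, margin - scores[i][i] + scores[i][j])
            count += 1
    return total / count


# Hypothetical similarity matrices for two image-text pairs.
clean_scores = [[0.9, 0.1], [0.2, 0.8]]    # paired texts clearly rank first
noisy_scores = [[0.5, 0.45], [0.2, 0.8]]   # first caption barely outranks the other

clean_loss = margin_ranking_loss(clean_scores)
noisy_loss = margin_ranking_loss(noisy_scores)
```

In this toy setup the clean matrix incurs no loss because every paired text already wins by more than the margin, while the noisy matrix incurs a small penalty from the first image only.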

Findings

01. Significantly outperforms previous self-supervised methods on image classification.

02. Effectively utilizes multimodal data for better visual feature learning.

03. Demonstrates robustness to noisy text data in visual representation learning.

Abstract

Among the ubiquitous multimodal data in the real world, text is the modality generated by humans, while images reflect the physical world faithfully. In a visual understanding application, machines are expected to understand images as humans do. Inspired by this, we propose a novel self-supervised learning method, named Text-enhanced Visual Deep InfoMax (TVDIM), to learn better visual representations by fully utilizing naturally existing multimodal data. The core idea of our self-supervised learning is to maximize, to a rational degree, the mutual information between features extracted from multiple views of a shared context. Unlike previous methods, which only consider multiple views from a single modality, our work produces multiple views from different modalities and jointly optimizes the mutual information for intra-modality and inter-modality feature pairs. Considering the…
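
The idea of jointly optimizing intra-modality and inter-modality feature pairs can be sketched with InfoNCE, a widely used contrastive lower bound on mutual information. This is a hedged illustration only: the abstract shown here is truncated and does not state the exact objective, so the use of InfoNCE, the `temperature` value, the `inter_weight` parameter, and the toy feature vectors are all assumptions, and the code is not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: positives[i] is the match for anchors[i];
    every other row of `positives` acts as a negative."""
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def joint_loss(img_view1, img_view2, txt, inter_weight=1.0):
    """Intra-modality (image view vs. image view) plus
    inter-modality (image view vs. text) contrastive terms."""
    intra = info_nce(img_view1, img_view2)
    inter = info_nce(img_view1, txt)
    return intra + inter_weight * inter


# Hypothetical 2-D features for two samples.
img_view1 = [[1.0, 0.0], [0.0, 1.0]]
img_view2 = [[0.9, 0.1], [0.1, 0.9]]
txt = [[0.8, 0.2], [0.2, 0.8]]

aligned = joint_loss(img_view1, img_view2, txt)
shuffled = joint_loss(img_view1, img_view2[::-1], txt[::-1])
```

With these toy features, the correctly aligned views give a lower loss than the deliberately mismatched ones, which is the behavior a mutual-information-maximizing objective is meant to reward.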

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning