TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data
Pengda Qin, Yuhong Li, Kefeng Deng, Qiang Wu

TL;DR
TVDIM is a self-supervised learning method that leverages noisy text data alongside images to improve visual representations, outperforming previous methods in image classification tasks.
Contribution
The paper introduces a novel mutual information maximization approach using multimodal data, incorporating ranking-based contrastive learning to handle noisy text-image pairs.
Findings
Significantly outperforms previous self-supervised methods on image classification.
Effectively utilizes multimodal data for better visual feature learning.
Demonstrates robustness to noisy text data in visual representation learning.
Abstract
Among the ubiquitous multimodal data in the real world, text is the modality generated by humans, while images reflect the physical world faithfully. In visual understanding applications, machines are expected to understand images as humans do. Inspired by this, we propose a novel self-supervised learning method, named Text-enhanced Visual Deep InfoMax (TVDIM), to learn better visual representations by fully utilizing naturally existing multimodal data. The core idea of our self-supervised learning is to maximize, to a rational degree, the mutual information between features extracted from multiple views of a shared context. Unlike previous methods, which only consider multiple views from a single modality, our work produces multiple views from different modalities and jointly optimizes the mutual information for both intra-modality and inter-modality feature pairs. Considering the…
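The joint intra-modality and inter-modality objective described in the abstract can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive bound as the mutual information estimator, which is a common choice but is not confirmed by this excerpt, and it does not attempt the paper's ranking-based handling of noisy text. The names `info_nce`, `joint_loss`, `temperature`, and `inter_weight` are hypothetical, and the feature vectors are toy stand-ins for encoder outputs.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss, where candidates[i] is the positive for anchors[i].

    Minimizing this loss maximizes a lower bound on mutual information:
    I(anchor; candidate) >= log(N) - loss, with N the batch size.
    """
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def joint_loss(img_view1, img_view2, txt, inter_weight=1.0):
    """Intra-modality (image vs. image) plus inter-modality (image vs. text) term.

    inter_weight is a hypothetical knob, not a parameter from the paper.
    """
    intra = info_nce(img_view1, img_view2)
    inter = info_nce(img_view1, txt)
    return intra + inter_weight * inter


if __name__ == "__main__":
    # Toy features: three samples, two image views and one text view each.
    view1 = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
    view2 = [[0.9, 0.0, 0.1], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
    text = [[0.8, 0.2, 0.0], [0.0, 0.7, 0.3], [0.2, 0.0, 0.8]]
    print(joint_loss(view1, view2, text))
```

In this sketch, every other sample in the batch acts as a negative, so a mismatched text description is penalized exactly like any other negative. A method designed for noisy image-text pairs would need a softer treatment of such pairs than this plain formulation provides.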
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning
