A survey on Self Supervised learning approaches for improving Multimodal representation learning
Naman Goyal

TL;DR
This survey reviews various self-supervised learning methods for multimodal representation learning, highlighting approaches like cross-modal generation, pretraining, cyclic translation, and unimodal label generation to enhance multimodal models.
Contribution
It provides a comprehensive overview of the state-of-the-art self-supervised techniques specifically tailored for multimodal learning, aggregating diverse methods from recent literature.
Findings
Identifies key self-supervised approaches for multimodal learning
Highlights the effectiveness of cross-modal pretraining and generation
Provides a structured taxonomy of methods
Abstract
Self-supervised learning has recently seen explosive growth and is now used in a variety of machine learning tasks because it avoids the cost of annotating large-scale datasets. This paper gives an overview of the best self-supervised learning approaches for multimodal learning. The presented approaches were aggregated through an extensive study of the literature and apply self-supervised learning in different ways. The approaches discussed are cross-modal generation, cross-modal pretraining, cyclic translation, and generating unimodal labels in a self-supervised fashion.
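Of the approaches named in the abstract, cyclic translation is perhaps the easiest to illustrate in a few lines. The sketch below is an assumption about the general idea, not the survey's own formulation: features from one modality are translated to another and back again, and the round trip is penalised when it fails to recover the input. The linear "translators", the toy feature vectors, and the function names are all hypothetical.

```python
# Hypothetical illustration of a cyclic translation objective.
# Real systems would use learned neural translators; plain linear maps
# are used here only to keep the example self-contained.

def translate(weights, vector):
    """Apply a linear map (given as a list of rows) to a feature vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cyclic_translation_loss(src, tgt, forward, backward):
    """Forward loss (src -> tgt) plus cycle loss (src -> tgt -> src)."""
    predicted_tgt = translate(forward, src)
    reconstructed_src = translate(backward, predicted_tgt)
    return mse(predicted_tgt, tgt) + mse(reconstructed_src, src)

# Toy 2-dimensional features, made up for illustration.
text_features = [1.0, 2.0]
audio_features = [2.0, 4.0]
forward = [[2.0, 0.0], [0.0, 2.0]]    # assumed text -> audio map
backward = [[0.5, 0.0], [0.0, 0.5]]   # assumed audio -> text map

loss = cyclic_translation_loss(text_features, audio_features, forward, backward)
print(loss)
```

No labels appear anywhere in the objective: the paired modalities supervise each other, which is what makes the setup self-supervised. With the maps chosen above the round trip is exact, so the loss is zero; replacing either map with the identity makes it positive.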
Taxonomy
Topics: Text and Document Classification Technologies · Speech and dialogue systems
