TL;DR
This paper introduces a unified framework for multimodal self-supervised learning, evaluates it on toy multimodal MNIST datasets and a multimodal neuroimaging dataset of Alzheimer's disease patients, and highlights the importance of contrastive objectives and shared representations.
Contribution
It unifies multimodal self-supervised learning methods into a single taxonomy and evaluates their performance, revealing key factors for effective multimodal representation learning.
Findings
Multimodal contrastive learning outperforms unimodal methods.
The composition of contrastive objectives critically affects downstream performance.
Maximizing similarity between representations regularizes models and uncovers modality relations.
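To make the findings above concrete, here is a minimal sketch of one common cross-modal contrastive objective, a symmetric InfoNCE loss between the embeddings of two modalities. It is an illustration only and not necessarily one of the objectives in the paper's taxonomy: the function names, the temperature value, and the synthetic "modalities" built from a shared latent plus noise are all assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def cross_modal_info_nce(z_a, z_b, temperature=0.1):
    # Hypothetical helper: row i of z_a and row i of z_b are embeddings of the
    # same sample from two modalities (the positive pair); every other row in
    # the batch serves as a negative.
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # modality a -> b
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # modality b -> a
    return 0.5 * (loss_ab + loss_ba)

# Synthetic stand-in for two modalities: a shared latent plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
z_a = shared + 0.05 * rng.normal(size=shared.shape)
z_b = shared + 0.05 * rng.normal(size=shared.shape)

aligned = cross_modal_info_nce(z_a, z_b)         # correct pairing
shuffled = cross_modal_info_nce(z_a, z_b[::-1])  # every pair mismatched
print(f"aligned: {aligned:.3f}  shuffled: {shuffled:.3f}")
```

The loss is lower when matching rows come from the same underlying sample, which is the sense in which maximizing cross-modal similarity pulls the two representations toward shared information.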
Abstract
Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors. Similarly, research studies often collect multimodal imaging data, each of which can provide shared and unique information. This observation motivated the design of powerful multimodal self-supervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We first evaluate models on toy multimodal MNIST datasets and then apply them to a multimodal neuroimaging dataset with Alzheimer's disease patients. We find that (1) multimodal contrastive learning has significant benefits over…
Taxonomy
Methods: Contrastive Learning
