Preserving Modality Structure Improves Multi-Modal Learning
Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

TL;DR
This paper introduces a multi-modal learning method that preserves modality-specific semantic structure by learning multiple anchors, improving generalization to out-of-domain data and achieving state-of-the-art results on the MSR-VTT and YouCook2 datasets.
Contribution
It proposes a Semantic-Structure-Preserving Consistency approach with a Multi-Assignment Sinkhorn-Knopp algorithm to enhance multi-modal embeddings.
Findings
Achieves state-of-the-art performance on MSR-VTT and YouCook2 datasets.
Improves generalization to out-of-domain data.
Learns semantically meaningful anchors in a self-supervised manner.
Abstract
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each…
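The anchor-assignment idea mentioned above can be illustrated with the classic Sinkhorn-Knopp normalization, which turns a sample-to-anchor similarity matrix into a balanced soft assignment. The sketch below is a generic NumPy version, not the paper's Multi-Assignment Sinkhorn-Knopp algorithm, whose details are not given in this summary. The function names `sinkhorn_knopp` and `top_k_anchors`, the temperature `epsilon`, the iteration count, and the top-k step used to pick several anchors per sample are all assumptions made for illustration.

```python
import numpy as np


def sinkhorn_knopp(scores, n_iters=50, epsilon=0.05):
    """Balanced soft assignment of samples to anchors (generic sketch).

    scores: (n_samples, n_anchors) similarity matrix, e.g. cosine similarities.
    Returns a non-negative matrix of the same shape whose rows each sum to 1.
    """
    scores = np.asarray(scores, dtype=float)
    n_samples, n_anchors = scores.shape
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    for _ in range(n_iters):
        # Column step: give every anchor the same total mass (1 / n_anchors).
        q /= q.sum(axis=0, keepdims=True)
        q /= n_anchors
        # Row step: give every sample the same total mass (1 / n_samples).
        q /= q.sum(axis=1, keepdims=True)
        q /= n_samples
    # Rescale so that each row is a distribution over anchors.
    return q * n_samples


def top_k_anchors(assignments, k=2):
    """Hypothetical multi-assignment step: indices of the k highest-weight anchors per sample."""
    return np.argsort(-assignments, axis=1)[:, :k]
```

For example, with L2-normalized sample embeddings `emb` of shape `(n, d)` and anchors `anchors` of shape `(m, d)`, `sinkhorn_knopp(emb @ anchors.T)` gives one row of anchor weights per sample, and `top_k_anchors` then selects several anchors for each sample rather than a single one.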
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
