Preserving Modality Structure Improves Multi-Modal Learning

Swetha Sirnam; Mamshad Nayeem Rizve; Nina Shvetsova; Hilde Kuehne; Mubarak Shah

arXiv:2308.13077·cs.CV·August 28, 2023

TL;DR

This paper introduces a novel method for multi-modal learning that preserves semantic structure by learning multiple anchors, improving generalization and achieving state-of-the-art results on benchmark datasets.

Contribution

It proposes a Semantic-Structure-Preserving Consistency approach with a Multi-Assignment Sinkhorn-Knopp algorithm to enhance multi-modal embeddings.

Findings

01. Achieves state-of-the-art performance on the MSR-VTT and YouCook2 datasets.

02. Improves generalization to out-of-domain data.

03. Learns semantically meaningful anchors in a self-supervised manner.

Abstract

Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each…
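
To make the anchor idea in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes cosine similarities between sample embeddings and a set of anchors, runs the standard Sinkhorn-Knopp normalization to obtain a balanced soft assignment, and then keeps the top-k anchors per sample as a simple stand-in for multi-assignment. The paper's Multi-Assignment Sinkhorn-Knopp algorithm is not described in the text above, so the function names (`sinkhorn_knopp`, `top_k_anchors`), the temperature `epsilon`, and the top-k rule are all illustrative assumptions.

```python
import numpy as np


def sinkhorn_knopp(scores, n_iters=100, epsilon=0.5):
    """Standard Sinkhorn-Knopp normalization (hypothetical helper).

    scores: (N, K) array of sample-to-anchor similarities.
    Returns a soft assignment Q of shape (N, K) whose rows each sum to 1
    and whose columns each sum to approximately N / K, so that anchors
    are used evenly across the batch.
    """
    n, k = scores.shape
    # Subtract the global max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    for _ in range(n_iters):
        # Rescale columns so every anchor receives the same total mass.
        q *= (n / k) / q.sum(axis=0, keepdims=True)
        # Rescale rows so every sample distributes a total mass of 1.
        q *= 1.0 / q.sum(axis=1, keepdims=True)
    return q


def top_k_anchors(q, k=2):
    """Assumed multi-assignment rule: keep the k highest-mass anchors per sample."""
    return np.argsort(-q, axis=1)[:, :k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy unit-norm embeddings and anchors; in practice these would be learned.
    emb = rng.normal(size=(12, 8))
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)
    anchors = rng.normal(size=(4, 8))
    anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

    q = sinkhorn_knopp(emb @ anchors.T)
    print(top_k_anchors(q, k=2))
```

Under these assumptions, each sample ends up described by a small set of anchors rather than a single cluster, which is one way to express the "multifaceted relationship" the abstract refers to. Computing such assignments in a modality-specific space and in the joint space, and then penalizing disagreement between them, would be one possible form of a structure-preserving consistency term.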

Code & Models

Repositories

swetha5/multi_sinkhorn_knopp
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition